
WO2008017103A1 - Optimisation of a scoring function - Google Patents


Info

Publication number
WO2008017103A1
WO2008017103A1 PCT/AU2007/001047 AU2007001047W
Authority
WO
WIPO (PCT)
Prior art keywords
ordered list
scoring function
items
loss
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/AU2007/001047
Other languages
French (fr)
Inventor
Alex Smola
Quoc Le
Tiberio Caetano
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Data61
Original Assignee
National ICT Australia Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from AU2006904347A external-priority patent/AU2006904347A0/en
Application filed by National ICT Australia Ltd filed Critical National ICT Australia Ltd
Publication of WO2008017103A1 publication Critical patent/WO2008017103A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9532Query formulation

Definitions

  • the scoring function f(d,q) may be determined by the inner product between a weight vector w and a feature map φ(d, q) depending on a document d and a query q.
  • the scoring function f(d,q) may be a member of a set of function classes such as neural networks, decision trees, or rule based systems.
  • the scoring function f(d,q) may be obtained implicitly by a position dependent inner product between a weight vector w and a feature map φ_i(d, q) depending on a document d, a query q and a position i.
  • the scoring function f assigns a relevancy score to every document d from a set of documents D that matches a query q 14.
  • the documents can then be ranked according to their relevancy score to give the permutation π(D, d, f).
  • the aim is to have the ranked order produced by the scoring function to be as similar to a predetermined ranked order of the set of documents D as described in the training data.
  • NDCG@k is irreducibly multivariate, that is, the performance measure depends on the scores of all the documents d - in this case the scores of documents d_1 to d_k.
  • Using NDCG@k we produce a performance measure that is between 0 and 1 (1 being the optimal solution) for every query (and therefore every permutation π) in the training data.
  • the aim is to have every performance measure as close to 1 as possible. That is, we seek to minimise the difference between the performance measure and the optimum solution (the loss Δ(π, y)). We do this by optimising the loss defined by Δ(π, y) := a(1)^T b(y) - a(π)^T b(y).
  • This formula is a generalisation of all the performance measures in the form of an inner product of a permuted list (which in this case is a( ⁇ ) - which represents the ranked list according to the ranking function) and a fixed list (which in this case is b(y) - which represents the ranked list and corresponding relevancy scores according to the training set).
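As a concrete sketch of such a measure, NDCG@k can be computed as below. The logarithmic position discount is one common convention, and the sample relevance labels are illustrative assumptions, not values from this document:

```python
import numpy as np

def dcg_at_k(relevance_in_ranked_order, k):
    """DCG@k with a logarithmic position discount (one common convention)."""
    rel = np.asarray(relevance_in_ranked_order, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))  # 1 / log2(position + 1)
    return float(np.sum(rel * discounts))

def ndcg_at_k(relevance_in_ranked_order, k):
    """Normalise by the DCG of the ideal (decreasing-relevance) ordering."""
    ideal = sorted(relevance_in_ranked_order, reverse=True)
    best = dcg_at_k(ideal, k)
    return dcg_at_k(relevance_in_ranked_order, k) / best if best > 0 else 0.0

# Hypothetical relevance labels, listed in the order the scoring function ranked them.
ranked_rel = [3, 2, 3, 0, 1, 2]
print(round(ndcg_at_k(ranked_rel, k=6), 4))
```

A perfect ranking (relevance already in decreasing order) scores exactly 1.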
  • Optimisation of the scoring function f comprises solving equation (I). In the process we repeatedly need to find permutations which violate the constraints in equation (Ib) the most. These permutations are found by solving a linear assignment problem (i.e. by the Hungarian Marriage method) 14b.
  • the coefficients of the linear assignment problem are given by the scoring function f and the loss Δ(π, y), which determine the value of placing document i at position j 14a.
  • Φ(D, q, π) = Σ_i c_i φ(d_π(i), q), where c is a fixed, monotonically decreasing weight vector.
  • MRR Mean Reciprocal Rank
  • the reciprocal rank is the inverse of the rank assigned to document d_1, the most relevant document.
  • MRR derives its name from the fact that this quantity is typically averaged over several queries (hence the mean).
  • Precision@k has no decay factor. This corresponds to weighing the top k answers equally.
  • Expected Rank Utility: another measure, called Expected Rank Utility (ERU), imposes an exponential decay on the top ranked items. It can be represented as ERU(π, y) = Σ_i 2^((1-i)/(α-1)) max(y_π(i) - d, 0), where α is a half-life parameter and d is a neutral vote value.
  • the normalised ERU can also be defined in a similar manner to NDCG.
  • the (Normalised) ERU is often used in collaborative filtering for recommender systems where the lists of items are often very short.
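A hedged sketch of ERU and its normalised variant, assuming the common collaborative-filtering parameterisation with a half-life alpha and a neutral vote d (both parameter names are assumptions, not taken from this document):

```python
import numpy as np

def expected_rank_utility(relevance_in_ranked_order, alpha=5.0, d=0.0):
    """ERU sketch: exponential position decay 2^((1 - i) / (alpha - 1)) times
    the above-neutral utility max(rel - d, 0) of the item at position i."""
    rel = np.asarray(relevance_in_ranked_order, dtype=float)
    positions = np.arange(1, rel.size + 1)
    decay = 2.0 ** ((1.0 - positions) / (alpha - 1.0))
    return float(np.sum(decay * np.maximum(rel - d, 0.0)))

def normalised_eru(relevance_in_ranked_order, alpha=5.0, d=0.0):
    """Divide by the ERU of the ideal (decreasing-relevance) ordering, as with NDCG."""
    best = expected_rank_utility(sorted(relevance_in_ranked_order, reverse=True), alpha, d)
    return expected_rank_utility(relevance_in_ranked_order, alpha, d) / best if best > 0 else 0.0
```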
  • This calculation is shown in Fig. 4, where a generally relates to the importance we assign to a document's position in the ranked order and b relates to the true relevance value of the document as defined in the training data.
  • a final performance score is then calculated according to Fig. 5. The closer this score value is to 1, the better the ranking order of Fig. 3.
  • NDCG@k better models a person's judgment of a search engine: the results on the first page are most important, and there should be a decay factor between them.
  • NDCG has the further advantage of being more structured and more general than WTA and MRR. Meanwhile, for collaborative and content filtering, ERU is often preferred.
  • Optimisation is concerned with a minimisation of loss.
  • Typical structured SVM estimation problems deal with the minimisation of a loss rather than maximising a performance measure.
  • Π denotes the set of all permutations.
  • π(D, q, w) := argmax_{π ∈ Π} <w, Φ(D, q, π)>.
  • in general, this is an NP-hard problem.
  • the objective function depends on several entries of π at a time.
  • the latter is referred to as a quadratic assignment problem.
  • if the objective function can be written as Σ_i C_{i,π(i)}, it is a so-called linear assignment problem, which can be solved by the Hungarian Marriage method (the Kuhn-Munkres algorithm [16, 17]) in cubic time.
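The linear assignment step can be sketched with an off-the-shelf solver; `scipy.optimize.linear_sum_assignment` is used here as an assumed stand-in for a Kuhn-Munkres implementation. It minimises total cost, so the benefit matrix is negated to maximise:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative benefit matrix: C[i, j] is the value of placing item i at position j.
C = np.array([[4.0, 1.0, 3.0],
              [2.0, 0.0, 5.0],
              [3.0, 2.0, 2.0]])

# linear_sum_assignment minimises total cost, so negate to maximise total benefit.
rows, cols = linear_sum_assignment(-C)
print(cols.tolist(), C[rows, cols].sum())  # the optimal one-to-one assignment
```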
  • the integrality constraint W_ij ∈ {0, 1} can be relaxed to W_ij ≥ 0 without changing the solution.
  • the reason is that the constraint matrix is unimodular, hence it will always lead to an integral solution. Its dual can be obtained as
  • the method can be programmed as software that can be installed on a computer system having a processor with the necessary processing power to perform the optimisation method described here.
  • the computer system will also include input means, such as a network connection to receive the inputs on which to perform the invention, such as the training set.
  • the computer system will also include storage means suitable for storing the software and to provide the storage necessary for performing the method such as storing the final optimised scoring function and results of the iterations.
  • the computer system will also include output means such as a monitor to display the results to the user.
  • Table 1 Comparison of expected rank utility measure for three methods. Results averaged over 100 trials with 100 training users, 2000 input users and 800 training items.
  • a second application of our method is collaborative filtering (i.e. recommender systems).
  • our improvement is not in an improved choice of a kernel but rather in the improved choice of a loss function
  • Table 1 shows the merit of DORM: it outperforms JRank and Prank.
  • Proposed here is a general scheme for directly optimising ranking measures by learning the optimal permutation via the Hungarian Marriage method and structured SVM estimation. By optimising the performance scores directly, it outperforms methods which assume a reasonable directed acyclic graph in order to obtain rankings.
  • the invention could also be used for advertising (i.e. ranking advertisement placements) and document summarisation (i.e. ranking relevant phrases in a document for summarisation).
  • Yet a further embodiment relates to solving the so called "graph matching problem".
  • the feature map Φ, the loss function Δ and the algorithm used to solve the column generation are different. All the rest is the same in this graph matching embodiment and for succinctness is not repeated here.
  • a graph is characterized by a set of nodes and edges connecting pairs of nodes, where both the nodes and the edges have features associated to them.
  • the problem of graph matching consists in sorting the nodes in G in a way that the matching score (analogous to the ranking score) is maximised.
  • π represents the permutation of the labels in G induced by the new sorting, i.e. π(l) is the new rank of the node whose previous rank was l (π has precisely the same semantics as in the original ranking formulation).
  • φ1(G_l, G'_π(l)) is some similarity measure between the feature of node l in graph G and the feature of node π(l) in graph G'.
  • φ2(G_ij, G'_π(i)π(j)) is some similarity measure between the feature of edge (i, j) in graph G and the feature of edge (π(i), π(j)) in graph G'.
  • the entire term in brackets is precisely equivalent to the global scoring in the ranking formulation.
  • the function φ1(G_l, G'_π(l)) is equivalent to the local scoring function g in the ranking formulation.
  • the best analogy between the ranking formulation and the graph matching formulation is the following: the document set D represents one graph G while the query q represents another graph G'.
  • the individual documents in D correspond to individual nodes in G.
  • the query q is "split" into a set of "sub-queries", which would then correspond to the nodes of G'. Given this analogy between the two formulations (and apart from the quadratic term), the problems become equivalent.
  • Φ(G, G', π) = ( Σ_l φ1(G_l, G'_π(l)), Σ_{l,j} φ2(G_lj, G'_π(l)π(j)) )
  • the loss function for graph matching is much simpler than for ranking, and all types of scores (like NDCG, etc.) are not relevant in graph matching at all. All one cares about in graph matching is minimising the fraction of mismatches, so we can simply compute the loss as being the fraction of mismatches, i.e. Δ(π, π*) = (1/n) Σ_l [π(l) ≠ π*(l)], where π* is the predetermined preferred matching and n is the number of nodes.
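The fraction-of-mismatches loss for graph matching can be sketched directly:

```python
def mismatch_loss(pi, pi_star):
    """Fraction of nodes whose predicted match differs from the preferred matching.

    pi, pi_star: sequences where position l holds the matched node for node l."""
    assert len(pi) == len(pi_star)
    return sum(p != q for p, q in zip(pi, pi_star)) / len(pi)
```

For example, a matching that swaps two of four nodes relative to the preferred matching incurs a loss of 0.5.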

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention concerns the optimisation of a scoring function. Examples of applications of a scoring function include its use by a search engine to rank webpages that satisfy a user query according to the relevance of each webpage, or to solve a graph matching problem. For a query and a set of items, the scoring function is applied to provide an ordered list of items. Next, the inaccuracy of the ordered list is determined by comparing the ordered list to a corresponding predetermined preferred ordered list. Then the aggregated loss of the inaccuracy is calculated. Finally, the aggregated loss is approximately minimised by minimising the regularised convex upper bound of the aggregated loss using convex minimisation, in order to optimise the scoring function. By learning the optimal permutation of ranked items the resulting method is more computationally efficient and more accurate. The invention concerns a method, software and computer system.

Description

Optimisation of a Scoring Function
Technical Field
The invention concerns the technical field of optimising a scoring function. Performance of the scoring function can be measured using specific performance measures. For example, but not limited to, the invention concerns a scoring function used by an Internet Search Engine to rank webpages that satisfy a user query according to the relevance of each webpage. The invention concerns a method, software and computer system for optimising a scoring function.
Background Art
Ranking, and in particular webpage ranking, has long been a fertile research topic of machine learning. Ranking generally can be considered an attempt to learn an ordering of items, for example webpages.
Performance of a scoring function that is used to score items that are then ranked (i.e. ordered) is typically assessed using measures developed in the information retrieval community such as Normalised Discounted Cumulative Gain (NDCG) or Mean Reciprocal Rank (MRR) or Winner Takes All (WTA) or Expected Rank Utility (ERU), etc. They are used to address the issue of correctly evaluating the rankers, search engines or recommender systems.
In past years, many ranking approaches have been proposed. Ordinal regression using a Support Vector Machine (SVM)-like large margin method and Neural Networks were some of the first approaches. These were followed by Perceptrons and online methods. The state of the art is essentially to describe the partial order by a directed acyclic graph and to associate the cost incurred by ranking incorrectly with the edges of this graph. These methods aim to find the best ordering function over the returned documents. However, it is difficult to express complex measures (such as Normalised Discounted Cumulative Gain (NDCG) and Mean Reciprocal Rank (MRR)) in terms of this framework.
Any discussion of documents, acts, materials, devices, articles or the like which has been included in the present specification is solely for the purpose of providing a context for the present invention. It is not to be taken as an admission that any or all of these matters form part of the prior art base or were common general knowledge in the field relevant to the present invention as it existed before the priority date of each claim of this application.
Summary of the Invention
In a first aspect, the invention provides a method of optimising a scoring function comprising the steps of:
(a) for a query and a set of items, applying the scoring function to provide an ordered list of items; (b) determining the inaccuracy of the ordered list by comparing the ordered list to a corresponding predetermined preferred ordered list;
(c) calculating an aggregated loss of the inaccuracy; and
(d) approximately minimising the aggregated loss by minimising the regularised convex upper bound of the aggregated loss using convex minimisation, and thereby optimising the scoring function.
The step of determining the inaccuracy of the ordered list may comprise calculating a performance measure for the ordered list; and the step of calculating the aggregated loss of the inaccuracy may comprise calculating the aggregated loss for the performance measure.
The performance measure may be one of the following: Winner Takes All (WTA); Mean Reciprocal Rank (MRR); Discounted Cumulative Gain (DCG and DCG@k);
Normalised Discounted Cumulative Gain (NDCG and NDCG@k); Precision@k; or Expected Rank Utility (ERU).
The performance measure may be multivariate.
The method may be performed for multiple pairs of a query and set of items, and the aggregated loss is the sum of the losses of the performance measure for each set of items.
The loss of a performance measure for a query and a set of items is the inner product of values determined by the performance measure on the ordered list and values determined by the performance measure on the corresponding predetermined preferred ordered list. The loss of a performance measure for a query and a set of items may be given by:
Δ(π, y) := a(1)^T b(y) - a(π)^T b(y), where y is sorted in decreasing order, a relates to the position of an item in the ordered list as given by the performance measure, b relates to the score of the item as given by the performance measure, π is the permutation of the set and 1 denotes the identity permutation.
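Under the index convention assumed here (a(π) denotes the vector a permuted by π), the loss above can be sketched as follows; the sample values of a and b are illustrative, not taken from this document:

```python
import numpy as np

def delta_loss(a, b, pi):
    """Delta(pi, y) = <a(1), b(y)> - <a(pi), b(y)>: the gap between the measure's
    value on the ideal (identity) ordering and on the ordering pi actually produced.

    a: position weights (decreasing); b: per-item gains for y sorted decreasingly;
    pi: permutation given as a list of indices."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b - a[np.asarray(pi)] @ b)
```

The identity permutation incurs zero loss, and with a and b both decreasing any other permutation incurs a non-negative loss (rearrangement inequality).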
The step of calculating the performance measure of the ordered list may comprise calculating the performance measure of only a portion of the list, such as the highest scoring or lowest scoring items.
The scoring function may provide a score for each item, and the ordered list of items is based on the score for each item. The scoring function may be position dependent and/or may be a member of any one of the function classes neural networks, decision trees or rule based.
Where the set of items comprises a first graph where each item is a node of the graph, the scoring function may produce a new label for each node and the items are ordered according to the new label. The new label of the node may be based on the position of the node within the first graph and the edges between nodes in the ordered list.
Minimising the regularised convex upper bound of the aggregated loss may be defined by an optimisation problem having in its objective function C times an upper bound ξ_i on the losses incurred by the query and the set of items, plus a regulariser ||w||^2 / 2 which ensures that the scoring function is smooth. The optimisation problem may further comprise inequalities that ensure that the nonconvex loss Δ(π, y) is upper bounded in a convex fashion.
The optimisation problem may be given by:
minimise_{w,ξ} (1/2) ||w||^2 + C Σ_{i=1}^{l} ξ_i subject to w^T (Φ(D_i, q_i, 1) - Φ(D_i, q_i, π)) ≥ Δ(π, y_i) - ξ_i (I)
for all ξ_i ≥ 0, π ∈ Π and 1 ≤ i ≤ l, where Φ(D_i, q_i, π) is a so-called feature map, w is a parameter vector and l is the number of queries or sets of items. The minimisation of the regularised convex upper bound of the aggregated loss may be carried out directly via equation (I) and includes finding the weakest aspect of the current choice of the scoring function on the preferred ordered list.
Minimising the regularised convex upper bound of the aggregated loss may comprise optimising Wolfe's dual optimisation problem using Lagrange multipliers. The dual optimisation problem may be given by:
minimise_α (1/2) Σ_{i,π} Σ_{j,π'} α_{iπ} α_{jπ'} Φ(D_i, q_i, π)^T Φ(D_j, q_j, π') - Σ_{i,π} α_{iπ} Δ(π, y_i) (IIa)
subject to Σ_{π ∈ Π} α_{iπ} ≤ C and α_{iπ} ≥ 0 for all 1 ≤ i ≤ l. (IIb)
Minimising the regularised convex upper bound of the aggregated loss may include repeatedly finding the weakest aspect of a current choice of the scoring function and minimising the aggregated loss based on this weak aspect of the scoring function. Minimising the aggregated loss based on the weak aspect of the scoring function may be performed using column generation.
The step of approximately minimising the aggregated loss by minimising the regularised convex upper bound of the aggregated loss may comprise using linear assignment.
The regularised convex upper bound of the aggregated loss may involve dealing with a large number of constraints, and the method further comprises finding an ordered list that provides the maximal violator of the constraints, which is solved by linear or quadratic assignment. The linear assignment problem may be solved using the Hungarian Marriage method.
When an ordered list is found not to be the maximal violator of the constraint, the method may comprise decreasing the regularised upper bound on the loss by quadratic programming. The step of approximately minimising the aggregated loss by minimising the regularised convex upper bound of the aggregated loss may comprise using approximation algorithms, such as the Graduated Assignment algorithm.
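A sketch of the maximal-violator search, assuming the linear, position-weighted feature map discussed above; the position-weight vectors c and a, the model scores s and the relevance gains b are illustrative names, not identifiers from this document:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def most_violated_ranking(scores, gains, c, a):
    """Find the permutation pi maximising Delta(pi, y) + <w, Phi(D, q, pi)>.

    With the linear feature map assumed here, only
    sum_i (c_i * s_{pi(i)} - a_i * b_{pi(i)}) depends on pi, so the search is a
    linear assignment problem over (position i, document j) pairs."""
    s, b = np.asarray(scores, dtype=float), np.asarray(gains, dtype=float)
    c, a = np.asarray(c, dtype=float), np.asarray(a, dtype=float)
    benefit = np.outer(c, s) - np.outer(a, b)     # benefit[i, j]: doc j at position i
    rows, cols = linear_sum_assignment(-benefit)  # negate to maximise
    return cols                                   # cols[i] = document placed at position i
```

At w = 0 (all model scores zero) the most violated ranking is the one that maximises the loss alone, i.e. the reversal of the preferred order.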
The optimised scoring function may be applied to an Internet Search Engine to rank the display of web-pages that match a user's query. Alternatively, the optimised scoring function may be applied to information retrieval tasks, collaborative filtering, recommender systems, authorship identification queries, ranking advertisements or document summarisation tasks.
In yet a further alternative, the optimised scoring function may be applied to a graph matching problem. In this case, the preferred ordered list is comprised of nodes of a first graph, and the set of items are nodes of a second graph, where each node of the second graph is to be matched to a node of the first graph. Further, the query is based on each node of the second graph.
The predetermined preferred ordered list may be part of a training data set, such that each predetermined preferred ordered list has a corresponding query and a score for each item in the list.
Using the optimised scoring function, steps (a) to (d) may be repeated on the same or a different pair of a query and set of items.
Minimisation may be carried out by using a query and a set of items one at a time. Alternatively, the minimisation may be carried out using all the training data at once.
In a further aspect, the invention provides a computer system programmed to perform the method described above.
In yet a further aspect, the invention provides application software, that when installed on a computer system, is able to control the computer system to perform the method described above.
The invention concerns directly optimising ranking measures by learning the optimal permutation of ranked items (such as using the Hungarian Marriage method and structured SVM estimation). This provides the advantage that this method is more computationally efficient (does not spend time optimising unneeded parts of the list) and more statistically efficient leading to better accuracy (optimising the right cost function of loss rather than maximising only a vaguely related performance measure).
Brief Description of the Drawings
An example of the invention will now be described with reference to the accompanying drawings, in which:
Fig. 1 is a simplified flowchart of the method of one embodiment of the invention;
Fig. 2 shows a training data sample containing six documents and corresponding relevancy values;
Fig. 3 shows the possible ordering of the training data sample produced by a scoring function;
Fig. 4 shows the calculation for vectors a and b according to the performance measure DCG;
Fig. 5 shows the final calculation for the performance measure DCG;
Fig. 6 is pseudo code of the DORM algorithm (Algorithm 1);
Fig. 7 is a graph showing DORM vs. RankNet on synthetic data; and
Fig. 8 is a graph showing DORM vs. RankNet on a web search dataset, specifically 8(a) is a comparison between three methods at different truncation levels and 8(b) is a comparison between the three methods at different training sample sizes.
Best Modes of the Invention
An embodiment of the invention will now be described that relates to ranking of items. In this case the items (documents) are webpages and the query is a sample user query that could be received from the user via an Internet Search Engine.
An overview of the method of this embodiment will now be described with reference to Fig. 1.
Training data is provided as input 10, the input comprising a collection of queries, documents and their respective labels (scores), and the loss function to be used. The respective scores are used to provide the preferred order of the documents.
The method is performed on a computer system programmed by application software. The computer system provides input means to receive the training data, and processing means to perform the method in order to optimise the scoring function as set out here. The computer system also includes output means, such as a display device or printer, to provide to the user the results of the optimisation.
Using machine learning (i.e. supervised learning) we derive a scoring function f based on the training data. Firstly, we initialise the scoring function 12 by, say, setting the weight vector (regulariser) w to 0.
The scoring function f(d,q) may be determined by the inner product between a weight vector w and a feature map φ(d,q) depending on a document d and a query q. The scoring function f(d,q) may be a member of a set of function classes such as neural networks, decision trees, or rule based systems. The scoring function f(d,q) may be obtained implicitly by a position dependent inner product between a weight vector w and a feature map φ_i(d,q) depending on a document d, a query q and a position i.
The scoring function f assigns a relevancy score to every document d from a set of documents D that matches a query q 14. The documents can then be ranked according to their relevancy score to give the permutation π(D, q, f). The aim is to have the ranked order produced by the scoring function be as similar as possible to a predetermined ranked order of the set of documents D as described in the training data.
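This scoring-and-sorting step can be sketched as follows. The feature vectors φ(d_i, q) and the weight vector w below are toy values chosen for illustration, not a trained model:

```python
import numpy as np

def score(w, phi):
    """f(d_i, q) = <w, phi(d_i, q)> for every document, as one matrix product."""
    return phi @ w

def rank(scores):
    """Return the permutation listing documents from highest to lowest score."""
    return np.argsort(-scores)

# Toy feature vectors phi(d_i, q) for four documents and a toy weight vector.
phi = np.array([[0.2, 0.1, 0.0],
                [0.9, 0.3, 0.5],
                [0.4, 0.4, 0.1],
                [0.1, 0.0, 0.9]])
w = np.array([1.0, 0.5, 0.2])
print(rank(score(w, phi)))  # document indices, most relevant first: [1 2 3 0]
```

At test time this sort is all that is needed; the optimisation described below only affects how w is learned.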
We can assess the similarity using performance measures, such as NDCG@k (where k is the number of top ranked documents that are assessed using the performance measure). Like other performance measures, NDCG@k is irreducibly multivariate, that is, the performance measure depends on the scores of all the documents d, in this case the scores of documents d_1 to d_k.
Using NDCG@k we produce a performance measure that is between 0 and 1 (1 being the optimal solution) for every query (and therefore every permutation π) in the training data. The aim is to have every performance measure as close to 1 as possible. That is, we seek to minimise the difference between the performance measure and the optimum solution (the loss Δ(π,y)). We do this by optimising the loss defined by:
Δ(π, y) := a(1)ᵀb(y) − a(π)ᵀb(y) (III)

This formula is a generalisation of all the performance measures in the form of an inner product of a permuted list (in this case a(π), which represents the ranked list according to the ranking function) and a fixed list (in this case b(y), which represents the ranked list and corresponding relevancy scores according to the training set).
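As a sketch, the loss (III) can be computed directly from the vectors a and b and a candidate permutation π. The DCG-style choices of a and b below are illustrative toy values:

```python
import numpy as np

def loss(a, b, pi):
    """Delta(pi, y) = a(1)^T b(y) - a(pi)^T b(y), where a(pi)_i = a[pi[i]]."""
    return a @ b - a[pi] @ b

# DCG-style position weights and gains for relevance labels y = [2, 1, 0].
a = 1.0 / np.log(np.arange(1, 4) + 1.0)   # a_i = 1/log(i+1)
b = 2.0 ** np.array([2.0, 1.0, 0.0]) - 1  # b_i = 2^{y_i} - 1, y sorted descending
print(loss(a, b, np.array([0, 1, 2])))    # identity permutation: zero loss
print(loss(a, b, np.array([2, 1, 0])))    # reversed ranking: positive loss
```

Since y is sorted in descending order, the identity permutation attains the optimum and incurs zero loss; any other ranking incurs a non-negative loss.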
Optimisation of the scoring function f comprises solving equation (I). In the process of this we repeatedly need to find the permutations which violate the constraints in equation (Ib) the most. These permutations are found by solving a linear assignment problem (i.e. by the Hungarian Marriage method) 14b. The coefficients of the linear assignment problem are given by the scoring function f and the loss Δ(π,y), which determine the value of placing document i at position j 14a.
That is, we seek to minimise this loss over all queries and document set pairs in the training data - that is the aggregated loss. First we construct a function that is a regularised upper bound of the aggregated loss. This is defined by equation (I).
We minimise the loss by looking at the dual problem of minimising the upper bound. The minimisation of the upper bound is carried out by minimising the problem defined in equation (II).
To deal with the large number of constraints, column generation is used. This means that the optimisation problem can be solved by iteratively solving a linear assignment problem. Using the Hungarian Marriage method one finds the maximal violator of the constraint. That is, the Hungarian Marriage method generates columns that provide an indication of how we can improve the parameter w. To find the maximal violator of the constraint in (II) we seek to solve
argmax_{π ∈ Π} wᵀΦ(D,q,π) + Δ(π,y) (IV)

We are looking for the π where (IV) is the tightest. This problem can be solved efficiently when Φ(D,q,π) decomposes into

Φ(D,q,π) = Σ_{i=1}^{m} c_{π(i)} φ_{π(i)}(d_i, q)
Typically one will choose all feature maps to be identical functions.
Next, we determine whether π is a permutation which we have not observed before. If yes, this optimisation will provide us with an improved value for w. Given that f(d_i, q) = ⟨φ(d_i, q), w⟩, we optimise w to achieve the value for f(d_i, q) that is closest to the predetermined relevancy score of d_i as defined in the training data and that is suitable for use with all queries 16.
If the permutation π is not the identity permutation or has been found previously then we decrease the regularised upper bound on the loss by quadratic programming using the new permutation π and all previously found permutations 18. In this way the optimisation problem is iteratively solved.
The method of the invention will now be described in further detail below with reference to Figs. 2 to 8.
Commercial rankers, search engines or recommender systems usually employ a two-stage procedure to obtain results given a query q: in a first stage documents matching the query, D := {d_1, . . . , d_m}, are obtained and subsequently these documents are ranked. The job of the ranker is to sort the documents to meet some ranking measure as described below. This is often done by means of a score function f(d, q), which assigns a score to every document given the query (for computational efficiency it is not desirable to have f depend on {d_1, . . . , d_m} jointly). Performance of the ranker is typically measured by means of a set of labels Y := {y_1, . . . , y_m} with y_i ∈ [0, . . . , r], where 0 corresponds to 'irrelevant' and r corresponds to 'highly relevant'. This is shown in Fig. 2. Commercial search engines or recommender systems often use fewer than ten levels of relevance.
At training time we are given l instances of queries q_i, document collections D_i of cardinality m_i, and labels Y_i with |Y_i| = |D_i|. It is our goal to find some mapping f such that the ranking of a new collection of documents d_1, . . . , d_m obtained by sorting f(d_i, q) agrees well with Y in expectation. For the documents d_1, . . . , d_6 given in Fig. 2 a possible ranking produced by f is shown in Fig. 3, where the ranking is determined according to the relevancy score given to each document d_1, . . . , d_6 and then the documents are arranged in order of relevance.
For convenience we denote by π(D, q, f) the permutation obtained for the collection D of documents d_i given the query q and the scoring function f. We will drop the arguments D, q, f wherever it is clear from the context. Moreover, given a vector v ∈ Rᵐ we denote by v(π) the permutation of v according to π, i.e. v(π)_i = v_{π(i)}. Finally, without loss of generality we assume that y is sorted in descending order, i.e. most relevant documents first. That is, the identity permutation π = 1 will correspond to the 'optimal' solution. This allows us to express indexing operations more easily.
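For concreteness, the convention v(π)_i = v_{π(i)} corresponds to integer array indexing (a toy vector, with zero-based indices in place of the text's one-based ones):

```python
import numpy as np

v = np.array([10.0, 20.0, 30.0])
pi = np.array([2, 0, 1])  # zero-based; pi(1)=3, pi(2)=1, pi(3)=2 in 1-based notation
v_pi = v[pi]              # v(pi)_i = v_{pi(i)}
print(v_pi)               # [30. 10. 20.]
```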
Next, we must assess how well the produced ranking of Fig. 3 agrees with Y as shown in Fig. 2. One of many performance measures can be used. Many widely-used performance measures in the information retrieval community are irreducibly multivariate. By irreducibly multivariate we mean that these performance measures depend on the scores of all the documents (i.e. the ordering). For example, 'Winner Takes All' (WTA), Mean Reciprocal Rank (MRR), Mean Average Precision (MAP), and Normalised Discounted Cumulative Gain (NDCG) all fulfil this property. The AUC score could be used for ranking purposes, but its disadvantage lies in the fact that it does not set the priority where to optimise the ranking.
The performance measures will now be discussed in more detail.
Winner Takes All (WTA)
If the first document is relevant, i.e. if y(π)_1 = y_1, the performance score is 1, otherwise 0. We may rewrite this as

WTA(π, y) = a(π)ᵀb(y) where a_i = δ(i, 1) and b_i = δ(y_i, y_1). (V)
Note that there may be more than one document which is considered relevant.
Mean Reciprocal Rank (MRR)

We assume that there exists only one top ranking document. In this case MRR is defined as

MRR(π, y) = a(π)ᵀb(y) where a_i = 1/i and b_i = δ(i, 1). (VI)
In other words, the reciprocal rank is the inverse of the rank assigned to document d_1, the most relevant document. MRR derives its name from the fact that this quantity is typically averaged over several queries (hence the mean).
Discounted Cumulative Gain (DCG and DCG@k)
The previous two scores use only a single entry of π, namely π(1), to assess the quality of the ranking. A more balanced score is the Discounted Cumulative Gain. It is defined as

DCG(π, y) = a(π)ᵀb(y) where a_i = 1/log(i + 1) and b_i = 2^{y_i} − 1. (VII)
Here it pays if a relevant document is retrieved with a high rank, as the coefficients a_i are monotonically decreasing. Note that DCG takes all ranks into account. A variant of DCG is DCG@k, where a_i = 1/log(i + 1) if i ≤ k and a_i = 0 otherwise. That is, we only care about the k top ranking entries. In search engines the truncation level k is typically 10, as this constitutes the number of hits returned on the first page of a search.
Normalised Discounted Cumulative Gain (NDCG and NDCG@k)

A downside of DCG is that its range depends on y (e.g. a collection containing lots of relevant documents will lead to a considerably larger value at optimality than one containing only irrelevant documents). Since y is already sorted, it is easy to see that DCG is maximised for π = 1, i.e. for the identity map. We define

NDCG(π, y) := DCG(π, y)/DCG(1, y) and NDCG@k(π, y) := DCG@k(π, y)/DCG@k(1, y). (VIII)

This allows us to define

NDCG(π, y) = a(π)ᵀb(y) where a_i = 1/log(i + 1) and b_i = (2^{y_i} − 1)/DCG(1, y). (IX)

Finally, NDCG@k, the measure on which we focus here, is given by a(π)ᵀb(y) where

a_i = 1/log(i + 1) if i ≤ k and a_i = 0 otherwise, and b_i = (2^{y_i} − 1)/DCG@k(1, y). (X)
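A sketch of NDCG@k in the inner-product form a(π)ᵀb(y) of (X). The relevance labels below are toy values and, as in the text, y is assumed sorted in descending order:

```python
import numpy as np

def ndcg_at_k(pi, y, k):
    """NDCG@k(pi, y) = a(pi)^T b(y): a_i = 1/log(i+1) for i <= k (else 0),
    b_i = (2^{y_i} - 1) / DCG@k(1, y); y sorted descending, pi zero-based."""
    i = np.arange(1, len(y) + 1)
    a = np.where(i <= k, 1.0 / np.log(i + 1.0), 0.0)
    gains = 2.0 ** np.asarray(y, dtype=float) - 1.0
    b = gains / (a @ gains)   # a @ gains = DCG@k of the identity permutation
    return a[pi] @ b

y = [3, 2, 1, 0]                                 # toy labels, already sorted
print(ndcg_at_k(np.arange(4), y, 3))             # identity permutation scores 1.0
print(ndcg_at_k(np.array([3, 2, 1, 0]), y, 3))   # reversed ranking scores below 1
```

By construction the identity permutation attains the maximal value 1, since the normaliser is the DCG@k of that very permutation.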
Precision@k

Note that this measure, too, can be expressed by a(π)ᵀb(y). Here we define

a_i = 1/k if i ≤ k and a_i = 0 otherwise, and b_i = 1 if y_i is correct and b_i = 0 otherwise. (XI)
The main difference to NDCG is that Precision@k has no decay factor. This corresponds to weighting the top k answers equally.
Expected Rank Utility (ERU)

Another measure, called Expected Rank Utility (ERU), imposes an exponential decay on the top ranked items. It can be represented as

ERU(π, y) := a(π)ᵀb(y) where a_i = 2^{−(i−1)/(α−1)} and b_i = max(y_i − d, 0), (XII)

where d is a neutral vote and α is the viewing half-life. The normalised ERU can also be defined in a similar manner to NDCG. The (normalised) ERU is often used in collaborative filtering for recommender systems, where the lists of items are often very short.
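A sketch of ERU in the same inner-product form. The half-life α and neutral vote d below are illustrative choices, not values prescribed by the text:

```python
import numpy as np

def eru(pi, y, alpha=5.0, d=3.0):
    """ERU(pi, y) = a(pi)^T b(y): a_i = 2^{-(i-1)/(alpha-1)},
    b_i = max(y_i - d, 0); pi is zero-based."""
    i = np.arange(1, len(y) + 1)
    a = 2.0 ** (-(i - 1.0) / (alpha - 1.0))
    b = np.maximum(np.asarray(y, dtype=float) - d, 0.0)
    return a[pi] @ b

y = [5, 4, 2, 1]                        # toy ratings, sorted descending
print(eru(np.arange(4), y))             # identity permutation
print(eru(np.array([3, 2, 1, 0]), y))   # reversed ranking scores lower
```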
To continue the example of Fig. 2 and Fig. 3, the ranking produced is evaluated according to NDCG. First a value for a and b must be calculated for each document.
This calculation is shown in Fig. 4, where a generally relates to the importance we assign to a document's position in the ranked order and b relates to the true relevance value of the document as defined in the training data. A final performance score is then calculated according to Fig. 5. The closer this score value is to 1, the better the ranking order of Fig. 3 is.
While WTA and MRR are important metrics to measure the 'goodness' of a search engine from the scientific point of view, generally NDCG@k better models a person's judgment of a search engine: the results on the first page are important, and among them there should be a decay factor. NDCG has the further advantage that it is more structured and more general than WTA and MRR. Meanwhile, for collaborative and content filtering, ERU is often preferred.
Optimisation is concerned with a minimisation of loss. Typical structured SVM estimation problems deal with the minimisation of a loss rather than maximising a performance measure. Hence we define the following modification to obtain a loss given vectors a and b corresponding to any of the performance measures above (using the fact that y is sorted in decreasing order, hence π = 1 is the optimal permutation) according to equation (III).
The important point is that the performance measures discussed here can be expressed as an inner product (that is a summation of corresponding a and b vectors).
Optimisation

The Bound

The key inequality we exploit in obtaining a convex upper bound on the problem of minimising the loss Δ(π, y) is the following lemma.
Lemma 1 Let f, g : X → R for some domain X and denote by x_f := argmax_{x∈X} f(x) the maximiser of f. Assume that the minimiser of g exists and denote it by x_g. Then if f(x_g) − f(x) ≥ g(x) − g(x_g) − ξ for all x ∈ X and ξ ≥ 0, we have ξ ≥ g(x_f) − g(x_g). The same inequality on ξ holds if f(x_g) − f(x) ≥ 1 − ξ/(g(x) − g(x_g)) for all x.
Proof
We only prove the first claim, as the second claim can be proven in the same fashion. Since the inequality needs to hold for x = x_f and since, by definition, f(x_g) − f(x_f) ≤ 0, we have 0 ≥ g(x_f) − g(x_g) − ξ. Rearrangement proves the claim.
Note that this convex upper bound is tight for x_f = x_g and if the minimal ξ is chosen. Obviously we may drop g(x_g) if the minimum of g is 0. Lemma 1 allows us to cast a minimisation problem in terms of f into a convex one.
T. Joachims, A support vector method for multivariate performance measures in Proc. Intl. Conf Machine Learning, 2005 proposes an algorithm for learning multivariate performance measures based on this relaxation. However, the method suits only the single-query case (for example the ROCArea can be considered as one query with l! labels).
We are also able to apply the same strategy and obtain the following optimisation for the multivariate case according to equation (I).
Here Π denotes the set of all permutations. With some abuse of notation we refer in this case to all permutations on the sets {1, . . . , m_i} respectively. The discriminant functions are of the form π(D,q,w) = argmax_{π∈Π} wᵀΦ(D,q,π). Clearly
Σ_i ξ_i ≥ Σ_i Δ(π(D_i,q_i,w), y_i) by virtue of Lemma 1. A similar problem can be obtained by rescaling the slack variables with Δ(π_i, y_i). The dual problem of (I) is given by equations (IIa) and (IIb).
The Feature Map

So far we have completely avoided discussing the structure of φ(D,q,π). The initial requirement of having a scoring function f for each (d_i, q) pair implies that φ(D,q,π) should be additive in (d_i, q). Moreover, we want wᵀφ(D,q,π) to be easily optimised with respect to π. Both can be achieved if we define

φ(D,q,π) = Σ_i c(π)_i φ(d_i, q) where c ∈ Rᵐ. (XIII)
In this case wᵀφ(D,q,π) = c(π)ᵀp where p_i = wᵀφ(d_i, q). The problem of choosing φ(d_i, q) is ultimately problem-dependent, as we do not know in general what data type the documents and queries are composed of. We will discuss various choices below.
The advantage is that maximisation of c(π)τp is achieved by simply sorting c such that it has the same order as p, as follows from the Polya-Littlewood-Hardy inequality.
Assume that c is stored in descending order. In this case, for the maximiser π* of c(π)ᵀp, we need the entry π*_i to denote the rank of the i-th element of p. For example, if p_2 denotes the fourth largest entry in p we need π*_2 = 4. This permutation is easily obtained by applying Quicksort to p, i.e. in O(m log m) time. To obtain only the k top-ranking terms we would need to expend only O(m + k log k) time, using a divide and conquer procedure analogous to Quicksort.
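This sort-based maximisation can be sketched as follows. The vectors c and p are toy values, with c stored in descending order as assumed above:

```python
import numpy as np

def argmax_permutation(c, p):
    """Maximise c(pi)^T p = sum_i c[pi[i]] * p[i] by pairing the largest
    entries of p with the largest entries of c (Polya-Littlewood-Hardy)."""
    pi = np.empty(len(p), dtype=int)
    pi[np.argsort(-p)] = np.arange(len(p))  # pi[i] = rank of p[i] (0 = largest)
    return pi

c = np.array([1.0, 0.5, 0.25, 0.1])  # descending, as assumed in the text
p = np.array([0.3, 2.0, -1.0, 0.7])
pi = argmax_permutation(c, p)
print(pi)         # [2 0 3 1]: p[1] is the largest entry, so it receives c[0]
print(c[pi] @ p)  # maximal value of c(pi)^T p
```

A single argsort of p suffices, which is what reduces inference to an O(m log m) sort.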
This leaves the issue of choosing c. In general we want to choose it such that a) the margin of (I) is maximised and b) the average length ‖φ(D,q,π)‖² is small, for good generalisation.
Since φ is linear in c, we could employ a kernel optimisation technique to obtain an optimal value of c, such as those proposed in [2, 18]. While in principle not impossible, this leads rather inevitably to a highly constrained (at worst semidefinite) optimisation problem in terms of c and w.
The Algorithm
To deal with the huge number of constraints in (I) we use the approach of [20], that is, column generation. This requires us to find violators of the constraints at each step. In analogy to [20] we have an algorithm for 'Direct Optimisation of Ranking Measures' (DORM) (see Algorithm 1 of Fig. 6). Algorithm 1 has good convergence properties. It follows from [20] that it terminates after adding at most max[2nΔ/ε, 8CΔR²/ε²] constraints, where Δ is an upper bound on the loss (it is 1 in the NDCG case) and R is an upper bound on ‖Φ(D,q,π)‖.
Hungarian Marriage
The Constraint Violators
One of the key ingredients in our problem is an efficient algorithm to find a permutation π which finds the maximal violator of the constraints of (I). This means that we need to solve the problem of equation (IV).
In general, this is an NP-hard problem, for example whenever the objective function depends on several entries of π at a time. The latter is referred to as a quadratic assignment problem. However, in the case where the objective function can be written as Σ_i C_{i,π_i}, it is a so-called linear assignment problem which can be solved by the Hungarian Marriage method (the Kuhn-Munkres algorithm [16, 17]) in cubic time.
Lemma 2
Provided that φ(D,q,π) is given by (XIII), the problem (IV) is solved by a linear assignment problem of finding

argmax_{π ∈ Π} Σ_{i=1}^{m} C_{i,π_i} where C_{ij} = c_j p_i − a_j b_i. (XIV)
Proof
By construction Δ(π, y) = a(1)ᵀb(y) − a(π)ᵀb(y). Moreover, due to (XIII), also wᵀΦ(D,q,π) = c(π)ᵀp. Evaluating terms proves the claim.
Note that there is a special case in which the problem (IV) can be solved by a simple sort: whenever a = c the problem reduces to maximising a(π)ᵀ(p − b(y)). This choice, however, is not always desirable as a may be rather degenerate, as pointed out above.
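The cost matrix of Lemma 2 and its solution can be sketched with SciPy's Hungarian-method solver. The vectors a, b, c and p below are toy values, and since `linear_sum_assignment` minimises, the cost matrix is negated:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def most_violated_permutation(a, b, c, p):
    """Solve argmax_pi sum_i C[i, pi[i]] with C_ij = c_j p_i - a_j b_i (XIV)."""
    C = np.outer(p, c) - np.outer(b, a)     # C[i, j] = c[j]*p[i] - a[j]*b[i]
    rows, cols = linear_sum_assignment(-C)  # Hungarian method on negated costs
    return cols                             # cols[i] = position assigned to document i

a = np.array([1.0, 0.6, 0.4])  # position weights of the performance measure
b = np.array([3.0, 1.0, 0.0])  # gains of the true labels y
c = np.array([1.0, 0.6, 0.4])  # feature-map coefficients
p = np.array([0.2, 0.9, 0.1])  # current scores p_i = w^T phi(d_i, q)
print(most_violated_permutation(a, b, c, p))  # [2 1 0]
```

Each call returns the permutation achieving the largest value of wᵀΦ + Δ, which is exactly the column added in each iteration of DORM.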
The Linear Assignment Problem
It is well known that there exists a convex relaxation of the problem of maximising Σ_i C_{i,π_i} into a linear problem which leads to an optimal solution [16]. More specifically, the linear assignment problem is

maximise_W tr WᵀC subject to Σ_i W_{ij} = 1, Σ_j W_{ij} = 1 and W_{ij} ∈ {0, 1}. (XV)
This can be relaxed by replacing W_{ij} ∈ {0, 1} with W_{ij} ≥ 0 without changing the solution. The reason is that the constraint matrix is unimodular, hence it will always lead to an integral solution. Its dual can be obtained as
minimise_{u,v} Σ_i u_i + v_i subject to u_i + v_j ≥ C_{ij}. (XVI)
To solve (XVI) several methods exist: one could use an interior point algorithm or the Hungarian Marriage method of H. W. Kuhn, The Hungarian method for the assignment problem, Naval Research Logistics Quarterly, 2:83-97, 1955. Alternatively, a computationally efficient algorithm and code, due to [15], are readily available. It uses modern techniques for computing the shortest path problem arising in (XVI). Note that the linear assignment problem can be solved in O(m³) time.
This means that we can check whether a particular set of documents and an associated query (D_i, q_i) satisfies the inequality constraints of the structured estimation problem (I). Hence we have the subroutine necessary to make the DORM algorithm work. In particular, this is the only subroutine we need to replace in SVMStruct [20].
The method can be programmed as software that can be installed on a computer system having a processor with the necessary processing power to perform the optimisation method described here. The computer system will also include input means, such as a network connection to receive the inputs on which to perform the invention, such as the training set. The computer system will also include storage means suitable for storing the software and to provide the storage necessary for performing the method such as storing the final optimised scoring function and results of the iterations. Of course, the computer system will also include output means such as a monitor to display the results to the user.
Experiments

Web page ranking experiments
In this section, we perform experiments to compare our method with RankNet by C.J.C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender, Learning to rank using gradient descent, International Conference on Machine Learning ICML 22, 2005, which is, to the best of our knowledge, currently being used for a large commercial search engine. We also made an attempt to compare our method against methods described by Joachims. However, these methods appear to be too slow for very large datasets. While our implementation only takes 20 minutes for training (or per iteration in cross validation) for the web search dataset below, Joachims' methods take more than two or three days, which makes cross validation and model selection impractical. This is because, unlike the methods proposed by Joachims, our method focuses on only a fraction of returned documents and thus achieves much faster training times. We also observe that while our implementation takes around 20 iterations to converge, Joachims' method takes more than 100 iterations to achieve convergence.
To demonstrate the merit of our method, in the experiments below we only use a very basic setting, c_i = 1/log(i+1), and show that with this setting we can easily achieve better performance results than other methods.
Choosing c_i = a_i, where a is defined as above, has an additional benefit: in this case the linear assignment problem is even simpler, as it can be solved directly by sorting, even during the training phase. Note that during testing we can always solve the problem by a sort operation. Depending on the number of documents per query such a choice can accelerate training significantly, as we have reduced an O(m_i³) operation to one which costs only O(m_i log m_i) time.
Artificial data
For comparison purposes we follow the same experimental protocol as described in [5] and generate artificial random cubic polynomial data with five levels of relevance. The dataset consists of 5000 queries with 50 documents per query. We compare our method (directly optimising over NDCG) against RankNet-1 (a linear 1 layer Neural Network) and RankNet-2 (a Neural Network with 2 layers). To adjust parameters and compare models we use nested cross validation. That is, within each 90% subset used for estimation, we use 10-fold cross validation to adjust the model and then we use the remaining 10% to obtain an unbiased error estimate.
To demonstrate the merit of our method, we only use linear kernels in this experiment. That is, we are describing the same function class as RankNet-1. We plot the mean and standard errors and notice that our method does indeed outperform RankNet with 1 and 2 layers for all truncation levels from 1 to 10 (see Fig. 7), that is, the number of relevant documents k used in NDCG@k. Note that we are more than 4% better than the linear model trained by the RankNet-1 approach, even though the function classes are identical. This is a very significant gain.
Web search data
We carry out experiments on a real-world web search dataset consisting of 1000 queries for training, 1000 queries for testing and 1000 queries for validation. These sets are selected and provided from a larger pool of training sets for a commercial search engine. We used this particular protocol, in order to obtain comparable results to [5]. The average number of documents per queries is around 50. The ratio between documents that are labelled as Bad, Fair, Good, Excellent, Perfect on each set is around 75:17:15:2:1. The length of the feature vectors is 367.
In the first experiment, we use the full training set, train our model using the DORM algorithm, cross validate to select the best parameter set and then perform prediction on test data. We report the results for various truncation levels ranging from 1 to 10. The results are shown in Figs. 8(a) and 8(b).
Table 1: Comparison of expected rank utility measure for three methods. Results averaged over 100 trials with 100 training users, 2000 input users and 800 training items.
In the second experiment, we investigate the influence of the training set size by computing the error using the first 100, 300, 500, 700 or all 1000 queries for training. We repeat the above procedure for model selection and prediction and report the results for NDCG@10.
In both experiments, we observe that our method with a linear kernel and a Gaussian RBF kernel consistently exceeds RankNet-1 and RankNet-2 by around 1 and 2 percent (respectively), which is considered a very good improvement in the search industry.

Collaborative and content filtering
A second application of our method is collaborative filtering (i.e. recommender systems). Again, to make the point that our improvement is not in an improved choice of a kernel but rather in the improved choice of a loss function, we follow the experimental setup, choice of kernels, and preprocessing of [1]. That is, we use the EachMovie dataset which consists of 72916 users and 2811983 ratings on 1628 movies. More specifically we use the combination of kernels between users, between movies, and joint kernels as suggested in Basilico and Hofmann.
We perform our experiments using DORM with expected rank utility, using the best types of combination for joint kernels reported in their first and second experiments. Table 1 shows the merit of DORM: it outperforms JRank and Prank.
Having used a kernel which is optimal for JRank we expect that optimising the kernel further would lead to better results, as there is no reason to assume that the model class optimal for JRank would be the best choice for DORM, too.
Proposed here is a general scheme for directly optimising ranking measures by learning the optimal permutation via the Hungarian Marriage method and structured SVM estimation. By optimising the performance scores directly, it outperforms methods which assume a reasonable directed acyclic graph in order to obtain rankings.
There are clear applications of the current work. We could use our method directly for information retrieval tasks or authorship identification queries. In the latter case, the query q_i would consist of a collection of documents written by one author.
The invention could also be used for advertising (i.e. ranking advertisement placements) and document summarisation (i.e. ranking relevant phrases in a document for summarisation). Online algorithms along the lines of S. Shalev-Shwartz and Y. Singer, Online learning meets optimisation in the dual [19], can easily be accommodated. This should allow us to deal with massive datasets.
Even further applications concern authorship verifiers (q is a sample document, d_i are the possible documents written by the author), document retrieval (q is the topic and d_i are the images, documents, audio, etc.) and medical applications (q is the medical record/health status, and d_i are the possible treatments/medications).
Yet a further embodiment relates to solving the so-called "graph matching problem". In this embodiment the feature map Φ, the loss function Δ, and the algorithm used to solve the column generation are different. All the rest is equal in this graph matching embodiment and for succinctness is not repeated here.
In the graph matching problem, one has two collections of objects, and each collection is called a graph. We can assume for simplicity that the two graphs have the same number of objects. We call the objects in a graph nodes, and they are distinguished by labels (e.g. node 1, node 2, etc.). It turns out that not only the nodes in a graph can have properties (features), but pairs of nodes in a graph can have certain properties too.
These pairs of nodes are called edges. Therefore, a graph is characterized by a set of nodes and edges connecting pairs of nodes, where both the nodes and the edges have features associated to them.
Now assume (without loss of generality) that the labels of the nodes in one of the graphs (say graph G) are 1, . . . , N (assign different labels to arbitrary different nodes). So we will have features for the nodes, say G_1, G_2, . . . , G_N, and features for the edges, say G_{11}, G_{12}, . . . , G_{NN}. Similarly, we can label the nodes of the other graph (say graph G') with labels 1, . . . , N (also assigning different labels to arbitrary different nodes), and similarly we have node features and edge features as before (i.e. G'_1, etc.).
The feature map

The problem of graph matching consists in sorting the nodes in G in a way that the matching score s (analogous to the ranking score) is maximised. The matching score is also assumed to be linear, as in ranking, i.e.

s(G, G', π) = Σ_i [ c_{i,π(i)} + Σ_j d_{ij,π(i)π(j)} ]. (XVII)

Here, π represents the permutation of the labels in G induced by the new sorting, i.e. π(i) is the new rank of the node whose previous rank was i (π has precisely the same semantics as in the original ranking formulation). c_{i,π(i)} is some similarity measure between the node feature G_i in graph G and the node feature G'_{π(i)} in graph G'. Similarly, d_{ij,π(i)π(j)} is some similarity measure between the edge feature G_{ij} in graph G and the edge feature G'_{π(i)π(j)} in graph G'. Note that the essential difference between this problem and the problem of inferring the best ranking in the ranking formulation is that now there is a second term which does not depend only on the entries i and π(i), but also on j and π(j), i.e. a quadratic term in addition to the linear term.
Note that the entire term in brackets is precisely the equivalent of the global scoring function in the ranking formulation. The function c_{i,π(i)} is equivalent to the local scoring function g in the ranking formulation. It can then be seen that the best analogy between the ranking formulation and the graph matching formulation is the following: the document set D represents one graph, G, while the query q represents another graph, G'. The individual documents d_i in D correspond to individual nodes G_i in G. To make the semantics map one to one, one needs to consider that the query q is "split" into a set of "sub-queries" q_i, which would then correspond to the nodes of G'. Given this analogy between the two formulations (and apart from the quadratic term), the problems become equivalent. If we parameterise both c and d linearly, as c_{i,π(i)} = ⟨w_1, φ_1(G_i, G'_{π(i)})⟩ and d_{ij,π(i)π(j)} = ⟨w_2, φ_2(G_{ij}, G'_{π(i)π(j)})⟩, then from (XVII) the feature map can easily be shown to be
Φ(G, G', π) = ( Σ_i φ_1(G_i, G'_{π(i)}), Σ_{ij} φ_2(G_{ij}, G'_{π(i)π(j)}) ) (XVIII)

where φ_1 and φ_2 are local feature maps that relate node and edge features respectively, and are problem-dependent. Note that φ_1 is analogous to φ(d_i, q) in the ranking formulation, and that (XVIII) above is the analogue of the feature map decomposition (XIII) in the ranking formulation.
The loss function
The loss function for graph matching is much simpler than for ranking: scores such as NDCG are not relevant here. All one cares about in graph matching is minimising the fraction of mismatches, so the loss can simply be computed as the fraction of mismatches, i.e.:

Δ(y, π) = 1 − (1/N) Σ_{l=1}^{N} [y(l) = π(l)]   (XIX)

which is the loss incurred when inferring the permutation π when the correct permutation is y; here N is the number of nodes and [·] equals 1 when its argument is true and 0 otherwise.

The column generation algorithm
Solving the problem given in equation (XVII) is NP-hard in general, since it is a quadratic assignment problem, in contrast to the linear assignment problem obtained in the ranking formulation (which is solvable efficiently). One therefore needs to resort to approximate algorithms. The well-known Graduated Assignment algorithm [22] can be used to find an approximate solution to the problem.
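The following is a minimal, self-contained sketch of the Graduated Assignment idea of [22] — not the algorithm as published, and not part of the claimed method. All parameter values and names are illustrative; for brevity the quadratic (edge) term enters only through the optional callable grad_edge, so with grad_edge=None the routine degenerates to a soft linear assignment:

```python
import numpy as np

def graduated_assignment(C, grad_edge=None, beta0=0.5, beta_max=10.0,
                         rate=1.075, n_inner=5, n_sinkhorn=30):
    """Approximate maximiser of the (NP-hard) quadratic assignment score.
    C[l, a] is the node similarity (the c term); grad_edge(M), if given,
    returns the partial derivative of the edge term at the soft match M."""
    n = C.shape[0]
    M = np.full((n, n), 1.0 / n)                 # soft doubly-stochastic match
    beta = beta0
    while beta < beta_max:                       # annealing schedule
        for _ in range(n_inner):
            Q = C + (grad_edge(M) if grad_edge is not None else 0.0)
            M = np.exp(beta * (Q - Q.max()))     # softmax, shifted for stability
            for _ in range(n_sinkhorn):          # Sinkhorn row/column balancing
                M /= M.sum(axis=1, keepdims=True)
                M /= M.sum(axis=0, keepdims=True)
        beta *= rate
    return M.argmax(axis=1)                      # discretise to a permutation
```

For the full quadratic problem, grad_edge(M) would return the matrix with entries Σ_{j,b} d_{la,jb} M[j, b]; as the annealing parameter β grows, the soft match M is pushed towards a hard permutation.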
It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.
References
[1] J. Basilico and T. Hofmann. Unifying collaborative and content-based filtering. In Proc. Intl. Conf. Machine Learning, 2004.
[2] O. Bousquet and D. Herrmann. On the complexity of learning the kernel matrix. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, 2002.
[3] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pages 43-52, 1998.
[4] C. J. C. Burges. Ranking as learning structured outputs. In S. Agarwal, C. Cortes, and R. Herbrich, editors, Learning to Rank, NIPS Workshop at NIPS*2005, 2005.
[5] C. J. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In International Conference on Machine Learning ICML 22, 2005.
[6] R. Caruana, S. Baluja, and T. Mitchell. Using the future to sort out the present: Rankprop and multitask learning for medical risk evaluation. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 959-965, Cambridge, MA, 1996. MIT Press.
[7] W. Chu and S. S. Keerthi. New approaches to support vector ordinal regression. In Proc. Intl. Conf. Machine Learning, 2005.
[8] K. Crammer. Online Learning for Complex Categorical Problems. PhD thesis, Hebrew University of Jerusalem, 2005.
[9] K. Crammer and Y. Singer. Pranking with ranking. In Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.
[10] K. Crammer and Y. Singer. Loss bounds for online category ranking. In Proc. Annual Conf. Computational Learning Theory, 2005.
[11] E. F. Harrington. Online ranking/collaborative filtering using the perceptron algorithm. In International Conference on Machine Learning ICML 20, 2003.
[12] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 115-132, Cambridge, MA, 2000. MIT Press.
[13] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In Proc. 23rd SIGIR, pages 41-48. New York: ACM, 2002.
[14] T. Joachims. A support vector method for multivariate performance measures. In Proc. Intl. Conf. Machine Learning, 2005.
[15] R. Jonker and A. Volgenant. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38:325-340, 1987.
[16] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83-97, 1955.
[17] J. Munkres. Algorithms for the assignment and transportation problems. Journal of SIAM, 5(1):32-38, 1957.
[18] C. S. Ong, A. J. Smola, and R. C. Williamson. Hyperkernels. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 478-485. MIT Press, Cambridge, MA, 2003.
[19] S. Shalev-Shwartz and Y. Singer. Online learning meets optimisation in the dual. In H. U. Simon and G. Lugosi, editors, Computational Learning Theory (COLT), LNCS. Springer, 2006. Extended version.
[20] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 2005.
[21] E. Voorhees. Overview of the TREC 2001 question answering track. In TREC, 2001.
[22] S. Gold and A. Rangarajan. A graduated assignment algorithm for graph matching. IEEE Trans. PAMI, 18(4):377-388, 1996.

Claims

The claims defining the invention are as follows:
1. A method of optimising a scoring function comprising the steps of:
(a) for a query and a set of items, applying the scoring function to provide an ordered list of items;
(b) determining the inaccuracy of the ordered list by comparing the ordered list to a corresponding predetermined preferred ordered list;
(c) calculating an aggregated loss of the inaccuracy; and
(d) approximately minimising the aggregated loss by minimising the regularised convex upper bound of the aggregated loss using convex minimisation, and thereby optimising the scoring function.
2. The method according to claim 1, wherein the step of determining the inaccuracy of the ordered list comprises calculating a performance measure for the ordered list; and the step of calculating the aggregated loss of the inaccuracy comprises calculating the aggregated loss for the performance measure.
3. The method according to claim 2, wherein the performance measure is one of the following:
Winner Takes All (WTA);
Mean Reciprocal Rank (MRR);
Discounted Cumulative Gain (DCG and DCG@k);
Normalised Discounted Cumulative Gain (NDCG and NDCG@k);
Precision@k; or
Expected Rank Utility.
4. The method according to claim 2 or 3, wherein the performance measure is multivariate.
5. The method according to any one of claims 2, 3 or 4, wherein the method is performed for multiple pairs of a query and set of items and the aggregated loss is the sum of the loss of the performance measure for each set of items.
6. The method according to any one of claims 2 to 5, wherein the loss of a performance measure for a query and a set of items is the inner product of values determined by the performance measure on the ordered list and values determined by the performance measure on the corresponding predetermined preferred ordered list.
7. The method according to claim 6, wherein the loss of a performance measure for a query and a set of items is given by:
Δ(π, y) := a(1)^T b(y) − a(π)^T b(y)

where y is sorted in decreasing order, a relates to the position of an item in the ordered list as given by the performance measure, b relates to the score of the item as given by the performance measure, π is the permutation of the set and 1 denotes the identity permutation.
8. The method according to any one of claims 2 to 7, wherein the step of calculating the performance measure of the ordered list comprises calculating the performance measure of only a portion of the list, such as the highest scoring or lowest scoring items.
9. A method according to any one of the preceding claims, wherein the scoring function provides a score for each item, and the ordered list of items is based on the score for each item.
10. The method according to any one of the preceding claims, wherein the scoring function is position dependent.
11. The method according to any one of the preceding claims, wherein the scoring function is a member of any one of the following function classes: neural networks, decision trees or rule-based systems.
12. The method according to any one of the preceding claims, wherein the set of items comprises a first graph where each item is a node of the graph and the scoring function produces a new label for each node and the items are ordered according to the new label.
13. The method according to claim 12, wherein the new label of the node is based on the position of the node within the first graph and the edges between nodes in the ordered list.
14. The method according to any one of the preceding claims, wherein minimising the regularised convex upper bound of the aggregated loss is defined by an optimisation problem having in its objective function C times the sum of upper bounds ξ_i on the losses incurred by each query and set of items, plus a regulariser ½‖w‖² which ensures that the scoring function is smooth.
15. The method according to claim 14, wherein the optimisation problem further comprises inequalities that ensure that the nonconvex loss Δ(π, y_i) is upper bounded in a convex fashion.
16. The method according to claim 15, wherein the optimisation problem is given by:

minimise_w  ½‖w‖² + C Σ_{i=1}^{l} ξ_i

subject to  w^T (Φ(D_i, q_i, 1) − Φ(D_i, q_i, π_i)) ≥ Δ(π_i, y_i) − ξ_i   (I)

for all ξ_i ≥ 0, π_i ∈ Π and 1 ≤ i ≤ l,

where Φ(D_i, q_i, π_i) is a so-called feature map, w is a parameter vector and l is the number of queries or sets of items.
17. The method according to claim 16, where the minimisation of the regularised convex upper bound of the aggregated loss is carried out directly via equation (I) and includes finding the weakest aspect of the current choice of the scoring function on the preferred ordered list.
18. The method according to any one of claims 14 to 17, wherein minimising the regularised convex upper bound of the aggregated loss comprises optimising Wolfe's dual optimisation problem using Lagrange multipliers.
20. The method according to claim 18, wherein the dual optimisation problem is given by:

minimise_α  ½‖Σ_{i,π} α_{iπ} (Φ(D_i, q_i, 1) − Φ(D_i, q_i, π))‖² − Σ_{i,π} α_{iπ} Δ(π, y_i)   (IIa)

subject to  Σ_{π∈Π} α_{iπ} ≤ C and α_{iπ} ≥ 0 for all 1 ≤ i ≤ l.   (IIb)
20. The method according to any one of the preceding claims, wherein minimising the regularised convex upper bound of the aggregated loss includes repeatedly finding the weakest aspect of a current choice of the scoring function and minimising the aggregated loss based on this weak aspect of the scoring function.
21. The method according to claim 20, wherein minimising the aggregated loss based on the weak aspect of the scoring function is performed using column generation.
22. The method according to any one of the preceding claims, wherein the step of approximately minimising the aggregated loss by minimising the regularised convex upper bound of the aggregated loss comprises using linear assignment.
23. The method according to any one of the preceding claims, wherein minimising the regularised convex upper bound of the aggregated loss involves dealing with a large number of constraints, and the method further comprises finding an ordered list that provides the maximal violator of the constraints, which is found by solving a linear or quadratic assignment problem.
24. The method according to claim 23, wherein the linear assignment problem is solved using the Hungarian Marriage method.
25. The method according to any one of claims 23 or 24, wherein when an ordered list is found not to be the maximal violator of the constraint, the method comprises decreasing the regularised upper bound on the loss by quadratic programming.
26. The method according to any one of the preceding claims, wherein the step of approximately minimising the aggregated loss by minimising the regularised convex upper bound of the aggregated loss comprises using approximation algorithms, such as the Graduated Assignment algorithm.
27. The method according to any one of the preceding claims, wherein the optimised scoring function is applied to an Internet Search Engine to rank the display of web-pages that match a user's query.
28. The method according to any one of claims 1 to 27, wherein the optimised scoring function is applied to information retrieval tasks, collaborative filtering, recommender systems, authorship identification queries, ranking advertisements or document summarisation tasks.
29. The method according any one of claims 1 to 27, wherein the optimised scoring function is applied to a graph matching problem.
30. The method according to claim 29, where the preferred ordered list is comprised of nodes of a first graph, and the set of items are nodes of a second graph, where each node of the second graph is to be matched to a node of the first graph.
31. The method according to claim 30, where the query is based on each node of the second graph.
32. The method according to any one of the preceding claims, wherein the predetermined preferred ordered list is part of a training data set, such that each predetermined preferred ordered list has a corresponding query and score for each item in the list.
33. The method according to any one of the preceding claims, wherein, using the optimised scoring function, steps (a) to (d) are repeated on the same or a different pair of a query and set of items.
34. The method according to any one of the preceding claims, wherein the minimisation is carried out by using a query and a set of items one at a time.
35. A computer system programmed to perform the method according to any one of claims 1 to 34.
36. Application software, that when installed on a computer system, is able to control the computer system to perform the method according to any one of claims 1 to
34.
PCT/AU2007/001047 2006-08-10 2007-07-27 Optimisation of a scoring function Ceased WO2008017103A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
AU2006904347 2006-08-10
AU2006904347A AU2006904347A0 (en) 2006-08-10 Optimized Web Page Ranking

Publications (1)

Publication Number Publication Date
WO2008017103A1 true WO2008017103A1 (en) 2008-02-14

Family

ID=39032533

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2007/001047 Ceased WO2008017103A1 (en) 2006-08-10 2007-07-27 Optimisation of a scoring function

Country Status (1)

Country Link
WO (1) WO2008017103A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8478748B2 (en) 2008-09-24 2013-07-02 Microsoft Corporation Directly optimizing evaluation measures in learning to rank
US20140372202A1 (en) * 2013-06-17 2014-12-18 Google Inc. Predicting performance of content items using loss functions
CN107688595A (en) * 2017-05-10 2018-02-13 平安科技(深圳)有限公司 Information retrieval Accuracy Evaluation, device and computer-readable recording medium
CN114895694A (en) * 2022-04-24 2022-08-12 中南大学 Decision tree search-based spacecraft cluster confrontation target distribution method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020198875A1 (en) * 2001-06-20 2002-12-26 Masters Graham S. System and method for optimizing search results
US20030078914A1 (en) * 2001-10-18 2003-04-24 Witbrock Michael J. Search results using editor feedback
EP1640882A2 (en) * 2004-09-24 2006-03-29 Microsoft Corporation System and method for customising and sharing search preferences
US20060136377A1 (en) * 2004-12-16 2006-06-22 Boaz Patt-Shamir Computer method and apparatus for collaborative web searches
US20060224577A1 (en) * 2005-03-31 2006-10-05 Microsoft Corporation Automated relevance tuning




Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 07784692; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
NENP Non-entry into the national phase (Ref country code: RU)
122 Ep: pct application non-entry in european phase (Ref document number: 07784692; Country of ref document: EP; Kind code of ref document: A1)