Data and Problem
The data sets contains (to my interpretation) per line:
- url – implicitly encoded as line number in the data set file
- relevance – low number=high relevance and vice versa
- query – represented as an id
- features – up to several hundreds
and the problem is to find a function that gives relevance numbers per url per query id.
In dataset 1 there are ~473k URLs and ~19k queries. At first I thought this meant that there are in average 473/19 ~ 24 relevance numbers for each query (see actual distribution of counts in figure below), i.e. corresponding to search result 1 to 24, but it seems like there are several URLs per unique query that has the same relevance (e.g. URLx and URLy both can have relevance 2 for queryZ). The paper Learning to Rank with Ties seems potentially relevant to deal with this.
Multiple URLs that shares relevance for a unique query can perhaps be due to:
- similar/duplicate content between the URLs?
- a frequent query (due to sampling of examples?)
- uncertainty about which URL to select for particular a relevance and query?
- there is a tie, i.e. they are equally relevant
Potential classification approach?
From a classification perspective there are several (perhaps naive?) approaches that could be tried out:
- Use relevance levels as classes (nominal regression) and use a multiclass-classifier
- Train classifier as binary competition within query, i.e. relevance 1 against 2, 3, .., and relevance n against n+1, .. (probably get some sparsity problems due to this)
- Binary competition across queries, but is problematic due to that a relevance of 4 for one query could be more relevant than a relevance of 1 for a another query (and there is no easy way to determine that directly from the data), but if the observation related to multiple URLs per relevance level per query (see above) is caused by uncertainty one could perhaps use 1/(number of URLs per relevance level per query) as a weight to either:
- support training across queries, e.g. a URL for a query with relevance 1 is better that another query of relevance 1 with 37 URLs of that relevance, this approach could perhaps be used somehow using regression? The problem is to compare against different relevance levels, e.g. is a relevance 2 for a query with 1 url more confident than one of relevance 1 for a query with 37 URLs?
- use a classifier that supports weighing examples and the approach in 1 or 2.
More about ranking with machine learning?
Check out the learning to rank bibliography.
Related articles by Zemanta
- Twitter Has 24 Billion Search Queries Per Month (penn-olson.com)
- Bing’s Categorized Search Results (seobythesea.com)
- Top Rankings for Facebook Pages (searchengineguide.com)
Amund Tveit, co-founder of Atbrox