Feb 28

Yesterday Yahoo announced the Learning to Rank Challenge – a pretty interesting challenge (as the somewhat similar Netflix Prize Challenge also was).

Data and Problem
The data sets contains (to my interpretation) per line:

  1. url – implicitly encoded as line number in the data set file
  2. relevance – low number=high relevance and vice versa
  3. query – represented as an id
  4. features – up to several hundreds

and the problem is to find a function that gives relevance numbers per url per query id.

Initial Observation
In dataset 1 there are ~473k URLs and ~19k queries. At first I thought this meant that there are in average 473/19 ~ 24 relevance numbers for each query (see actual distribution of counts in figure below), i.e. corresponding to search result 1 to 24, but it seems like there are several URLs per unique query that has the same relevance (e.g. URLx and URLy both can have relevance 2 for queryZ). The paper Learning to Rank with Ties seems potentially relevant to deal with this.

Multiple URLs that shares relevance for a unique query can perhaps be due to:

  1. similar/duplicate content between the URLs?
  2. a frequent query (due to sampling of examples?)
  3. uncertainty about which URL to select for particular a relevance and query?
  4. there is a tie, i.e. they are equally relevant

Potential classification approach?
From a classification perspective there are several (perhaps naive?) approaches that could be tried out:

  1. Use relevance levels as classes (nominal regression) and use a multiclass-classifier
  2. Train classifier as binary competition within query, i.e. relevance 1 against 2, 3, .., and relevance n against n+1, .. (probably get some sparsity problems due to this)
  3. Binary competition across queries, but is problematic due to that a relevance of 4 for one query could be more relevant than a relevance of 1 for a another query (and there is no easy way to determine that directly from the data), but if the observation related to multiple URLs per relevance level per query (see above) is caused by uncertainty one could perhaps use 1/(number of URLs per relevance level per query) as a weight to either:
    1. support training across queries, e.g. a URL for a query with relevance 1 is better that another query of relevance 1 with 37 URLs of that relevance, this approach could perhaps be used somehow using regression? The problem is to compare against different relevance levels, e.g. is a relevance 2 for a query with 1 url more confident than one of relevance 1 for a query with 37 URLs?
    2. use a classifier that supports weighing examples and the approach in 1 or 2.

Conclusion
Still have more questions than answers, so next step is the learning to rank bibliography.

Tagged with:
Feb 12

This posting is an update to the similar posting from October 2009, roughly doubling the numbers of papers from the previous posting, the new ones are marked with *

Motivation
Learn from academic literature about how the mapreduce parallel model and hadoop implementation is used to solve algorithmic problems.

Which areas do the papers cover?

Who wrote the above papers? (section added 20100307)
Companies: China Mobile, eBay, Google, Hewlett Packard and Intel, Microsoft, Wikipedia, Yahoo and Yandex.
Government Institutions and Universities: US National Security Agency (NSA)
, Carnegie Mellon University, TU Dresden, University of Pennsylvania, University of Central Florida, National University of Ireland, University of Missouri, University of Arizona, University of Glasgow, Berkeley University and National Tsing Hua University, University of California, Poznan University, Florida International University, Zhejiang University, Texas A&M University, University of California at Irvine, University of Illinois, Chinese Academy of Sciences, Vrije Universiteit, Engenharia University, State University of New York, Palacky University, University of Texas at Dallas


Do you need help with Hadoop/Mapreduce?

Contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce – info@atbrox.com. See our previous posting for an example parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce
Tagged with:
Nov 04

Back in May 2000 I wrote that “It seems likely that the specialization in the Internet Information Retrieval (IIR) business will continue. Internet information crawling, pre-processing, indexing, searching and presentation requires different types of technologies and know-how, this might create opportunities for new companies specializing in only one step of the IIR “food chain”"

80legs

80legs is a company specializing in the crawling and preprocessing part, where you can upload your seed urls (where to start crawling), configure your crawl job (depth, domain restrictions etc.) and also run existing or custom analysis code (upload java jar-files) on the fetched pages. When you upload seed files 80legs does some filtering before starting to crawl (e.g. if you have seed urls which are not well-formed), and also handles domain throttling and robots.txt (and perhaps other things).

Computational model: Since you can run custom code per page it can be seen as a mapper part of a MapReduce (Hadoop) job (one map() call per page); for reduce-type processing (over several pages) you need to move your data elsewhere (e.g. EC2 in the cloud). Side note: another domain with “reduce-less” mapreduce is quantum computing, check out Michael Nilsen’s Quantum Computing for Everyone.

Testing 80legs

Note: We have only tried with the built-in functionality and no custom code so far.

1) URL extraction

Job description: We used a seed of approximately 1,000 URLs and crawled and analyzed ~2.8 million pages within those domains. The regexp configuration was used (we only provided the URL matching regexp).

Result: Approximately 1 billion URLs were found, and results came in 106 zip-files (each ~14MB packed and ~100MB unpacked) in addition to zip files of the URLs that where crawled.

Note: Based on a few smaller similar jobs it looks like the parallelism of 80legs is somewhat dependent of the number of domains in the crawl and perhaps also on their ordering. In case you have a set of URLs where each domain has more than one URL it can be useful to randomize your seed URL file before uploading and running the crawl job, e.g. by using rl or coreutil’s shuf.

2) Fetching pages

Job description: We built a set of URLs – ~80k URLs that we wanted to fetch as html (using their sample application called 80App Get Raw HTML) for further processing. The URLs were split into 4 jobs of ~20k URLs each.

Result: Each job took roughly one hour (they all ran in parallel so the total time spent was 1 hour). We ended up with 5 zip files per job, each zip file having ~25MB of data (100MB unpacked), i.e. ~4*5*100MB = 2GB raw html when unpacked for all jobs.

Conclusion

80legs is an interesting service that has already proved useful for us, and we will continue to use it in combination with AWS and EC2. Custom code needs to be built (e.g. related to ajax crawling).

(May 2000 – A few thoughts about the future of Internet Information Retrieval)

Tagged with:
preload preload preload