Feb 28

Atbrox is startup company providing technology and services for Search and Mapreduce/Hadoop. Our background is from Google, IBM and Research.

Yahoo recently announced the Learning to Rank Challenge – a pretty interesting web search challenge (as the somewhat similar Netflix Prize Challenge also was).

Data and Problem
The data sets contains (to my interpretation) per line:

  1. url – implicitly encoded as line number in the data set file
  2. relevance – low number=high relevance and vice versa
  3. query – represented as an id
  4. features – up to several hundreds

and the problem is to find a function that gives relevance numbers per url per query id.

Initial Observation
In dataset 1 there are ~473k URLs and ~19k queries. At first I thought this meant that there are in average 473/19 ~ 24 relevance numbers for each query (see actual distribution of counts in figure below), i.e. corresponding to search result 1 to 24, but it seems like there are several URLs per unique query that has the same relevance (e.g. URLx and URLy both can have relevance 2 for queryZ). The paper Learning to Rank with Ties seems potentially relevant to deal with this.

Multiple URLs that shares relevance for a unique query can perhaps be due to:

  1. similar/duplicate content between the URLs?
  2. a frequent query (due to sampling of examples?)
  3. uncertainty about which URL to select for particular a relevance and query?
  4. there is a tie, i.e. they are equally relevant

Potential classification approach?
From a classification perspective there are several (perhaps naive?) approaches that could be tried out:

  1. Use relevance levels as classes (nominal regression) and use a multiclass-classifier
  2. Train classifier as binary competition within query, i.e. relevance 1 against 2, 3, .., and relevance n against n+1, .. (probably get some sparsity problems due to this)
  3. Binary competition across queries, but is problematic due to that a relevance of 4 for one query could be more relevant than a relevance of 1 for a another query (and there is no easy way to determine that directly from the data), but if the observation related to multiple URLs per relevance level per query (see above) is caused by uncertainty one could perhaps use 1/(number of URLs per relevance level per query) as a weight to either:
    1. support training across queries, e.g. a URL for a query with relevance 1 is better that another query of relevance 1 with 37 URLs of that relevance, this approach could perhaps be used somehow using regression? The problem is to compare against different relevance levels, e.g. is a relevance 2 for a query with 1 url more confident than one of relevance 1 for a query with 37 URLs?
    2. use a classifier that supports weighing examples and the approach in 1 or 2.

More about ranking with machine learning?
Check out the learning to rank bibliography.

Enhanced by Zemanta

Best regards,

Amund Tveit

Tagged with:
Feb 17

Atbrox is startup company providing technology and services for Search and Mapreduce/Hadoop. Our background is from Google, IBM and Research.

Hadoop is a set of open source technologies that supports reliable and cost-efficient ways of dealing with large amounts of data. Given the vast amounts of business critical and required data companies gather (e.g. required due to Sarbanes–Oxley (SOX) or EU Data Retention Directive), Hadoop becomes increasingly relevant.

Hadoop Technologies

Several Hadoop technologies are inspired by Google’s infrastructure.

1. Processing and Storage

1.1 Processing – Mapreduce
Mapreduce can be used to process and extract knowledge from arbitrary amounts of data, e.g. web data, measurement data or financial transactions – Visa reduced their processing time for transactional statistics from 1 month to 13 minutes with Hadoop. In order to use Mapreduce developers need to parallelize their problem and program against an API – here for an example of machine learning with Hadoop. Hadoop’s Mapreduce is inspired by the paper MapReduce: Simplified Data Processing on Large Clusters.

1.2 File Storage – HDFS
HDFS is scalable and distributed file system. It supports configurable degree of replication for reliable storage even when running on cheap hardware. HDFS is inspired by the paper The Google File System

1.3 Database – HBase
HBase is a distributed database that supports storing billions of rows with millions of columns that runs on top of HDFS. HBase can replace traditional databases if they get problems scaling or become to expensive licence-wise, see this presentation about Hbase. HBase is inspired by the paper Bigtable: A Distributed Storage System for Structured Data

2. Data Analysis

Mapreduce can be used to analyze all kinds of data (e.g. text, multimedia, numerical data) and have high flexibility, but for more structured data the following Hadoop Technologies can be used:

2.1 Pig
SQL-like language/system running on top of Mapreduce. Pig is developed by Yahoo and inspired by the paper Interpreting the Data: Parallel Analysis with Sawzall

2.2 Hive
Datawarehouse running on top of Hadoop, developed by Facebook. Query language is very similar to SQL.

3. Distributed Systems Development

3.1 Avro
Avro is used for efficient serialization of data and communication between services. It is in several ways similar to Google’s protocolbuffers and Facebook’s Thrift.

3.2 Zookeeper
Coordination between distributed processes. It is inspired by the paper The Chubby lock service for loosely-coupled distributed systems

3.3 Chukwa
Monitoring of distributed systems.


Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce – info@atbrox.com. See our posting for an example parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce.

Best regards,

Amund Tveit

Tagged with:
Feb 12

The newest and most up-to-date version (May 2010) this blog post is available at http://mapreducebook.org

Atbrox is startup company providing technology and services for Search and Mapreduce/Hadoop. Our background is from from Google, IBM and Research.

This posting is an update to the similar posting from October 2009, roughly doubling the numbers of papers from the previous posting, the new ones are marked with *

Motivation
Learn from academic literature about how the mapreduce parallel model and hadoop implementation is used to solve algorithmic problems.

Which areas do the papers cover?

Who wrote the above papers? (section added 20100307)
Companies: China Mobile, eBay, Google, Hewlett Packard and Intel, Microsoft, Wikipedia, Yahoo and Yandex.
Government Institutions and Universities: US National Security Agency (NSA)
, Carnegie Mellon University, TU Dresden, University of Pennsylvania, University of Central Florida, National University of Ireland, University of Missouri, University of Arizona, University of Glasgow, Berkeley University and National Tsing Hua University, University of California, Poznan University, Florida International University, Zhejiang University, Texas A&M University, University of California at Irvine, University of Illinois, Chinese Academy of Sciences, Vrije Universiteit, Engenharia University, State University of New York, Palacky University, University of Texas at Dallas


Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce – info@atbrox.com. See our previous posting for an example parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce

Tagged with:
preload preload preload
Blog WebMastered by All in One Webmaster.