May 25
Below are statistics on the 20 papers (out of about 80) that were most read in our 3 previous postings about mapreduce and hadoop algorithms (the postings have been read approximately 5000 times). The list is ordered by decreasing reading frequency, i.e. the most popular paper is at spot 1.
- MapReduce-Based Pattern Finding Algorithm Applied in Motif Detection for Prescription Compatibility Network
authors: Yang Liu, Xiaohong Jiang, Huajun Chen, Jun Ma and Xiangyu Zhang – Zhejiang University
- Data-intensive text processing with Mapreduce
authors: Jimmy Lin and Chris Dyer – University of Maryland
- Large-Scale Behavioral Targeting
authors: Ye Chen (eBay), Dmitry Pavlov (Yandex Labs) and John F. Canny (University of California, Berkeley)
- Improving Ad Relevance in Sponsored Search
authors: Dustin Hillard, Stefan Schroedl, Eren Manavoglu, Hema Raghavan and Chris Leggetter (Yahoo Labs)
- Experiences on Processing Spatial Data with MapReduce
authors: Ariel Cary, Zhengguo Sun, Vagelis Hristidis and Naphtali Rishe – Florida International University
- Extracting user profiles from large scale data
authors: Michal Shmueli-Scheuer, Haggai Roitman, David Carmel, Yosi Mass and David Konopnicki – IBM Research, Haifa
- Predicting the Click-Through Rate for Rare/New Ads
authors: Kushal Dave and Vasudeva Varma – IIIT Hyderabad
- Parallel K-Means Clustering Based on MapReduce
authors: Weizhong Zhao, Huifang Ma and Qing He – Chinese Academy of Sciences
- Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
authors: Mohammad Farhan Husain, Pankil Doshi, Latifur Khan and Bhavani Thuraisingham – University of Texas at Dallas
- Map-Reduce Meets Wider Varieties of Applications
authors: Shimin Chen and Steven W. Schlosser – Intel Research
- LogMaster: Mining Event Correlations in Logs of Large-scale Cluster Systems
authors: Wei Zhou, Jianfeng Zhan, Dan Meng (Chinese Academy of Sciences), Dongyan Xu (Purdue University) and Zhihong Zhang (China Mobile Research)
- Efficient Clustering of Web-Derived Data Sets
authors: Luís Sarmento, Eugenio Oliveira (University of Porto), Alexander P. Kehlenbeck (Google) and Lyle Ungar (University of Pennsylvania)
- A novel approach to multiple sequence alignment using hadoop data grids
authors: G. Sudha Sadasivam and G. Baktavatchalam – PSG College of Technology
- Web-Scale Distributional Similarity and Entity Set Expansion
authors: Patrick Pantel, Eric Crestan, Ana-Maria Popescu, Vishnu Vyas (Yahoo Labs) and Arkady Borkovsky (Yandex Labs)
- Grammar based statistical MT on Hadoop
authors: Ashish Venugopal and Andreas Zollmann (Carnegie Mellon University)
- Distributed Algorithms for Topic Models
authors: David Newman, Arthur Asuncion, Padhraic Smyth and Max Welling – University of California, Irvine
- Parallel algorithms for mining large-scale rich-media data
authors: Edward Y. Chang, Hongjie Bai and Kaihua Zhu – Google Research
- Learning Influence Probabilities In Social Networks
authors: Amit Goyal, Laks V. S. Lakshmanan (University of British Columbia) and Francesco Bonchi (Yahoo! Research)
- MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees
authors: Suzanne J Matthews and Tiffani L Williams – Texas A&M University
- User-Based Collaborative-Filtering Recommendation Algorithms on Hadoop
authors: Zhi-Dan Zhao and Ming-sheng Shang

Best regards,
Amund Tveit (Atbrox co-founder)
Tagged with: algorithms • china mobile • google • hadoop • mapreduce • yahoo • yandex • zhejiang university
May 08
Atbrox is a startup company providing technology and services for Search and Mapreduce/Hadoop. Our background is from Google, IBM and research. Contact us if you need help with algorithms for mapreduce.
This posting is the May 2010 update to the similar posting from February 2010. It has 30 new papers compared to the prior posting; the new ones are marked with *.
Motivation
Learn from academic literature how the mapreduce parallel model and the hadoop implementation are used to solve algorithmic problems.
Which areas do the papers cover?
Who wrote the above papers?
Companies: China Mobile, eBay, Google, Hewlett Packard, Intel, Microsoft, Wikipedia, Yahoo and Yandex.
Government Institutions and Universities: US National Security Agency (NSA), Carnegie Mellon University, TU Dresden, University of Pennsylvania, University of Central Florida, National University of Ireland, University of Missouri, University of Arizona, University of Glasgow, Berkeley University, National Tsing Hua University, University of California, Poznan University, Florida International University, Zhejiang University, Texas A&M University, University of California at Irvine, University of Illinois, Chinese Academy of Sciences, Vrije Universiteit, Engenharia University, State University of New York, Palacky University and University of Texas at Dallas.

Best regards,
Amund Tveit (Atbrox co-founder)
Tagged with: google • hadoop • machinelearning • mapreduce • yahoo
Feb 28
Atbrox is a startup company providing technology and services for Search and Mapreduce/Hadoop. Our background is from Google, IBM and research.
Yahoo recently announced the Learning to Rank Challenge – a pretty interesting web search challenge (as the somewhat similar Netflix Prize Challenge also was).
Data and Problem
The data sets contain (in my interpretation) per line:
- url – implicitly encoded as line number in the data set file
- relevance – low number=high relevance and vice versa
- query – represented as an id
- features – up to several hundreds
and the problem is to find a function that gives relevance numbers per url per query id.
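As a minimal sketch of reading such a file, assume an SVMlight-style line layout (`relevance qid:<id> <feature>:<value> ...`). That layout is an assumption, not something the description above confirms. The names `Example` and `parse_line` and the sample lines are hypothetical.

```python
from typing import Dict, NamedTuple


class Example(NamedTuple):
    url_index: int              # line number, standing in for the url
    relevance: int
    query_id: str
    features: Dict[int, float]  # sparse: feature index -> value


def parse_line(line: str, url_index: int) -> Example:
    """Parse one line, ASSUMING the layout 'relevance qid:<id> <feat>:<val> ...'."""
    tokens = line.split()
    relevance = int(tokens[0])
    query_id = tokens[1].split(":", 1)[1]
    features = {}
    for token in tokens[2:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return Example(url_index, relevance, query_id, features)


# Made-up sample lines for illustration only, not taken from the real data set.
sample = ["2 qid:1 10:0.89 11:0.75", "1 qid:1 10:0.12", "0 qid:2 11:0.40 699:0.5"]
examples = [parse_line(line, i) for i, line in enumerate(sample, start=1)]
```

Keeping the features in a dictionary means that absent features cost nothing, which seems reasonable when there are up to several hundred features per line.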
Initial Observation
In dataset 1 there are ~473k URLs and ~19k queries. At first I thought this meant that there are on average 473/19 ~ 24 relevance numbers for each query (see actual distribution of counts in figure below), i.e. corresponding to search results 1 to 24, but it seems that several URLs per unique query can have the same relevance (e.g. URLx and URLy can both have relevance 2 for queryZ). The paper Learning to Rank with Ties seems potentially relevant for dealing with this.
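The two counts discussed here (URLs per query, and URLs that share a relevance level within a query) could be computed roughly as follows. This is only a sketch on made-up `(query_id, relevance)` pairs; the variable names are hypothetical.

```python
from collections import Counter

# Made-up (query_id, relevance) pairs, one per url; not taken from the real data.
rows = [("q1", 1), ("q1", 2), ("q1", 2), ("q2", 1), ("q2", 1), ("q2", 1), ("q2", 3)]

urls_per_query = Counter(query_id for query_id, _ in rows)
urls_per_level = Counter(rows)  # keyed by (query_id, relevance)

# Average number of urls (and hence relevance numbers) per query.
average_urls = len(rows) / len(urls_per_query)

# Relevance levels that more than one url shares within the same query.
tied_levels = {key: count for key, count in urls_per_level.items() if count > 1}

print(f"average urls per query: {average_urls:.1f}")
print(f"relevance levels shared by several urls: {tied_levels}")
```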

Multiple URLs that share the same relevance for a unique query can perhaps be due to:
- similar/duplicate content between the URLs?
- a frequent query (due to sampling of examples?)
- uncertainty about which URL to select for a particular relevance and query?
- there is a tie, i.e. they are equally relevant
Potential classification approach?
From a classification perspective there are several (perhaps naive?) approaches that could be tried out:
- Use relevance levels as classes (nominal regression) and use a multiclass-classifier
- Train a classifier as a binary competition within each query, i.e. relevance 1 against 2, 3, .., and relevance n against n+1, .. (this will probably cause some sparsity problems)
- Binary competition across queries. This is problematic because a relevance of 4 for one query could be more relevant than a relevance of 1 for another query (and there is no easy way to determine that directly from the data). However, if the multiple URLs per relevance level per query (see above) are caused by uncertainty, one could perhaps use 1/(number of URLs per relevance level per query) as a weight to either:
- support training across queries, e.g. a URL that is alone at relevance 1 for its query is better than a URL from another query where 37 URLs share relevance 1. This approach could perhaps be combined with regression somehow. The problem is comparing across different relevance levels, e.g. is a relevance 2 for a query with 1 URL more confident than a relevance 1 for a query with 37 URLs?
- use a classifier that supports weighting examples, combined with the approach in 1 or 2.
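Two of the ideas in this list can be sketched in a few lines: the weight 1/(number of URLs per relevance level per query), and the binary competition within a query. This is an illustration under assumptions rather than a tested approach. The function names `tie_weights` and `within_query_pairs` and the rows are hypothetical, and the pair ordering follows the reading above that a lower number means higher relevance.

```python
from collections import Counter
from itertools import combinations

# Made-up (url_index, query_id, relevance) rows for illustration only.
rows = [(1, "q1", 1), (2, "q1", 2), (3, "q1", 2), (4, "q2", 1)]


def tie_weights(rows):
    """Weight each url by 1 / (number of urls sharing its query and relevance)."""
    level_size = Counter((query_id, relevance) for _, query_id, relevance in rows)
    return {url: 1.0 / level_size[(query_id, relevance)]
            for url, query_id, relevance in rows}


def within_query_pairs(rows):
    """Yield (better_url, worse_url) for urls of the same query whose relevance
    differs, ASSUMING a lower relevance number means more relevant."""
    for (u1, q1, r1), (u2, q2, r2) in combinations(rows, 2):
        if q1 == q2 and r1 != r2:
            yield (u1, u2) if r1 < r2 else (u2, u1)


weights = tie_weights(rows)
pairs = list(within_query_pairs(rows))
```

The weights could then be passed as per-example weights to any learner that accepts them (many scikit-learn estimators, for instance, take a `sample_weight` argument in `fit`). Tied URLs produce no pair at all, which is one way the sparsity concern mentioned above could show up.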
More about ranking with machine learning?
Check out the learning to rank bibliography.
Best regards,
Amund Tveit
Tagged with: classification • machine learning • netflix • ranking • regression • relevance • search • yahoo