The newest and most up-to-date version (May 2010) of this blog post is available at http://mapreducebook.org
Atbrox is a startup company providing technology and services for Search and Mapreduce/Hadoop. Our background is from Google, IBM and Research.
This posting is an update to the similar posting from October 2009. It roughly doubles the number of papers from the previous posting; the new ones are marked with *.
Motivation
Learn from the academic literature how the MapReduce parallel model and its Hadoop implementation are used to solve algorithmic problems.
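For readers new to the model, here is a minimal single-process sketch of the map/reduce pattern that the papers below build on. It is not Hadoop code and is not taken from any of the listed papers; the names run_mapreduce, map_words and reduce_counts are hypothetical helpers made up for illustration, and word counting is used only because it is the customary toy example.

```python
from collections import defaultdict

def map_words(doc_id, text):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1

def reduce_counts(word, counts):
    """Reduce step: sum all the values emitted for one key."""
    return word, sum(counts)

def run_mapreduce(records, mapper, reducer):
    """Toy in-memory driver (an assumption, not a Hadoop API):
    apply the mapper, group values by key, then apply the reducer."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # stands in for the shuffle phase
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))

docs = [("d1", "hadoop runs mapreduce"), ("d2", "mapreduce scales")]
word_counts = run_mapreduce(docs, map_words, reduce_counts)
print(word_counts)
```

In a real cluster the grouping step is done by the framework across machines; the mapper and reducer are the only parts a paper's algorithm normally has to define.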
Which areas do the papers cover?
 Bioinformatics/Medical Informatics
* MapReduce-Based Pattern Finding Algorithm Applied in Motif Detection for Prescription Compatibility Network (2009)
* MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees
Machine Translation
Grammar based statistical MT on Hadoop (2009)
Large Language Models in Machine Translation (2008)
Spatial Data Processing
* Experiences on Processing Spatial Data with MapReduce
Information Extraction and Text Processing
* Data-intensive text processing with MapReduce
* Web-Scale Distributional Similarity and Entity Set Expansion (2009)
The infinite HMM for unsupervised PoS tagging (2009)
Artificial Intelligence/Machine Learning/Data Mining
* Residual Splash for Optimally Parallelizing Belief Propagation
* Stochastic gradient boosted distributed decision trees
* Distributed Algorithms for Topic Models
* When Huge is Routine: Scaling Genetic Algorithms and Estimation of Distribution Algorithms via Data-Intensive Computing
* Cloud Computing Boosts Business Intelligence of Telecommunication Industry
* Parallel K-Means Clustering Based on MapReduce
* Large-scale multimedia semantic concept modeling using robust subspace bagging and MapReduce
* Parallel algorithms for mining large-scale rich-media data
* Scaling Simple and Compact Genetic Algorithms using MapReduce
* Scalable Distributed Reasoning using Mapreduce
Scaling Up Classifiers to Cloud Computers (2008)

For an example of Parallel Machine Learning with Hadoop/Mapreduce, check out our previous blog post.
Ads Analysis
Large-Scale Behavioral Targeting (2009)
Search Advertising using Web Relevance Feedback (2008)
Predicting Ads’ Click-Through Rate with Decision Rules (2008)
Search Query Analysis
BBM: Bayesian Browsing Model from Petabyte-scale Data (2009)
AIDE: Ad-hoc Intents Detection Engine over Query Logs (2009)
Information Retrieval (Search)
* Efficient Clustering of Web Derived Data Sets
* The PageRank algorithm and application on searching of academic papers
* A Parallel Algorithm for Finding Related Pages in the Web by Using Segmented Link Structures
On Single-Pass Indexing with MapReduce (2009)
A Data Parallel Algorithm for XML DOM Parsing (2009)
Semantic Sitemaps: Efficient and Flexible Access to Datasets on the Semantic Web (2008)
Spam & Malware Detection
Characterizing Botnets from Email Spam Records (2008)
– Clustering of emails into spam campaigns
– Finding the probability that 2 spam messages are sent from the same machine
– Estimating the likelihood of botnets based on common senders in spam campaigns
The Ghost In The Browser: Analysis of Web-based Malware (2007)
Image and Video Processing
MapReduce Optimization Using Regulated Dynamic Prioritization (2009)
– Video Stream Re-Rendering
MapReduce Meets Wider Varieties of Applications (2008)
– Location detection in images
Networking
Reducible Complexity in DNS
Simulation
MapReduce Meets Wider Varieties of Applications (2008)
– Simulation of earthquakes (geology)
Statistics
* Brute Force and Indexed Approaches to Pairwise Document Similarity Comparisons with MapReduce (2009)
Fast Parallel Outlier Detection for Categorical Datasets using Mapreduce (2009)
MapReduce Optimization Using Regulated Dynamic Prioritization (2009)
– Digg.com story recommendations
Calculating the Jaccard Similarity Coefficient with Map Reduce for Entity Pairs in Wikipedia (2008)
– Measuring Wikipedia Editor similarity
MapReduce Meets Wider Varieties of Applications (2008)
– Netflix video recommendation
Large-scale Parallel Collaborative Filtering for the Netflix Prize (2008)
Numerical Mathematics
* Mapreduce for Integer Factorization
Graphs
* Distributed Algorithm for Computing Formal Concepts Using MapReduce Framework
* Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
* Graph Twiddling in a MapReduce World
DOULION: Counting Triangles in Massive Graphs with a Coin (2009)
Fast counting of triangles in real-world networks: proofs, algorithms and observations (2008)
Who wrote the above papers? (section added 2010-03-07)
Companies: China Mobile, eBay, Google, Hewlett Packard and Intel, Microsoft, Wikipedia, Yahoo and Yandex.
Government Institutions and Universities: US National Security Agency (NSA), Carnegie Mellon University, TU Dresden, University of Pennsylvania, University of Central Florida, National University of Ireland, University of Missouri, University of Arizona, University of Glasgow, Berkeley University and National Tsing Hua University, University of California, Poznan University, Florida International University, Zhejiang University, Texas A&M University, University of California at Irvine, University of Illinois, Chinese Academy of Sciences, Vrije Universiteit, Engenharia University, State University of New York, Palacky University, University of Texas at Dallas
Best regards,
Amund Tveit, co-founder of Atbrox
See also:
http://wiki.apache.org/hadoop/Papers
Theoretical models for MapReduce are:
A Model of Computation for MapReduce (2010)
and On the Complexity of Processing Massive, Unordered, Distributed Data (2008)
See also:
http://www.umiacs.umd.edu/~jimmylin/book.html
Never mind, I just noticed that you already have it there.
This one by Afrati and Ullman is also interesting:
http://portal.acm.org/citation.cfm?id=1739041.1739056
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.130.5904&rep=rep1&type=pdf