**The newest and most up-to-date version (May 2010) this blog post is available at http://mapreducebook.org**

An updated and extended version of this blog post can be found here.

**Motivation**

Learn from academic literature about how the mapreduce parallel model and hadoop implementation is used to solve algorithmic problems.

*Disclaimer: this is work in progress (look for updates)*

**Input Data – Academic Papers**

Scholar has 981 papers citing the original Mapreduce paper from 2004 – a citation amount that is approximately 10 thousand pages (~ size of a typical encyclopedia)

**What types of papers cite the mapreduce paper?**

- Algorithmic papers
- General cloud overview papers
- Cloud infrastructure papers
- Future work sections in papers (e.g. “we plan to implement this with Hadoop”)

**=> Looked at category 1 papers and skipped the rest**

**Who wrote the papers?**

Search/Internet companies/organizations: eBay, Google, Microsoft, Wikipedia, Yahoo and Yandex.

IT companies: Hewlett Packard and Intel

Universities: Carnegie Mellon Univ., TU Dresden, Univ. of Pennsylvania, Univ. of Central Florida, National Univ. of Ireland, Univ. of Missouri, Univ. of Arizona, Univ. of Glasgow, Berkeley Univ. and National Tsing Hua Univ., Univ. of California, Poznan Univ.

**Which areas do the papers cover?**

**Machine Translation**

Grammar based statistical MT on Hadoop (2009)

Large Language Models in Machine Translation (2008)

**Information/Entity Extraction and Tagging**

Web-Scale Distributional Similarity and Entity Set Expansion (2009)

The infinite HMM for unsupervised PoS tagging (2009)

**Classification**

Scaling Up Classifiers to Cloud Computers (2008)

**Ads Analysis**

Large-Scale Behavioral Targeting (2009)

Search Advertising using Web Relevance Feedback (2008)

Predicting Ads’ ClickThrough Rate with Decision Rules (2008)

**Search Query Analysis**

BBM: Bayesian Browsing Model from Petabyte-scale Data (2009)

AIDE: Ad-hoc Intents Detection Engine over Query Logs (2009)

**Indexing & parsing**

On Single-Pass Indexing with MapReduce (2009)

A Data Parallel Algorithm for XML DOM Parsing (2009)

Semantic Sitemaps: Efficient and Flexible Access to Datasets on the Semantic Web (2008)

**Spam & Malware Detection**

Characterizing Botnets from Email Spam Records (2008)

– Clustering of emails into spam campaign

– Finding probability that 2 spam messages are sent form same machine

– Estime likelihood of botnets based on common senders in spam campaigns

The Ghost In The Browser Analysis of Web-based Malware (2007)

**Image and Video Processing**

MapReduce Optimization Using Regulated Dynamic Prioritization (2009)

– Video Stream Re-Rendering

Map-Reduce Meets Wider Varieties of Applications (2008)

– Location detection in images

**Networking**

Reducible Complexity in DNS

**Simulation**

Map-Reduce Meets Wider Varieties of Applications (2008)

– Simulation of earthquakes (geology)

**Statistics**

Fast Parallel Outlier Detection for Categorical Datasets using Mapreduce (2009)

MapReduce Optimization Using Regulated Dynamic Prioritization (2009)

– Digg.com story recommendations

Calculating the Jaccard Similarity Coefficient with Map Reduce for Entity Pairs in Wikipedia (2008)

– Measuring Wikipedia Editor similarity

Map-Reduce Meets Wider Varieties of Applications (2008)

– Netflix video recommendation

Large-scale Parallel Collaborative Filtering for the Netflix Prize (2008)

**Graphs**

DOULION: Counting Triangles in Massive Graphs with a Coin (2009)

Fast counting of triangles in real-world networks: proofs, algorithms and observations (2008)

**Conclusion**

On the papers looked at most of them are focused on IT-related areas, there is lots of unwritten in academia about mapreduce and hadoop applied for algorithms in other business and technology areas.

Opportunity for following up this posting can be to: 1) in more detail describe the algorithms (e.g. input/output formats), 2) try to classify them by patterns (e.g. with similar code structure), 3) offer the opportunity to simulate them in the browser (on toy-sized data sets) and 4) provide links to Hadoop implementations of them.

**Do you need help with Hadoop/Mapreduce?**

A good start could be to read this book, or contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce – info@atbrox.com. See our posting for an example parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce