The most up-to-date version (May 2010) of this blog post is available at http://mapreducebook.org
Atbrox is a startup company providing technology and services for Search and Mapreduce/Hadoop. Our background is from Google, IBM and research.
This posting is an update to the similar posting from October 2009. It roughly doubles the number of papers from the previous posting; the new ones are marked with *.
Motivation
Learn from the academic literature how the MapReduce parallel model and its Hadoop implementation are used to solve algorithmic problems.
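For readers new to the model, here is a minimal single-process sketch of the map, shuffle and reduce steps that the papers below build on. It is plain Python and does not use Hadoop; all function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`, `word_mapper`, `sum_reducer`) are made up for this illustration, and word counting is only the usual toy problem, not taken from any of the listed papers.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record and yield its (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to the list of values for that key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    """Run map, shuffle and reduce in sequence in a single process."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


def word_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Sum the counts emitted for one word."""
    return sum(values)


if __name__ == "__main__":
    lines = ["hadoop runs mapreduce jobs", "mapreduce jobs scale out"]
    print(run_job(lines, word_mapper, sum_reducer))
```

In a real Hadoop job the mapper and reducer run on many machines and the framework performs the shuffle over the network; the sketch only shows the shape of the computation.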
Which areas do the papers cover?
Bioinformatics/Medical Informatics
* MapReduce-Based Pattern Finding Algorithm Applied in Motif Detection for Prescription Compatibility Network (2009)
* MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees
Machine Translation
Grammar based statistical MT on Hadoop (2009)
Large Language Models in Machine Translation (2008)
Spatial Data Processing
* Experiences on Processing Spatial Data with MapReduce
Information Extraction and Text Processing
* Data-intensive text processing with MapReduce
* Web-Scale Distributional Similarity and Entity Set Expansion (2009)
The infinite HMM for unsupervised PoS tagging (2009)
Artificial Intelligence/Machine Learning/Data Mining
* Residual Splash for Optimally Parallelizing Belief Propagation
* Stochastic gradient boosted distributed decision trees
* Distributed Algorithms for Topic Models
* When Huge is Routine: Scaling Genetic Algorithms and Estimation of Distribution Algorithms via Data-Intensive Computing
* Cloud Computing Boosts Business Intelligence of Telecommunication Industry
* Parallel K-Means Clustering Based on MapReduce
* Large-scale multimedia semantic concept modeling using robust subspace bagging and MapReduce
* Parallel algorithms for mining large-scale rich-media data
* Scaling Simple and Compact Genetic Algorithms using MapReduce
* Scalable Distributed Reasoning using Mapreduce
Scaling Up Classifiers to Cloud Computers (2008)
For an example of Parallel Machine Learning with Hadoop/Mapreduce, check out our previous blog post.
Ads Analysis
Large-Scale Behavioral Targeting (2009)
Search Advertising using Web Relevance Feedback (2008)
Predicting Ads’ ClickThrough Rate with Decision Rules (2008)
Search Query Analysis
BBM: Bayesian Browsing Model from Petabyte-scale Data (2009)
AIDE: Ad-hoc Intents Detection Engine over Query Logs (2009)
Information Retrieval (Search)
* Efficient Clustering of Web Derived Data Sets
* The PageRank algorithm and application on searching of academic papers
* A Parallel Algorithm for Finding Related Pages in the Web by Using Segmented Link Structures
On Single-Pass Indexing with MapReduce (2009)
A Data Parallel Algorithm for XML DOM Parsing (2009)
Semantic Sitemaps: Efficient and Flexible Access to Datasets on the Semantic Web (2008)
Spam & Malware Detection
Characterizing Botnets from Email Spam Records (2008)
– Clustering of emails into spam campaigns
– Finding the probability that 2 spam messages are sent from the same machine
– Estimating the likelihood of botnets based on common senders in spam campaigns
The Ghost In The Browser: Analysis of Web-based Malware (2007)
Image and Video Processing
MapReduce Optimization Using Regulated Dynamic Prioritization (2009)
– Video Stream Re-Rendering
Map-Reduce Meets Wider Varieties of Applications (2008)
– Location detection in images
Networking
Reducible Complexity in DNS
Simulation
Map-Reduce Meets Wider Varieties of Applications (2008)
– Simulation of earthquakes (geology)
Statistics
* Brute Force and Indexed Approaches to Pairwise Document Similarity Comparisons with MapReduce (2009)
Fast Parallel Outlier Detection for Categorical Datasets using Mapreduce (2009)
MapReduce Optimization Using Regulated Dynamic Prioritization (2009)
– Digg.com story recommendations
Calculating the Jaccard Similarity Coefficient with Map Reduce for Entity Pairs in Wikipedia (2008)
– Measuring Wikipedia Editor similarity
Map-Reduce Meets Wider Varieties of Applications (2008)
– Netflix video recommendation
Large-scale Parallel Collaborative Filtering for the Netflix Prize (2008)
Numerical Mathematics
* Mapreduce for Integer Factorization
Graphs
* Distributed Algorithm for Computing Formal Concepts Using Map-Reduce Framework
* Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
* Graph Twiddling in a MapReduce World
DOULION: Counting Triangles in Massive Graphs with a Coin (2009)
Fast counting of triangles in real-world networks: proofs, algorithms and observations (2008)
Who wrote the above papers? (section added 20100307)
Companies: China Mobile, eBay, Google, Hewlett Packard and Intel, Microsoft, Wikipedia, Yahoo and Yandex.
Government Institutions and Universities: US National Security Agency (NSA), Carnegie Mellon University, TU Dresden, University of Pennsylvania, University of Central Florida, National University of Ireland, University of Missouri, University of Arizona, University of Glasgow, Berkeley University and National Tsing Hua University, University of California, Poznan University, Florida International University, Zhejiang University, Texas A&M University, University of California at Irvine, University of Illinois, Chinese Academy of Sciences, Vrije Universiteit, Engenharia University, State University of New York, Palacky University, University of Texas at Dallas.
Best regards,
Amund Tveit, co-founder of Atbrox