Atbrox is a startup company providing technology and services for Search and Mapreduce/Hadoop. Our background is from Google, IBM and research. Contact us if you need help with algorithms for mapreduce.
This posting is the May 2010 update to a similar posting from February 2010. It adds 30 new papers compared to the prior posting; the new ones are marked with *.
Motivation
Learn from academic literature how the mapreduce parallel model and the hadoop implementation are used to solve algorithmic problems.
Which areas do the papers cover?
Ads Analysis
*Improving ad relevance in sponsored search
*Predicting the Click-Through Rate for Rare/New Ads
*Learning Influence Probabilities in Social Networks
*Mining advertiser-specific user behavior using adfactors
*Extracting user profiles from large scale data
Large-Scale Behavioral Targeting (2009)
Search Advertising using Web Relevance Feedback (2008)
Predicting Ads’ ClickThrough Rate with Decision Rules (2008)
Bioinformatics/Medical Informatics
*A novel approach to multiple sequence alignment using hadoop data grids
MapReduce-Based Pattern Finding Algorithm Applied in Motif Detection for Prescription Compatibility Network (2009)
MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees
Machine Translation
*Training Phrase-Based Machine Translation Models on the Cloud: Open Source Machine Translation Toolkit Chaski
Grammar based statistical MT on Hadoop (2009)
Large Language Models in Machine Translation (2008)
Spatial Data Processing
Experiences on Processing Spatial Data with MapReduce
Information Extraction and Text Processing
*Statistical Sentence Chunking Using Map Reduce
Data-intensive text processing with MapReduce
Web-Scale Distributional Similarity and Entity Set Expansion (2009)
The infinite HMM for unsupervised PoS tagging (2009)
Artificial Intelligence/Machine Learning/Data Mining
*LogMaster: Mining Event Correlations in Logs of Large Scale Cluster Systems
*Stateful Bulk Processing for Incremental Analytics
*Mining dependency in distributed systems through unstructured logs analysis
*Beyond online aggregation: parallel and incremental data mining with online mapreduce
*Learning based opportunistic admission control algorithm for mapreduce as a service
*OWL reasoning with WebPIE: calculating the closure of 100 billion triples
*Scaling ECGA model building via data-intensive computing
*SPARQL basic graph pattern processing with iterative mapreduce
Residual Splash for Optimally Parallelizing Belief Propagation
Stochastic gradient boosted distributed decision trees
Distributed Algorithms for Topic Models
When Huge is Routine: Scaling Genetic Algorithms and Estimation of Distribution Algorithms via Data-Intensive Computing
Cloud Computing Boosts Business Intelligence of Telecommunication Industry
Parallel K-Means Clustering Based on MapReduce
Large-scale multimedia semantic concept modeling using robust subspace bagging and MapReduce
Parallel algorithms for mining large-scale rich-media data
Scaling Simple and Compact Genetic Algorithms using MapReduce
Scalable Distributed Reasoning using Mapreduce
Scaling Up Classifiers to Cloud Computers (2008)
For an example of Parallel Machine Learning with Hadoop/Mapreduce, check out our previous blog post.
Search Query Analysis
*Parallelizing Random Walk with Restart for large-scale query recommendation
BBM: Bayesian Browsing Model from Petabyte-scale Data (2009)
AIDE: Ad-hoc Intents Detection Engine over Query Logs (2009)
Information Retrieval (Search)
*Automatically Incorporating New Sources in Keyword Search-Based Data Integration
*Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information
*Learning URL patterns for webpage de-duplication
*Information Seeking with Social Signals: Anatomy of a Social Tag-based Exploratory Search Browser
*MIREX: Mapreduce Information Retrieval Experiments
Efficient Clustering of Web Derived Data Sets
The PageRank algorithm and application on searching of academic papers
A Parallel Algorithm for Finding Related Pages in the Web by Using Segmented Link Structures
On Single-Pass Indexing with MapReduce (2009)
A Data Parallel Algorithm for XML DOM Parsing (2009)
Semantic Sitemaps: Efficient and Flexible Access to Datasets on the Semantic Web (2008)
Spam & Malware Detection
Characterizing Botnets from Email Spam Records (2008)
– Clustering of emails into spam campaigns
– Finding the probability that 2 spam messages are sent from the same machine
– Estimating the likelihood of botnets based on common senders in spam campaigns
The Ghost In The Browser: Analysis of Web-based Malware (2007)
Image and Video Processing
*Font rendering on a GPU-based raster image processor
MapReduce Optimization Using Regulated Dynamic Prioritization (2009)
– Video Stream Re-Rendering
Map-Reduce Meets Wider Varieties of Applications (2008)
– Location detection in images
Networking
Reducible Complexity in DNS
Simulation
Map-Reduce Meets Wider Varieties of Applications (2008)
– Simulation of earthquakes (geology)
Statistics
*User-based collaborative filtering recommendation algorithms on hadoop
Brute Force and Indexed Approaches to Pairwise Document Similarity Comparisons with MapReduce (2009)
Fast Parallel Outlier Detection for Categorical Datasets using Mapreduce (2009)
MapReduce Optimization Using Regulated Dynamic Prioritization (2009)
– Digg.com story recommendations
Calculating the Jaccard Similarity Coefficient with Map Reduce for Entity Pairs in Wikipedia (2008)
– Measuring Wikipedia Editor similarity
Map-Reduce Meets Wider Varieties of Applications (2008)
– Netflix video recommendation
Large-scale Parallel Collaborative Filtering for the Netflix Prize (2008)
Numerical Mathematics
*Distributed non-negative matrix factorization for dyadic data analysis on mapreduce
*A mapreduce algorithm for SC
*Multi-GPU Volume Rendering using MapReduce
Mapreduce for Integer Factorization
Sets & Graphs
*Towards scalable RDF graph analytics on MapReduce
*Efficient Parallel Set-Similarity Joins using Mapreduce
*Max-cover algorithm in map-reduce
Distributed Algorithm for Computing Formal Concepts Using Map-Reduce Framework
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
Graph Twiddling in a MapReduce World
DOULION: Counting Triangles in Massive Graphs with a Coin (2009)
Fast counting of triangles in real-world networks: proofs, algorithms and observations (2008)
Who wrote the above papers?
Companies: China Mobile, eBay, Google, Hewlett Packard and Intel, Microsoft, Wikipedia, Yahoo and Yandex.
Government Institutions and Universities: US National Security Agency (NSA), Carnegie Mellon University, TU Dresden, University of Pennsylvania, University of Central Florida, National University of Ireland, University of Missouri, University of Arizona, University of Glasgow, Berkeley University and National Tsing Hua University, University of California, Poznan University, Florida International University, Zhejiang University, Texas A&M University, University of California at Irvine, University of Illinois, Chinese Academy of Sciences, Vrije Universiteit, Engenharia University, State University of New York, Palacky University, University of Texas at Dallas.
Best regards,
Amund Tveit (Atbrox co-founder)