May 25

Underneath are statistics about which 20 papers (of about 80 papers) were most read in our 3 previous postings about mapreduce and hadoop algorithms (the postings have been read approximately 5000 times). The list is ordered by decreasing reading frequency, i.e. most popular at spot 1.

  1. MapReduce-Based Pattern Finding Algorithm Applied in Motif Detection for Prescription Compatibility Network
    authors: Yang Liu, Xiaohong Jiang, Huajun Chen , Jun Ma and Xiangyu Zhang – Zhejiang University

  2. Data-intensive text processing with Mapreduce
    authors: Jimmy Lin and Chris Dyer – University of Maryland

  3. Large-Scale Behavioral Targeting
    authors: Ye Chen (eBay), Dmitry Pavlov (Yandex Labs) and John F. Canny (University of California, Berkeley)

  4. Improving Ad Relevance in Sponsored Search
    authors: Dustin Hillard, Stefan Schroedl, Eren Manavoglu, Hema Raghavan and Chris Leggetter (Yahoo Labs)

  5. Experiences on Processing Spatial Data with MapReduce
    authors: Ariel Cary, Zhengguo Sun, Vagelis Hristidis and Naphtali Rishe – Florida International University

  6. Extracting user profiles from large scale data
    authors: Michal Shmueli-Scheuer, Haggai Roitman, David Carmel, Yosi Mass and David Konopnicki – IBM Research, Haifa

  7. Predicting the Click-Through Rate for Rare/New Ads
    authors: Kushal Dave and Vasudeva Varma – IIIT Hyderabad

  8. Parallel K-Means Clustering Based on MapReduce
    authors: Weizhong Zhao, Huifang Ma and Qing He – Chinese Academy of Sciences

  9. Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
    authors: Mohammad Farhan Husain, Pankil Doshi, Latifur Khan and Bhavani Thuraisingham – University of Texas at Dallas

  10. Map-Reduce Meets Wider Varieties of Applications
    authors: Shimin Chen and Steven W. Schlosser – Intel Research

  11. LogMaster: Mining Event Correlations in Logs of Large-scale Cluster Systems
    authors: Wei Zhou, Jianfeng Zhan, Dan Meng (Chinese Academy of Sciences), Dongyan Xu (Purdue University) and Zhihong Zhang (China Mobile Research)

  12. Efficient Clustering of Web-Derived Data Sets
    authors: Luıs Sarmento, Eugenio Oliveira (University of Porto), Alexander P. Kehlenbeck (Google), Lyle Ungar (University of Pennsylvania)

  13. A novel approach to multiple sequence alignment using hadoop data grids
    authors: G. Sudha Sadasivam and G. Baktavatchalam – PSG College of Technology

  14. Web-Scale Distributional Similarity and Entity Set Expansion
    authors: Patrick Pantel, Eric Crestan, Ana-Maria Popescu, Vishnu Vyas (Yahoo Labs) and Arkady Borkovsky (Yandex Labs)

  15. Grammar based statistical MT on Hadoop
    authors: Ashish Venugopal and Andreas Zollmann (Carnegie Mellon University)

  16. Distributed Algorithms for Topic Models
    authors: David Newman, Arthur Asuncion, Padhraic Smyth and Max Welling – University of California, Irvine

  17. Parallel algorithms for mining large-scale rich-media data
    authors: Edward Y. Chang, Hongjie Bai and Kaihua Zhu – Google Research

  18. Learning Influence Probabilities In Social Networks
    authors: Amit Goyal, Laks V. S. Lakshmanan (University of British Columbia) and Francesco Bonchi (Yahoo! Research)

  19. MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees
    authors: Suzanne J Matthews and Tiffani L Williams – Texas A&M University

  20. User-Based Collaborative-Filtering Recommendation Algorithms on Hadoop
    authors: Zhi-Dan Zhao and Ming-sheng Shang

    Atbrox on LinkedIn

    Best regards,

    Amund Tveit (Atbrox co-founder)

    Digg This
    Reddit This
    Stumble Now!
    Buzz This
    Vote on DZone
    Share on Facebook
    Bookmark this on Delicious
    Kick It on
    Shout it
    Share on LinkedIn
    Bookmark this on Technorati
    Post on Twitter
    Google Buzz (aka. Google Reader)
Tagged with:
Oct 01

The newest and most up-to-date version (May 2010) this blog post is available at

An updated and extended version of this blog post can be found here.

Learn from academic literature about how the mapreduce parallel model and hadoop implementation is used to solve algorithmic problems.

Disclaimer: this is work in progress (look for updates)

Input Data – Academic Papers
Scholar has 981 papers citing the original Mapreduce paper from 2004 – a citation amount that is approximately 10 thousand pages (~ size of a typical encyclopedia)

What types of papers cite the mapreduce paper?

  1. Algorithmic papers
  2. General cloud overview papers
  3. Cloud infrastructure papers
  4. Future work sections in papers (e.g. “we plan to implement this with Hadoop”)

=> Looked at category 1 papers and skipped the rest

Who wrote the papers?

Search/Internet companies/organizations: eBay, Google, Microsoft, Wikipedia, Yahoo and Yandex.
IT companies: Hewlett Packard and Intel
Universities: Carnegie Mellon Univ., TU Dresden, Univ. of Pennsylvania, Univ. of Central Florida, National Univ. of Ireland, Univ. of Missouri, Univ. of Arizona, Univ. of Glasgow, Berkeley Univ. and National Tsing Hua Univ., Univ. of California, Poznan Univ.

Which areas do the papers cover?

On the papers looked at most of them are focused on IT-related areas, there is lots of unwritten in academia about mapreduce and hadoop applied for algorithms in other business and technology areas.

Opportunity for following up this posting can be to: 1) in more detail describe the algorithms (e.g. input/output formats), 2) try to classify them by patterns (e.g. with similar code structure), 3) offer the opportunity to simulate them in the browser (on toy-sized data sets) and 4) provide links to Hadoop implementations of them.

Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce – See our posting for an example parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce

Digg This
Reddit This
Stumble Now!
Buzz This
Vote on DZone
Share on Facebook
Bookmark this on Delicious
Kick It on
Shout it
Share on LinkedIn
Bookmark this on Technorati
Post on Twitter
Google Buzz (aka. Google Reader)
Tagged with:
preload preload preload