May 25

Underneath are statistics about which 20 papers (of about 80 papers) were most read in our 3 previous postings about mapreduce and hadoop algorithms (the postings have been read approximately 5000 times). The list is ordered by decreasing reading frequency, i.e. most popular at spot 1.

  1. MapReduce-Based Pattern Finding Algorithm Applied in Motif Detection for Prescription Compatibility Network
    authors: Yang Liu, Xiaohong Jiang, Huajun Chen , Jun Ma and Xiangyu Zhang – Zhejiang University

  2. Data-intensive text processing with Mapreduce
    authors: Jimmy Lin and Chris Dyer – University of Maryland

  3. Large-Scale Behavioral Targeting
    authors: Ye Chen (eBay), Dmitry Pavlov (Yandex Labs) and John F. Canny (University of California, Berkeley)

  4. Improving Ad Relevance in Sponsored Search
    authors: Dustin Hillard, Stefan Schroedl, Eren Manavoglu, Hema Raghavan and Chris Leggetter (Yahoo Labs)

  5. Experiences on Processing Spatial Data with MapReduce
    authors: Ariel Cary, Zhengguo Sun, Vagelis Hristidis and Naphtali Rishe – Florida International University

  6. Extracting user profiles from large scale data
    authors: Michal Shmueli-Scheuer, Haggai Roitman, David Carmel, Yosi Mass and David Konopnicki – IBM Research, Haifa

  7. Predicting the Click-Through Rate for Rare/New Ads
    authors: Kushal Dave and Vasudeva Varma – IIIT Hyderabad

  8. Parallel K-Means Clustering Based on MapReduce
    authors: Weizhong Zhao, Huifang Ma and Qing He – Chinese Academy of Sciences

  9. Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
    authors: Mohammad Farhan Husain, Pankil Doshi, Latifur Khan and Bhavani Thuraisingham – University of Texas at Dallas

  10. Map-Reduce Meets Wider Varieties of Applications
    authors: Shimin Chen and Steven W. Schlosser – Intel Research

  11. LogMaster: Mining Event Correlations in Logs of Large-scale Cluster Systems
    authors: Wei Zhou, Jianfeng Zhan, Dan Meng (Chinese Academy of Sciences), Dongyan Xu (Purdue University) and Zhihong Zhang (China Mobile Research)

  12. Efficient Clustering of Web-Derived Data Sets
    authors: Luıs Sarmento, Eugenio Oliveira (University of Porto), Alexander P. Kehlenbeck (Google), Lyle Ungar (University of Pennsylvania)

  13. A novel approach to multiple sequence alignment using hadoop data grids
    authors: G. Sudha Sadasivam and G. Baktavatchalam – PSG College of Technology

  14. Web-Scale Distributional Similarity and Entity Set Expansion
    authors: Patrick Pantel, Eric Crestan, Ana-Maria Popescu, Vishnu Vyas (Yahoo Labs) and Arkady Borkovsky (Yandex Labs)

  15. Grammar based statistical MT on Hadoop
    authors: Ashish Venugopal and Andreas Zollmann (Carnegie Mellon University)

  16. Distributed Algorithms for Topic Models
    authors: David Newman, Arthur Asuncion, Padhraic Smyth and Max Welling – University of California, Irvine

  17. Parallel algorithms for mining large-scale rich-media data
    authors: Edward Y. Chang, Hongjie Bai and Kaihua Zhu – Google Research

  18. Learning Influence Probabilities In Social Networks
    authors: Amit Goyal, Laks V. S. Lakshmanan (University of British Columbia) and Francesco Bonchi (Yahoo! Research)

  19. MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees
    authors: Suzanne J Matthews and Tiffani L Williams – Texas A&M University

  20. User-Based Collaborative-Filtering Recommendation Algorithms on Hadoop
    authors: Zhi-Dan Zhao and Ming-sheng Shang

    Atbrox on LinkedIn

    Best regards,

    Amund Tveit (Atbrox co-founder)

Tagged with:
May 08

Atbrox is startup company providing technology and services for Search and Mapreduce/Hadoop. . Our background is from Google, IBM and research. Contact us if you need help with algorithms for mapreduce

This posting is the May 2010 update to the similar posting from February 2010, with 30 new papers compared to the prior posting, new ones are marked with *.

Motivation
Learn from academic literature about how the mapreduce parallel model and hadoop implementation is used to solve algorithmic problems.

Which areas do the papers cover?

Who wrote the above papers?
Companies: China Mobile, eBay, Google, Hewlett Packard and Intel, Microsoft, Wikipedia, Yahoo and Yandex.
Government Institutions and Universities: US National Security Agency (NSA)
, Carnegie Mellon University, TU Dresden, University of Pennsylvania, University of Central Florida, National University of Ireland, University of Missouri, University of Arizona, University of Glasgow, Berkeley University and National Tsing Hua University, University of California, Poznan University, Florida International University, Zhejiang University, Texas A&M University, University of California at Irvine, University of Illinois, Chinese Academy of Sciences, Vrije Universiteit, Engenharia University, State University of New York, Palacky University, University of Texas at Dallas

Atbrox on LinkedIn

Best regards,
Amund Tveit (Atbrox co-founder)

Tagged with:
Feb 17

Atbrox is startup company providing technology and services for Search and Mapreduce/Hadoop. Our background is from Google, IBM and Research.

Hadoop is a set of open source technologies that supports reliable and cost-efficient ways of dealing with large amounts of data. Given the vast amounts of business critical and required data companies gather (e.g. required due to Sarbanes–Oxley (SOX) or EU Data Retention Directive), Hadoop becomes increasingly relevant.

Hadoop Technologies

Several Hadoop technologies are inspired by Google’s infrastructure.

1. Processing and Storage

1.1 Processing – Mapreduce
Mapreduce can be used to process and extract knowledge from arbitrary amounts of data, e.g. web data, measurement data or financial transactions – Visa reduced their processing time for transactional statistics from 1 month to 13 minutes with Hadoop. In order to use Mapreduce developers need to parallelize their problem and program against an API – here for an example of machine learning with Hadoop. Hadoop’s Mapreduce is inspired by the paper MapReduce: Simplified Data Processing on Large Clusters.

1.2 File Storage – HDFS
HDFS is scalable and distributed file system. It supports configurable degree of replication for reliable storage even when running on cheap hardware. HDFS is inspired by the paper The Google File System

1.3 Database – HBase
HBase is a distributed database that supports storing billions of rows with millions of columns that runs on top of HDFS. HBase can replace traditional databases if they get problems scaling or become to expensive licence-wise, see this presentation about Hbase. HBase is inspired by the paper Bigtable: A Distributed Storage System for Structured Data

2. Data Analysis

Mapreduce can be used to analyze all kinds of data (e.g. text, multimedia, numerical data) and have high flexibility, but for more structured data the following Hadoop Technologies can be used:

2.1 Pig
SQL-like language/system running on top of Mapreduce. Pig is developed by Yahoo and inspired by the paper Interpreting the Data: Parallel Analysis with Sawzall

2.2 Hive
Datawarehouse running on top of Hadoop, developed by Facebook. Query language is very similar to SQL.

3. Distributed Systems Development

3.1 Avro
Avro is used for efficient serialization of data and communication between services. It is in several ways similar to Google’s protocolbuffers and Facebook’s Thrift.

3.2 Zookeeper
Coordination between distributed processes. It is inspired by the paper The Chubby lock service for loosely-coupled distributed systems

3.3 Chukwa
Monitoring of distributed systems.


Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce – info@atbrox.com. See our posting for an example parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce.

Best regards,

Amund Tveit

Tagged with:
preload preload preload
Blog WebMastered by All in One Webmaster.