Hadoop is a set of open source technologies that support reliable and cost-efficient ways of dealing with large amounts of data. Given the vast amounts of business-critical and legally required data that companies gather (e.g. due to Sarbanes–Oxley (SOX) or the EU Data Retention Directive), Hadoop is becoming increasingly relevant.
Several Hadoop technologies are inspired by Google’s infrastructure.
1. Processing and Storage
1.1 Processing – Mapreduce
Mapreduce can be used to process and extract knowledge from arbitrary amounts of data, e.g. web data, measurement data or financial transactions. Visa, for example, reduced the processing time for transactional statistics from 1 month to 13 minutes with Hadoop. In order to use Mapreduce, developers need to parallelize their problem and program against an API; machine learning with Hadoop is one example. Hadoop’s Mapreduce is inspired by the paper MapReduce: Simplified Data Processing on Large Clusters.
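To make "parallelize the problem and program against an API" more concrete, here is a minimal sketch of the map, shuffle and reduce steps for a word count. It is plain Python that imitates the programming model on a single machine; it does not use Hadoop's actual API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are made up for this illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(documents):
    # In a real cluster each document could be mapped on a different
    # machine; this sketch simply loops over them in one process.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    docs = ["hadoop stores data", "hadoop processes data"]
    print(run_job(docs))
```

The developer only writes the map and reduce functions; because each map call looks at one record and each reduce call looks at one key, the framework is free to spread the calls over many machines.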
1.2 File Storage – HDFS
HDFS is a scalable and distributed file system. It supports a configurable degree of replication for reliable storage even when running on cheap hardware. HDFS is inspired by the paper The Google File System.
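Why replication gives reliable storage on cheap hardware can be illustrated with a small simulation. The sketch below assumes a simple round-robin placement of blocks on nodes, which is a deliberate simplification and not HDFS's real placement policy; `place_blocks` and `readable_blocks` are hypothetical helper names.

```python
def place_blocks(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + i) % len(nodes)] for i in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have a replica on a live node."""
    return [
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    ]


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]
    for replication in (1, 3):
        placement = place_blocks(4, nodes, replication)
        alive = readable_blocks(placement, failed_nodes={"n1"})
        print(f"replication={replication}: {len(alive)} of 4 blocks readable")
```

With a single copy per block, losing one node loses data; with three copies, the same failure leaves every block readable, at the cost of three times the disk space.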
1.3 Database – HBase
HBase is a distributed database that runs on top of HDFS and supports storing billions of rows with millions of columns. HBase can replace traditional databases if they run into scaling problems or become too expensive licence-wise. HBase is inspired by the paper Bigtable: A Distributed Storage System for Structured Data.
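Tables with millions of columns are practical because, in the Bigtable-style data model, a row only stores the cells that actually have values, and rows are kept sorted by key. The following is a toy in-memory sketch of that idea; the `SparseTable` class and its methods are invented for this illustration and are not the HBase client API.

```python
from collections import defaultdict


class SparseTable:
    """Toy sparse table: row key -> {column name -> value}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        # Only columns that are written take up space in a row.
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        return self._rows.get(row_key, {}).get(column, default)

    def scan(self, start_key, stop_key):
        """Yield (row_key, cells) for start_key <= row_key < stop_key."""
        for key in sorted(self._rows):
            if start_key <= key < stop_key:
                yield key, dict(self._rows[key])


if __name__ == "__main__":
    table = SparseTable()
    table.put("user:001", "info:name", "Ada")
    table.put("user:002", "info:name", "Bob")
    table.put("user:002", "stats:logins", 7)
    for row_key, cells in table.scan("user:001", "user:003"):
        print(row_key, cells)
```

Because rows are ordered by key, a range scan touches only neighbouring rows, which is what allows a real system to split a table into key ranges and serve them from different machines.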
2. Data Analysis
Mapreduce can be used to analyze all kinds of data (e.g. text, multimedia, numerical data) and is highly flexible, but for more structured data the following Hadoop technologies can be used:
Pig: a SQL-like language and system running on top of Mapreduce. Pig was developed at Yahoo and is inspired by the paper Interpreting the Data: Parallel Analysis with Sawzall.
Hive: a data warehouse running on top of Hadoop, developed by Facebook. Its query language is very similar to SQL.
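SQL-like systems that run on top of Mapreduce typically translate a query into one or more Mapreduce jobs. As a rough illustration of that translation, the sketch below hand-codes a single aggregation query as a map step and a reduce step in plain Python. The query, the `sales` rows and the function names are all hypothetical, and a real query compiler does far more than this.

```python
from collections import defaultdict

# Hypothetical query being imitated:
#   SELECT country, SUM(amount) FROM sales GROUP BY country


def map_row(row):
    """Map: the GROUP BY column becomes the key, the summed column the value."""
    return row["country"], row["amount"]


def group_by_sum(rows):
    """Group mapped values by key, then reduce each group with SUM."""
    grouped = defaultdict(list)
    for row in rows:
        key, value = map_row(row)
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}


if __name__ == "__main__":
    sales = [
        {"country": "NO", "amount": 10},
        {"country": "US", "amount": 5},
        {"country": "NO", "amount": 20},
    ]
    print(group_by_sum(sales))
```

The benefit of the higher-level language is that the analyst writes only the one-line query and never sees the map and reduce functions.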
3. Distributed Systems Development
ZooKeeper: coordination between distributed processes. It is inspired by the paper The Chubby lock service for loosely-coupled distributed systems.
Monitoring of distributed systems.
Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce: firstname.lastname@example.org. See our posting for an example of parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce.
Amund Tveit, co-founder of Atbrox