Oct 27

SimpleDB is a service primarily for storing and querying structured data (it can e.g. be used for a product catalog with descriptive features per product, or an academic event service with extracted features such as event dates, locations, organizers and topics). (If one wants “heavier data” in SimpleDB, e.g. video or images, a good approach would be to store paths to Hadoop DFS or S3 objects in the attributes instead of storing the data directly.)

Unstructured Search for SimpleDB

This posting presents an approach for adding (flexible) unstructured search support to SimpleDB (with some preliminary query latency numbers below – and very preliminary Python code). The motivation is to:
  1. Support unstructured search with very low maintenance
  2. Combine structured and unstructured search
  3. Figure out the feasibility of unstructured search on top of SimpleDB

The Structure of SimpleDB

SimpleDB is roughly a persistent hashtable of hashtables, where each row (a named item in the outer hashtable) has another hashtable with up to 256 key-value pairs (called attributes). Each attribute value can be 1024 bytes, so 256 kilobytes in total of values per row (note: twice that amount if you also store data as part of the attribute names, plus 1024 bytes in the item name). Check out Wikipedia for detailed SimpleDB storage characteristics.
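
To make the data model concrete, here is a minimal sketch using the boto library (the domain name, item name and attribute values are made up for illustration, and boto is assumed to be configured with AWS credentials):

    import boto

    sdb = boto.connect_sdb()                 # assumes AWS credentials are configured for boto
    domain = sdb.create_domain('products')   # 'products' is a made-up domain name

    # one item (a key in the outer hashtable) with a few attributes (the inner hashtable)
    domain.put_attributes('item-0042', {
        'title': 'Example product',
        'price': '19.90',                                 # values are strings, max 1024 bytes each
        'video': 's3://mybucket/videos/item-0042.mp4',    # pointer to "heavier data", not the data itself
    })

    item = domain.get_attributes('item-0042')
    print item['title'], item['price']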

Inverted files

Inverted files are a common way of representing indices for unstructured search. In their basic form they (logically) map each word to a list of the pages or files the word occurs in. When a query arrives, one looks up each query word in the inverted file and finds the pages or files where the words occur. (note: if you are curious about inverted file representations, check out the survey – Inverted files for text search engines)
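
As a tiny illustration of that basic form (plain in-memory Python, no SimpleDB involved):

    from collections import defaultdict

    documents = {
        'url-1': 'cheap flights to oslo',
        'url-2': 'cheap hotels in oslo',
    }

    # build: term -> set of documents containing the term
    inverted = defaultdict(set)
    for url, text in documents.items():
        for term in text.split():
            inverted[term].add(url)

    # query: intersect the posting lists of the query terms
    def search(query):
        postings = [inverted.get(term, set()) for term in query.split()]
        return set.intersection(*postings) if postings else set()

    print search('cheap oslo')   # -> both url-1 and url-2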

One way of representing inverted files on SimpleDB is to map the inverted file onto the attributes, i.e. have one SimpleDB domain where each item is a word (term), and let the attributes store the list of URLs containing that word. Since each URL contains many words, it can be useful to have a separate SimpleDB domain mapping from URL hash to URL, and store the URL hash in the inverted file (this keeps the inverted file smaller). In the draft code we created 250 key-value attributes, where each key was a string from “0” to “249” and each corresponding value contained URL hashes (and positions of the term) joined with two different string separators. If that gives too little space per item – e.g. for stop words – one could “wrap” the inverted file entry by adding the same term with an incremental postfix (note: if that also gives too little space one could additionally wrap across SimpleDB domains).
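
A hedged sketch of how one inverted file entry could be packed into such numbered attributes (the separator characters, helper names and limits below are illustrative; the actual draft code is linked further down):

    import hashlib

    ENTRY_SEP = ';'    # separates (url hash, positions) entries within one attribute value
    POS_SEP = ','      # separates the url hash from the term positions
    MAX_ATTRS = 250    # numbered attribute keys "0" .. "249"
    MAX_VALUE = 1024   # SimpleDB limit per attribute value, in bytes

    def url_hash(url):
        return hashlib.md5(url).hexdigest()

    def pack_postings(postings):
        """postings: list of (url, [positions]) -> dict of attribute name -> attribute value."""
        attrs, current, key = {}, [], 0
        for url, positions in postings:
            entry = POS_SEP.join([url_hash(url)] + [str(p) for p in positions])
            if current and len(ENTRY_SEP.join(current + [entry])) > MAX_VALUE:
                attrs[str(key)] = ENTRY_SEP.join(current)
                current, key = [], key + 1
                if key >= MAX_ATTRS:
                    break   # here one would "wrap" to term + incremental postfix
            current.append(entry)
        if current and key < MAX_ATTRS:
            attrs[str(key)] = ENTRY_SEP.join(current)
        return attrs

    # the entry for a term could then be stored with domain.put_attributes(term, attrs)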

Preliminary query latency results

Warning: the data set used was NLTK‘s inaugural address collection, which is far from the biggest.

Inverted File Entry Fetch latency Distribution (in seconds)

Conclusion: the results from 1000 fetches of inverted file entries are relatively stable and clustered around 0.020s (20 milliseconds), which is promising enough to pursue further (though it is still early to decide, given only tests on small data sets so far). Combining this with e.g. memcached could also be explored, in order to get the average fetch time even lower.
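
For reference, a simplified sketch of how such fetch timings could be collected (the domain name and term are placeholders; the actual preliminary code and timing results are linked below):

    import time
    import boto

    sdb = boto.connect_sdb()
    domain = sdb.get_domain('invertedindex')    # placeholder domain holding the inverted file

    timings = []
    for _ in range(1000):
        start = time.time()
        entry = domain.get_attributes('someterm')   # fetch one inverted file entry
        timings.append(time.time() - start)

    timings.sort()
    print 'median fetch time: %.3fs' % timings[len(timings) // 2]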

Preliminary Python code including timing results (this was run on a Fedora large EC2 node somewhere in a US east coast data center).

Oct 07

Sometimes it can be useful to compile Python code for Amazon’s Elastic Mapreduce into C++ and then into a binary. The motivation for that could be to integrate with (existing) C or C++ code, or to increase performance for CPU-intensive mapper or reducer methods. Here follows a description of how to do that:

  1. Start a small EC2 node with AMI similar to the one Elastic Mapreduce is using (Debian Lenny Linux)
  2. Skim quickly through the Shedskin tutorial
  3. Log into the EC2 node and install the Shedskin Python compiler
  4. Write your Python mapper or reducer program and compile it into C++ with Shedskin
    • E.g. the command python ss.py mapper.py would generate the C++ files mapper.hpp and mapper.cpp, a Makefile and an annotated Python file mapper.ss.py (see the mapper sketch after this list).
  5. Optionally update the C++ code generated by Shedskin to use other C or C++ libraries
    • note: with Fortran-to-C you can probably integrate your Python code with existing Fortran code (e.g. numerical/high performance computing libraries). Similarly for Cobol (e.g. in the financial industry) with OpenCobol (which compiles Cobol into C). Please let us know if you try this or need help with it.
  6. Add -static as the first CCFLAGS parameter in the generated Makefile to make it a static executable
  7. Compile the C++ code into a binary with make and check that you don’t get a dynamic executable with ldd (you want a static executable)
  8. Run strip on the binary to make it smaller
  9. Upload your (ready) binary to a chosen location in Amazon S3
  10. Read Elastic Mapreduce Documentation on how to use the binary to run Elastic Mapreduce jobs.
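
To make step 4 concrete, here is a minimal word-count style mapper of the kind Shedskin can typically compile (Shedskin handles a restricted, implicitly statically typed subset of Python, so the sketch sticks to plain builtins; it reads lines from stdin and writes tab-separated key/value pairs, which is what Hadoop Streaming expects):

    import sys

    def main():
        # emit "<word>\t1" for every word on stdin; Hadoop groups and sorts
        # these by key before they reach the reducer
        for line in sys.stdin:
            for word in line.split():
                print word + '\t1'

    if __name__ == '__main__':
        main()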

Note: if you skip the shedskin-related steps this approach would also work if you are looking for how to use C or C++ mappers or reducers with Elastic Mapreduce.

Note: this approach should probably work also with Cloudera’s distribution for Hadoop.


Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce – info@atbrox.com. See our posting for an example of parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce
Oct 03

Other recommended writeups:

Location: Roosevelt Hotel, NYC

12:35 – Joe Cunningham – Visa – Large scale transaction analysis
– responsible for Visa Technology Strategy and Innovation
been playing with Hadoop for 9 months
probably many in audience learning and starting out with Hadoop

Agenda:
1) VisaNet overview
2) Value-added information products
3) Hadoop@Visa – research results

About Visa:
– 60 Billion market cap
– well-known card products, and also behind the scene information products
– Visa brand has high trust
– For a card-holder a Visa-card means global acceptance
– For a shop owner, if you get a Visa payment approval you will be paid

VisaNet
VisaNet is the largest, most advanced payment network in the world
characteristics:
28M locations,
130M authorizations/day,
1500 endpoints,
Processes transactions faster than 1s
1.4M ATMs,
Processes in 175 currencies,
Less than 2s unavailability per year (!)
– according to my calculations roughly seven 9s (0.9999999366)
16300 financial institutions

Visa Processing Architecture
Security/Access Services -> Message|File|Web
VisaNet Services Integration -> Authorization|Clearing&Settlement
Dispute handling, Risk, Information
Scoring every transaction (used for issuer to approve/decline transaction)

Value added Info products
– Info services
Client: Portfolio Analysis, Visa Incentive Network
Accountholder: transaction alerts, account updater, tailored rewards
– Risk management services
Account monitoring
Authentication
Encryption

Hadoop@Visa
Run a pipeline of prototypes in lab facility in SF
Any technology taken into Visa needs to match scalability and reliability requirements

Research Lab Setup
– VM System:
Custom Analytic Stacks
Encryption Processing
Relational Database
– Hadoop Systems
Management Stack
Hadoop #1 ~40TB / 42 nodes (2 years of raw transaction data)
Hadoop #2 ~300TB / 28 nodes

Risk Product Use Case
Create critical data model elements, such as keys and transaction statistics, which feed our real-time risk-scoring systems
Input: Transactions – Merchant Category, Country/Zip
Output: Key & Statistics – MCCZIP Key – stats related to account, trans. type, approval, fraud, IP address etc.
Research Sample: 500M distinct accounts, 100M transactions per day, 200 bytes per transaction, 2 years – 73B transactions (36TB)
Processing time from 1 month to 13 minutes! (note: ~3000 times faster)
(Generate synthetic transactions used to test the model)

Financial Enterprise Fit
– key questions under research:
– what will the Hadoop Solution Stack(s) look like?
– File system, Transaction Sample System, Relational Back-end (integration path), Analytics Processing
– Internal vs external cloud
– How do I get data into a cloud in a secure way?
– How does HSM and security integration work in Hadoop?
– What are the missing pieces?

Why Hadoop@Visa?
– analyze volumes of data with response times that are not possible today
– requirement: need to fit with existing solutions

Cross Data Center Log Processing – Stu Hood, Rackspace

(Email and apps division, work on search team)

Agenda
Use Case Background
– “Rackapps” – Hybrid Mail Hosting, 40% use a mix of exchange and rackspace mail

Use Case: Log Types

Use Case: Querying
– was the mail delivered?
– spam – why was it (not) marked as spam
– access – who checked/failed to check mail?
more advanced questions:
– which delivery routes have the highest latency?
– which are the spammiest IPs?
– Where in the world do customers log in from?
Elsewhere:
– billing

Previous Solutions
– 1999-2006 – go to where log files are generated, querying with grep
– 2006-2007 / bulk load to MySQL – worked for a year

Hadoop Solution
– V3 – lucene indexes in Hadoop
– 2007- present
– store 7 days uncompressed
– queries take seconds
– long term queries with mapreduce (6 months of data available for MR queries)
– all 3 datacenters

Alternatives considered:
– Splunk – good for realtime, but not great for archiving
– Data warehouse package – not realtime, but fantastic for longterm analysis
– Partitioned MySQL – half-baked solution
=> Hadoop hit the sweet spot

Hadoop Implementation
SW
– collect data using syslog-ng (considering Scribe)
– storage: deposits into Hadoop (scribe will remove that)
HW
– 2-4 collector machines per datacenter
– hundreds of source machines
20 solr nodes

Implementation: Indexing/Querying
– indexing – unique processing code for each schema
– querying
– “realtime”
– sharded lucene/solr instances merge index chunks from Hadoop
– using Solr-API
– raw logs
– using Hadoop Streaming and unix grep
– Mapreduce

Implementation: Timeframe
– development – 1.5 people in 3 months
– deployments – using Cloudera’s distribution
– roadblocks – bumped into job-size limits

Have run close to 1 million jobs on our cluster, and it has not gone down (except for other reasons such as maintenance)

Advantages – storage
– all storage in one place
Raw logs: 3 days, in HDFS
Indexes: 7 days
Archived Indexes: 6 months

Advantages – analysis
– Java Mapreduce API
– Apache Pig
– ideal for one-off queries
– Hadoop Streaming

Pig Example – whitehouse.gov mail spoofing

Advantages – Scalability, Cost, Community
– scalability – easy to add nodes
– cost – only hardware
– community – cloudera has been a benefit, deployment is trivial

Data Processing for Financial Services – Peter Krey and Sin Lee, JP Morgan Chase

Innovation & Shared Services, Firmwide Engineering & Architecture

note: there are certain constraints on what can be shared due to regulations

JPMorgan Chase + Open Source
– Qpid (AMQP) – top level Apache project
– Tyger – Apache + Tomcat + Spring

Hadoop in the Enterprise – Economics Driven
– attractive: economics
– Many big lessons from Web 2.0 community
– Potential for Large Capex and Opex “Dislocation”
– reduce consumption of enterprise premium resources
– grid computing economics brought to data intensive computing
– stagnant data innovation
– Enabling & potentially disruptive platform
– many historical similarities
– java, linux, tomcat, web/internet
– minis to client/server, client/server to web, solaris to linux, ..
– Key question: what can be built on top of Hadoop?
Back to economics driven – very cost-effective

Hadoop in the Enterprise – Choice Driven
– Overuse of relational database containers
– institutional “Muscle memory” – not too much else to choose from
– increasingly large percentage of static data stored in proprietary transactional DBs
– Over-Normalized Schemas: does this still make sense with cheap compute & storage?

– Enterprise Storage “Prisoners”
– Captive to the economics & technology of “a few” vendors
– Developers need more choice
– Too much proprietary, single-source data infrastructure
– increasing need for minimal/no systems + storage admins

Hadoop in the Enterprise – Other Drivers
– Growing developer interest in “Reduced RDBMS” Data technologies
– open source, distributed, non-relational databases
– growing influence of web 2.0 technologies & thinking of enterprise
– hadoop, cassandra, hbase, hive, couchdb, hadoopDB, .. , others
– memcached for caching

FSI Industry Drivers
– Increased regulatory oversight + reporting = more data needed over a longer period of time
– triple data amounts from 2007 to 2009
– growing need for less expensive data repository/store
– increased need to support “one off” analysis on large data

Active POC Pipeline
– Growing stream of real projects to gauge hadoop “goodness of fit”
– broad spectrum of use cases
– driven by need to impact/dislocate OPEX+CAPEX
– looking for orders of magnitude
– evaluated on metric based performance, functional and economic measures
– avoid the “data falling on the floor” phenomenon
– tools are really really important, keep tools and programming models simple

Hadoop Positioning
– Latency x Storage amount curve,

Cost comparisons
– SAN vs Hadoop HDFS cost comparison (GB/month)
– Hadoop much cheaper

Hadoop Additions and Must Haves:
– Improved SQL Front-End Tool Interoperability
– Improved Security & ACL enforcement – Kerberos Integration
– Grow Developer Programming Model Skill Sets
– Improve Relational Container Integration & Interop for Data Archival
– Management & Monitoring Tools
– Improved Developer & Debugging Tools
– Reduce Latency via integration with open source data caching
– memcached – others
– Invitation to FSI or Enterprise roundtable

Protein Alignment – Paul Brown, Booz Allen

Biological information
– Body – Cells – Chromosomes – Gene – DNA/RNA

Bioinformatics – The Pain
– too much data

So What? Querying a database of sequences for similar sequences
– one-to-many comparison
– 58000 proteins in PDB
– Protein alignment frequently used in the development of medicines
– Looking for a certain sequence across species, helps indicate function
Implementation in Hadoop
– distribute database sequence across each node
– send query seq. inside Mapreduce (or dist.cache)
– scales well
– existing algorithms port easily

So What? Comparing sequences in bulk
– many-to-many
– DNA hybridization (reconstruction)
Ran on AWS
Hadoop:
– if the whole dataset fits on one computer:
– Used distributed cache, assign each node a piece of the list
– But if the data does not fit on one computer…
– pre-join all possible pairs with one MapReduce

So What? Analyzing really big sequences
– one big sequence to many small sequences
– scanning dna for structure
– population genetics
– hadoop implementation

Demonstration Implementation: Smith-Waterman Alignment
– one of the more computationally intensive matching and alignment techniques
– big matrix – (sequences to compare on row and column and calculations within)
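
For reference, a minimal (single-node, unoptimized) Smith-Waterman scoring sketch in Python, filling the kind of matrix the slide refers to; the scoring parameters are arbitrary and traceback is omitted:

    def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
        """Return the best local alignment score between sequences a and b."""
        rows, cols = len(a) + 1, len(b) + 1
        h = [[0] * cols for _ in range(rows)]   # scoring matrix
        best = 0
        for i in range(1, rows):
            for j in range(1, cols):
                diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
                best = max(best, h[i][j])
        return best

    print smith_waterman('ACACACTA', 'AGCACACA')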

Amazon implementation
– 250 machines
– EC2
– run in 10 minutes for a single sequence. Runs in 24hrs for NxN comparison
– cost $40/hr

==> very cool 3D video of amazon ec2 nodes
– failing job due to 10% of nodes stuck on something (e.g. very long sequences)

Real-time Business Intelligence, Bradford Stephens

Topics
– Scalability and BI
– Costs and Abilities
– Search as BI

Tools: Zookeeper, Hbase, Katta (dist.search on Hadoop) and Bobo (faceted search for lucene)
– http://sourceforge.net/projects/bobo-browse/
– http://sourceforge.net/projects/katta/develop

100TB structured and unstructured data – Oracle $100M, Hadoop and Katta $250K

Building data cubes in real time (with faceted search)

Real-time Mapreduce on HBase
Search/BI as a platform – “google my datawarehouse”

Counting, Clustering and other data tricks, Derek Gottfried, New York Times

back in 2007 – would like to try as many EC2 instances as possible

Problem
– freeing up historical archives of NYTimes.com (1851-1922)
(in the public domain)

Currently:
– 2009 – web analytics
3 big data buckets:
1) registration/demographics
2) articles 1851-today
– a lot of metadata about each article
– unique data, extract people, places, .. to each article => high precision search
3) usage data/web logs
– biggest piece – piles up

How do we merge the 3 datasets?

Using EC2 – 20 machines
Hadoop 0.20.0
12 TB of data
Straight MR in Java
(mostly java + postprocessing in python)

combining weblog data with demographic data, e.g. Twitter click-backs by age group


Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce – info@atbrox.com. See our posting for an example of parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce

Oct 02

Location: Roosevelt Hotel NYC

09:11 – Christophe Bisciglia (Cloudera)

Announcement about the HBase and UI Birds of a Feather (BOF) sessions
Hadoop history overview
happenings during the last year: Hive, Pig, Sqoop (data import) ++
yesterday: Vertica announced mapreduce support for their database system
Walkthrough of Cloudera’s distribution for Hadoop
ANNOUNCEMENT: deploy Cloudera’s distribution for Hadoop on SoftLayer and Rackspace

09:23 – Jeff Hammerbacher (Cloudera)
started his career at Bear Stearns
– Cloudera is a software company with Apache Hadoop at the core
There is a lot more sw to be built:
1) collection,
2) processing
3) report and analysis

The Apache Hadoop community is the center of innovation for Big Data
– Yahoo pushing the envelope on scalability
– Large clusters for academic research (Yahoo, HP and Intel’s Open Cirrus)
– NSF, IBM and Google’s CluE
– sigmod best paper award: Pig team from Yahoo
– worldwide – Hadoop world beijing

Cloudera Desktop
4 applications running on this desktop (inside the browser)
1) HDFS Web Interface
– file browser
2) Hadoop Mapreduce Web Interface (can potentially debug)
– Job Browser (Cluster Detail)
3) Cluster Health
– pulls in all kinds of metrics from a hadoop cluster
4) Job Designer
– makes it easier to use for non-tech users
note: available for free (can be locally modified), but not redistributed
window manager based on MooTools

Cloudera Desktop API
– building a reusable API for developing desktop applications
– would like to capture innovation of ecosystem in a single interface
– desktop-api-s

09:40 – Peter Sirota (Amazon, general manager Amazon Elastic Mapreduce – EMR)

motivation: large scale data processing has a lot of MUCK, wanted to fix that.

Use cases for EMR:
– data mining (log processing, clicks analysis)
– bioinformatics (genome analysis)
– financial simulation (monte carlo)
– file processing (resize jpegs, ocr) – a bit unexpected
– web indexing

Customer feedback:
Pros: easy to use and reliable
Challenges: require fluency in mapreduce, and hard to debug

New features:
support for Apache Pig (batch and interactive mode), August 2009
support for Apache Hive 0.4 (batch and interactive mode), TODAY
– extended language to support S3
– specify off-instance-metadata store
– optimized data writes to S3
– reference resources on S3

ANNOUNCEMENT TODAY – Karmasphere Studio for Hadoop – NetBeans IDE
– deploy hadoop jobs to EMR
– monitor progress of EMR job flows
– amazon S3 file browser

ANNOUNCEMENT TODAY – Support for Cloudera’s Hadoop distribution
– can specify Cloudera’s distribution (and get support from Cloudera)
– in private beta

09:51 – Amazon EMR case – eHarmony – Carlos – will present use case for matchmaking system
data: 20 million users, 320 item questionnaire => big data
results: 2% of US marriages
Using Amazon S3 and Elastic Mapreduce
Hive is interesting for doing the analysis

09:58 – Amazon EMR IDE Support – Karmasphere IDE for Hadoop
works with all versions of Hadoop
tightly integrated with EMR (e.g. monitoring and files)

10:05 – Eric Baldeschwieler – Yahoo
Largest contributor, tester and user of Hadoop
Hadoop is driving 2% of marriages in the US!
4 tiers of Hadoop clusters:
1) dev. testing and QA (10% of HW)
– continuous integration and testing
2) proof of concepts and ad-hoc work (10% of HW)
– run the latest version, currently 0.20
3) science and research (60% of HW)
– runs more stable versions, currently 0.20
4) production (20% of HW)
– the most stable version of Hadoop, currently 0.18.3

Yahoo has more than 25000 nodes with Hadoop (4000 nodes per cluster), 82 Petabytes of data.

Why Hadoop@Yahoo?
– 500M users, billions of “transactions”/day, Many petabytes of data
– analysis and data processing key to our business
– need to do this cost effectively
=> Hadoop provides solution to this

Previous job: chief architect for web search at Yahoo
Yahoo frontpage example (use of Hadoop):
– content optimization, search index, ads optimization, spam filters, rss feeds,

Webmap 2008-2009
– 70 hours runtime => 73 hours runtime
– 300TB shuffling => 490TB shuffling
– 200TB output => 280TB output (+55% HW, but more analysis)

Sort benchmark 2008-2009
– 1 terabyte 209 seconds => 62 seconds on 1500 nodes
– 1 petabyte sorted – 16.25 hours, 3700 nodes

Hadoop has Impact on productivity
– research questions answered in days, not months
– moved from research to prod easily

Major factors:
– don’t need to find new HW to experiment
– can work with all your data
– prod. and research on same framework
– no need for R&D to do IT, clusters just work

Search Assist (index for search suggest)
3 years of log-data, 20 steps of mapreduce
before hadoop: 26 days runtime (SMP box), C++, 2-3 weeks dev.time
after hadoop: 20 minutes runtime, python, 2-3 days dev.time

Current Yahoo Development
Hadoop:
– simplifies porting effort (between hadoop versions), freeze APIs, Avro
– GridMix3, Mumak simulator – for performance tuning
– quality engineering
Pig
– Pig – SQL and Metadata, Zebra – column-oriented storage access layer, Multi-query, lots of other optimizations
Oozie

10:35 – Rod Smith, IBM

Customer Scenarios
– BBC Digital Democracy project
– Thomson Reuters
– IBM Emerging Technology Projects: M2 (renamed later to M42)
– insight engine for ad-hoc business insights running on top of Hadoop and Pig
– macro-support (e.g. extract patent information)
– collections (probably renamed to worksheets later)
– visualization (tag cloud)
– example 1: evaluate companies with patent information (1.4 million patents)
– using American Express as case study
– counting patent citations
– example 2: patents in litigation
– quote: “in god we trust, everybody else bring data”

11:04 – Ashish Thusoo – Facebook – Hive data warehousing system

Hadoop
Pros: superior in availability/scalability/manageability, open system, scalable cost
Cons: programmability and metadata, mapreduce hard to program (users know sql/bash/python/perl), need to publish in well-known schemas
=> solution: Hive

Hive: Open and Extensible
– query your own formats and types with serializer/deserializer
– extend SQL functionality through user defined functions
– do any non-SQL TRANSFORM operator (e.g. embed Python)
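
As a small illustration of the TRANSFORM point: the script is just a program reading tab-separated rows on stdin and writing rows to stdout. A hypothetical Python example (the table, column and file names are made up):

    import sys

    # hypothetical script extract_domain.py, wired in with something like:
    #   SELECT TRANSFORM(userid, url) USING 'python extract_domain.py'
    #   AS (userid, domain) FROM page_views;
    for line in sys.stdin:
        fields = line.rstrip('\n').split('\t')
        if len(fields) != 2:
            continue
        userid, url = fields
        domain = url.split('/')[2] if '://' in url else url.split('/')[0]
        print userid + '\t' + domain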

Hive: Smart Execution Plans for Performance
– Hash-based Aggregations
– Map-Side Joins
– Predicate Pushdown
– Partition Pruning
– ++

Interoperability
– JDBC and ODBC interfaces available
– integrations with some traditional SQL tools (e.g. Microstrategy for reports within Facebook) with some minor modifications
– ++

Hive Information
– subproject of Hadoop

— Data Warehousing @ Hadoop —

Data Flow Architecture at Facebook
web server logs -> Scribe -> filers (Hadoop clusters)
to save cost: Scribe/Hadoop integration
Federated MySQL also connected to the Production Hive/Hadoop Cluster
Connected to Oracle RAC and also replicated to an AdHoc Hive cluster

Showed a Picture of Yahoo cluster/datacenter 😀

Dimensions:
4800 cores, 5.5 PB,

Statistics per day:
– 4TB compr.data/day
– 135TB scanned per day
– 7500 Hive jobs/day
– 80K compute hours per day

Hive Simplifies Hadoop:
– New engineers go through a Hive training session
– 200 people/month use it

Applications:
– reporting (daily/weekly aggregations of impression/click counts)
– measures of user engagement
– Microstrategy dashboards

Ad hoc analysis
– how many group admins broken down by state/country

Machine learning (assembling training data)
– ad optimization
– e.g. user engagement as function of user attributes

Facebook Hive contributions
– Hive, HDFS features, Scheduler work
– Talks by Dhruba Borthakur and Zheng Shao in the dev. track

Q from audience: relation to Cassandra?
A: Cassandra serving live traffic,

Q from audience: when to use Pig or Hive?
A: Hive has more SQL support, but Pig is also getting more of that. Hive is very intuitive. If you want interoperability (e.g. with Microstrategy) there are advantages to using Hive. Pig has some nice primitives and supports a more unstructured data model


Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce – info@atbrox.com. See our posting for an example of parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce

Oct 01

The newest and most up-to-date version (May 2010) of this blog post is available at http://mapreducebook.org

An updated and extended version of this blog post can be found here.

Motivation
Learn from the academic literature how the mapreduce parallel model and the hadoop implementation are used to solve algorithmic problems.

Disclaimer: this is work in progress (look for updates)

Input Data – Academic Papers
Google Scholar lists 981 papers citing the original Mapreduce paper from 2004 – approximately 10 thousand pages of citing papers (~ the size of a typical encyclopedia)

What types of papers cite the mapreduce paper?

  1. Algorithmic papers
  2. General cloud overview papers
  3. Cloud infrastructure papers
  4. Future work sections in papers (e.g. “we plan to implement this with Hadoop”)

=> Looked at category 1 papers and skipped the rest

Who wrote the papers?

Search/Internet companies/organizations: eBay, Google, Microsoft, Wikipedia, Yahoo and Yandex.
IT companies: Hewlett Packard and Intel
Universities: Carnegie Mellon Univ., TU Dresden, Univ. of Pennsylvania, Univ. of Central Florida, National Univ. of Ireland, Univ. of Missouri, Univ. of Arizona, Univ. of Glasgow, Berkeley Univ. and National Tsing Hua Univ., Univ. of California, Poznan Univ.

Which areas do the papers cover?

Conclusion
Of the papers looked at, most are focused on IT-related areas; there is a lot left unwritten in academia about mapreduce and hadoop applied to algorithms in other business and technology areas.

Opportunities for following up this posting could be to: 1) describe the algorithms in more detail (e.g. input/output formats), 2) try to classify them by patterns (e.g. similar code structure), 3) offer the opportunity to simulate them in the browser (on toy-sized data sets) and 4) provide links to Hadoop implementations of them.


Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce – info@atbrox.com. See our posting for an example of parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce
