Hadoop World 2009 – some notes from morning session

Location: Roosevelt Hotel NYC

09:11 – Christophe Bisciglia (Cloudera)

Announcement about BOFs HBASE and UI Birds of a Feather
Hadoop history overview
happenings during the last year: Hive, Pig, Sqoop (data import) ++
yesterday: Vertica announced mapreduce support for their database system
Walkthrough of Clouderas distribution for Hadoop
ANNOUNCEMENT: deploy clouderas dist. for hadoop on softlayer and rackspace

09:23 – Jeff Hammerbacher (Cloudera)
started his career at bear sterns
– Cloudera is a software company with Apache Hadoop at the core
There is a lot more sw to be built:
1) collection,
2) processing
3) report and analysis

The Apache Hadoop community is the center of innovation for Big Data
– Yahoo pushing env. on scalability
– Large clusters for academic research (yahoo, hp and intels open cirrus)
– nsf, ibm and google’s clue
– sigmod best paper award: Pig team from Yahoo
– worldwide – Hadoop world beijing

Cloudera Desktop
4 applications running on this desktop (inside the browser)
1) HDFS Web Interface
– file browser
2) Hadoop Mapreduce Web Interface (can potentially debug)
– Job Browser (Cluster Detail)
3) Cluster Health
– pulls in all kinds of metrics from a hadoop cluster
4) Job Designer
– makes it easier to use for non-tech users
note: available for free (can be locally modified), but not redistribute
window manager based on MooTools

Cloudera Desktop API
– building a reusable API for dev. dekstop appl
– would like to capture innovation of ecosystem in a single interface
– desktop-api-s

0940 – Peter Sirota (Amazon, general manager Amazon Elastic Mapreduce – EMR)

motivation: large scale data processing has a lot of MUCK, wanted to fix that.

Use cases for EMR:
– data mining (log processing, clicks analysis)
– bioinformatics (genome analysis)
– financial simulation (monte carlo)
– file processing (resize jpegs, ocr) – a bit unexpected
– web indexing

Customer feedback:
Pros: easy to use and reliable
Challenges: require fluency in mapreduce, and hard to debug

New features:
support for Apache Pig (batch and interactive mode), August 2009
support for Apache Hive 0.4 (batch and interactive mode), TODAY
– extended language to support S3
– specify off-instance-metadata store
– optimized data writes to S3
– reference resources on S3

ANNOUNCEMENT TODAY – Karmashpere Studio for Hadoop – Netbeans IDE
– deploy hadoop jobs to EMR
– monitor progress of EMR job flows
– amazon S3 file browser

ANNOUNCEMENT TODAY – Support for Cloudera’s Hadoop distribution
– can specify Cloudera’s distribution (and get support from Cloudera)
– in private beta

0951 – Amazon EMR case – eHarmony – Carlos – Will present Use case for matchmaking system
data: 20 million users, 320 item questionaire => big data
results: 2% of US marriages
Using Amazon, S3 and Elastic Mapreduce
Interesting with HIVE to do analysis

0958 – Amazon EMR IDE Support – Karmasphere IDE for Hadoop
works with all versions of Hadoop
tighly integrated with EMR (e.g. monitoring and files)

1005 – Eric Baldeschwieler – Yahoo
Largest contributor, tester and user of Hadoop
Hadoop is driving 2% of marriages in the US!
4 tiers of Hadoop clusters:
1) dev. testing and QA (10% of HW)
– continuous integration and testing
2) proof of concepts and ad-hoc work (10% of HW)
– run the latest version, currently 0.20
3) science and research (60% of HW)
– runs more stable versions, currently 0.20
4) production (20% of HW)
– the most stable version of Hadoop, currently 0.18.3

Yahoo has more than 25000 nodes with Hadoop (4000 nodes per cluster), 82 Petabytes of data.

Why Hadoop@Yahoo?
– 500M users, billions of “transactions”/day, Many petabytes of data
– analysis and data processing key to our business
– need to do this cost effectively
=> Hadoop provides solution to this

Previous job: chief architect for web search at Yahoo
Yahoo frontpage example (use of Hadoop):
– content optimization, search index, ads optimization, spam filters, rss feeds,

Webmap 2008-2009
– 70 hours runtime => 73 hours runtime
– 300TB shuffling => 490TB shuffling
– 200TB output -> 280TB (+55% HW, but more analysis

Sort benchmark 2008-2009
– 1 terabyte 209 seconds => 62 seconds on 1500 nodes
– 1 petabyte sorted – 16.25 hours, 3700 nodes

Hadoop has Impact on productivity
– research questions answered in days, not months
– moved from research to prod easily

Major factors:
– don’t need to find new HW to experiment
– can work with all your data
– prod. and research on same framework
– no need for R&D to do IT, clusters just work

Search Assist (index for search suggest)
3 years of log-data, 20 steps of mapreduce
before hadoop: 26 days runtime (SMP box), C++, 2-3 weeks dev.time
after hadoop: 20 minutes runtime, python, 2-3 days dev.time

Current Yahoo Development
– simplifies porting effort (between hadoop versions), freeze APIs, Avro
– GridMix3, Mumak simulator – for performance tuning
– quality engineering
– Pig – SQL and Metadata, Zebra – column-oriented storage access layer, Multi-query, lots of other optimizations

1035 Rod Smith, IBM

Customer Scenarios
– BBC Digital Democracy project
– Thomson Reuters
– IBM Emerging Technology Projects: M2 (renamed later to M42)
– insight engine for ad-hoc business insights running ontop of Hadoop and Pig
– macro-support (e.g. extract patent information)
– collections (probably renamed to worksheets later)
– visualization (tag cloud)
– example 1: evaluate companies with patent information(1.4 million patents)
– using American Express as case study
– counting patent citations
– example 2: patents in litigation
– quote: “in god we trust, everybody else bring data”

1104 Ashish Thusoo – Facebook – Hive datawarehousing system

Pros: superior in availability/scalability/manageability, open system, scalable cost
Cons: programmability and metadata, mapreduce hard to program (users know sql/bash/python/perl), need to publish in well-known schemas
=> solution: Hive

Hive: Open and Extensible
– query your own formats and types with serializer/deserializer
– extend SQL functionality through user defined functions
– do any non-SQL TRANSFORM operator (e.g. embed Python)

Hive: Smart Execution Plans for Performance
– Hash-based Aggregations
– Map-Side Joins
– Predicate Pushdown
– Partition Pruning
– ++

– JDBC and ODBC interfaces available
– integrations with some traditional SQL tools (e.g. Microstrategy for reports within Facebook) with some minor modifications
– ++

Hive Information
– subproject of Hadoop

— Date Warehousing @ Hadoop —

Data Flow Architecture at Facebook
web server logs -> Scribe -> filers (Hadoop clusters)
to save cost: Scribe/Hadoop integration
Federated MySQL also connected to the Production Hive/Hadoop Cluster
Connected to Oracle BAC and also replicated to an AdHoc Hive cluster

Showed a Picture of Yahoo cluster/datacenter 😀

4800 cores, 5.5 PB,

Statistics per day:
– 4TB compr.data/day
– 135TB scanned per day
– 7500 Hive jobs/day
– 80K compute hours per day

Hive Simplifies Hadoop:
– New engineers go through a Hive training session
– 200 people/moth use it

– reporting (daily/weekly aggregations of impression/click counts)
– measures of user engagement
– microstragy dashboards

Ad hoc analysis
– how many group admins broken down by state/country

Machine learning (assembling training data)
– ad optimization
– e.g. user engagement as function of user attributes

Facebook Hive contributions
– Hive, HDFS features, Scheduler work
– Talks by Dhruba Borhthakur and Zheng Shao in the dev.track

Q from audience: relation to Cassandra?
A: Cassandra serving live traffic,

Q from audience: when to use Pig or Hive?
A: Hive has more SQL support, but Pig also gets more of that. Hive is
very intuitive. If you want interoperability (e.g. microstrategy)
advantages with using Hive. Pig has some nice primitatives and
supports more unstructured data model

Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce – info@atbrox.com. See our posting for an example parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce

This entry was posted in cloud computing and tagged , , , , , . Bookmark the permalink.

3 Responses to Hadoop World 2009 – some notes from morning session

  1. Pingback: Hadoop World NYC

  2. Pingback: » Hadoop World NYC hilarymason.com