May 24

Atbrox is a startup company providing technology and services for Search and Mapreduce/Hadoop. Our background is from Google, IBM and research.

Update 2010-Nov-15: Amazon cluster compute instances enter 231st place on the Top 500 supercomputing list.

Update 2010-Jul-13: We can remove "towards" from the title of this posting today: Amazon just launched cluster compute instances with 10 Gbps network bandwidth between nodes (and presented a run that enters the Top 500 list at 146th place; I estimate the run cost ~$20k).

The Top 500 list is for supercomputers what the Fortune 500 is for companies. About 80% of the list are supercomputers built by either Hewlett-Packard or IBM; other major supercomputing vendors on the list include Dell, Sun (Oracle), Cray and SGI. The parallel Linpack benchmark result is used as the ranking function for list position (a derived list – the Green 500 – also includes power efficiency in the ranking).

Trends towards Cloud Supercomputing
To our knowledge the entire Top 500 list is currently based on physical supercomputer installations and no cloud computing configurations (i.e. virtual configurations lasting long enough to run the Linpack benchmark); that will probably change within a few years. There are, however, trends towards cloud-based supercomputing already (in particular within consumer internet services and pharmaceutical computations). Here are some concrete examples:

  1. Zynga (online casual games, e.g. Farmville and Mafia Wars)
    Zynga uses 12000 Amazon EC2 nodes (ref: Manager of Cloud Operations at Zynga)

  2. Animoto (online video production service)
    Animoto scaled from 40 to 4000 EC2 nodes in 3 days (ref: CTO, Animoto)

  3. MySpace (social network)
    MySpace simulated 1 million simultaneous users using 800 large EC2 nodes (3200 cores) (ref: highscalability.com)

  4. New York Times
    New York Times used hundreds of EC2 nodes to process their archives in 36 hours (ref: The New York Times Archives + Amazon Web Services = TimesMachine)

  5. Reddit (news service)
    Reddit uses 218 EC2 nodes (ref: I run reddit’s servers)

Examples with (rough) estimates (the estimation arithmetic is sketched in code after this list)

  1. Justin.tv (video service)
    In October 2009 Justin.tv users watched 50 million hours of video, and their cost (reported earlier) was about 1 cent per user-video-hour. A very rough estimate would be monthly costs of 50M * $0.01 = $500k, i.e. 12 * $500k = $6M annually. Assuming that half their costs are computational, this would be about $3M/(24*365*$0.085) ≈ 4029 EC2 nodes running 24×7 through the year; but since they are a video site, bandwidth is probably a significant fraction of the cost, so we cut the rough estimate in half to around 2000 EC2 nodes.
    (ref: Watching TV Together, Miles Apart and Justin.tv wins funding, opens platform)

  2. Newsweek
    Newsweek saves up to $500,000 per year by moving to the cloud. Assuming they cut their spending in half by using the cloud, that would correspond to $500,000/(24 h/day * 365 d/y * $0.085/h) ≈ 670 EC2 nodes running 24×7 through the year (probably a little less due to storage and bandwidth costs)
    (ref: Newsweek.com Explores Amazon Cloud Computing)

  3. Recovery.gov
    Recovery.gov saves up to $420,000 per year by moving to the cloud. Assuming they cut their spending in half by using the cloud, that would correspond to $420,000/(24 h/day * 365 d/y * $0.085/h) ≈ 560 EC2 nodes running 24×7 through the year (probably a little less due to storage and bandwidth costs). (ref: Feds embrace cloud computing; move Recovery.gov to Amazon EC2)
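
To make the arithmetic above explicit, here is a minimal Python sketch of the estimate used in all three examples. The $0.085/hour EC2 on-demand rate and the budget figures are the assumptions stated above.

    # Rough estimate: how many EC2 nodes can an annual compute budget
    # keep running 24x7 at an assumed on-demand rate of $0.085/node-hour?
    HOURS_PER_YEAR = 24 * 365            # 8760 hours
    EC2_RATE = 0.085                     # USD per node-hour (assumption)

    def nodes_24x7(annual_budget_usd):
        return annual_budget_usd / (HOURS_PER_YEAR * EC2_RATE)

    print(int(nodes_24x7(3000000)))      # Justin.tv    -> ~4029 nodes
    print(int(nodes_24x7(500000)))       # Newsweek     -> ~670 nodes
    print(int(nodes_24x7(420000)))       # Recovery.gov -> ~560 nodes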

Other examples of Cloud Supercomputing

  1. Pharmaceutical companies Eli Lilly, Johnson & Johnson and Genentech
    Offloading computations to the cloud (ref: Biotech HPC in the Cloud and The new computing pioneers)

  2. Pathwork Diagnostics
    Using EC2 for cancer diagnostics (ref: Of Unknown Origin: Diagnosing Cancer in the Cloud)

Atbrox on LinkedIn

Best regards,

Amund Tveit, co-founder of Atbrox

Oct 02

Location: Roosevelt Hotel NYC

09:11 – Christophe Bisciglia (Cloudera)

Announcements about the HBase and UI Birds of a Feather (BOF) sessions
Hadoop history overview
Happenings during the last year: Hive, Pig, Sqoop (data import) ++
Yesterday: Vertica announced mapreduce support for their database system
Walkthrough of Cloudera's Distribution for Hadoop
ANNOUNCEMENT: deploy Cloudera's distribution for Hadoop on SoftLayer and Rackspace

09:23 – Jeff Hammerbacher (Cloudera)
started his career at Bear Stearns
– Cloudera is a software company with Apache Hadoop at the core
There is a lot more software to be built:
1) collection
2) processing
3) reporting and analysis

The Apache Hadoop community is the center of innovation for Big Data
– Yahoo pushing the envelope on scalability
– Large clusters for academic research (Yahoo, HP and Intel's Open Cirrus)
– NSF, IBM and Google's CluE program
– SIGMOD best paper award: Pig team from Yahoo
– Worldwide: Hadoop World Beijing

Cloudera Desktop
4 applications running on this desktop (inside the browser)
1) HDFS Web Interface
– file browser
2) Hadoop Mapreduce Web Interface (can potentially debug)
– Job Browser (Cluster Detail)
3) Cluster Health
– pulls in all kinds of metrics from a hadoop cluster
4) Job Designer
– makes it easier to use for non-tech users
note: available for free (can be modified locally), but cannot be redistributed
window manager based on MooTools

Cloudera Desktop API
– building a reusable API for developing desktop applications
– would like to capture the innovation of the ecosystem in a single interface
– desktop-api-s

09:40 – Peter Sirota (Amazon, general manager Amazon Elastic Mapreduce – EMR)

motivation: large scale data processing has a lot of MUCK, wanted to fix that.

Use cases for EMR:
– data mining (log processing, clicks analysis)
– bioinformatics (genome analysis)
– financial simulation (monte carlo)
– file processing (resize jpegs, ocr) – a bit unexpected
– web indexing

Customer feedback:
Pros: easy to use and reliable
Challenges: requires fluency in mapreduce, and is hard to debug

New features (a programmatic job-flow launch sketch follows below):
support for Apache Pig (batch and interactive mode), August 2009
support for Apache Hive 0.4 (batch and interactive mode), TODAY
– extended language to support S3
– specify off-instance-metadata store
– optimized data writes to S3
– reference resources on S3
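
(Aside, not from the talk: launching an EMR job flow programmatically looked roughly like the sketch below, using the boto Python library of that era; bucket names and paths are placeholders.)

    # Hypothetical sketch: start an Elastic Mapreduce job flow with boto.
    import boto
    from boto.emr.step import StreamingStep

    conn = boto.connect_emr()            # AWS credentials from the environment
    step = StreamingStep(
        name="word count",
        mapper="s3n://mybucket/scripts/mapper.py",  # placeholder paths
        reducer="aggregate",             # EMR's built-in aggregate reducer
        input="s3n://mybucket/input/",
        output="s3n://mybucket/output/")
    jobid = conn.run_jobflow(
        name="example job flow",
        log_uri="s3n://mybucket/logs/",
        steps=[step])
    print(conn.describe_jobflow(jobid).state)  # e.g. STARTING, RUNNING, ...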

ANNOUNCEMENT TODAY – Karmasphere Studio for Hadoop – NetBeans IDE
– deploy hadoop jobs to EMR
– monitor progress of EMR job flows
– amazon S3 file browser

ANNOUNCEMENT TODAY – Support for Cloudera’s Hadoop distribution
– can specify Cloudera’s distribution (and get support from Cloudera)
– in private beta

09:51 – Amazon EMR case – eHarmony – Carlos – use case for their matchmaking system
data: 20 million users, a 320-item questionnaire => big data
results: 2% of US marriages
Using Amazon S3 and Elastic Mapreduce
Interested in Hive for doing analysis

09:58 – Amazon EMR IDE Support – Karmasphere IDE for Hadoop
works with all versions of Hadoop
tightly integrated with EMR (e.g. monitoring and files)

10:05 – Eric Baldeschwieler – Yahoo
Largest contributor, tester and user of Hadoop
Hadoop is driving 2% of marriages in the US!
4 tiers of Hadoop clusters:
1) dev. testing and QA (10% of HW)
– continuous integration and testing
2) proof of concepts and ad-hoc work (10% of HW)
– run the latest version, currently 0.20
3) science and research (60% of HW)
– runs more stable versions, currently 0.20
4) production (20% of HW)
– the most stable version of Hadoop, currently 0.18.3

Yahoo has more than 25000 nodes with Hadoop (4000 nodes per cluster), 82 Petabytes of data.

Why Hadoop@Yahoo?
– 500M users, billions of “transactions”/day, Many petabytes of data
– analysis and data processing key to our business
– need to do this cost effectively
=> Hadoop provides solution to this

Previous job: chief architect for web search at Yahoo
Yahoo frontpage example (use of Hadoop):
– content optimization, search index, ads optimization, spam filters, RSS feeds

Webmap 2008-2009
– 70 hours runtime => 73 hours runtime
– 300TB shuffling => 490TB shuffling
– 200TB output => 280TB output (+55% HW, but more analysis)

Sort benchmark 2008-2009
– 1 terabyte sorted: 209 seconds => 62 seconds on 1500 nodes
– 1 petabyte sorted: 16.25 hours on 3700 nodes

Hadoop has Impact on productivity
– research questions answered in days, not months
– moved from research to prod easily

Major factors:
– don’t need to find new HW to experiment
– can work with all your data
– prod. and research on same framework
– no need for R&D to do IT, clusters just work

Search Assist (index for search suggestions)
3 years of log data, 20 steps of mapreduce
before Hadoop: 26 days runtime (SMP box), C++, 2-3 weeks dev. time
after Hadoop: 20 minutes runtime, Python, 2-3 days dev. time (a minimal streaming sketch follows below)
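
(Not from the talk: as a flavor of such a Python log-aggregation job, here is a minimal hypothetical Hadoop Streaming mapper/reducer pair counting query frequencies; the log layout is made up.)

    #!/usr/bin/env python
    # mapper.py -- emits (query, 1) per search-log line.
    # Assumed (made-up) layout: timestamp<TAB>query<TAB>...
    import sys
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            print(fields[1] + "\t1")

    #!/usr/bin/env python
    # reducer.py -- sums counts per query (input arrives sorted by key).
    import sys
    current, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print(current + "\t" + str(count))
            current, count = key, 0
        count += int(value)
    if current is not None:
        print(current + "\t" + str(count))

Run with the standard streaming jar: hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input logs/ -output counts/ (plus -file flags to ship the scripts).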

Current Yahoo Development
Hadoop:
– simplifies porting effort (between hadoop versions), freeze APIs, Avro
– GridMix3, Mumak simulator – for performance tuning
– quality engineering
Pig:
– Pig – SQL and metadata, Zebra – column-oriented storage access layer, multi-query, lots of other optimizations
Oozie

10:35 – Rod Smith (IBM)

Customer Scenarios
– BBC Digital Democracy project
– Thomson Reuters
– IBM Emerging Technology Projects: M2 (renamed later to M42)
– insight engine for ad-hoc business insights running on top of Hadoop and Pig
– macro-support (e.g. extract patent information)
– collections (probably renamed to worksheets later)
– visualization (tag cloud)
– example 1: evaluate companies with patent information (1.4 million patents)
– using American Express as case study
– counting patent citations
– example 2: patents in litigation
– quote: “in god we trust, everybody else bring data”

11:04 – Ashish Thusoo – Facebook – Hive data warehousing system

Hadoop
Pros: superior in availability/scalability/manageability, open system, scalable cost
Cons: programmability and metadata; mapreduce is hard to program (users know SQL/bash/Python/Perl), and data needs to be published in well-known schemas
=> solution: Hive

Hive: Open and Extensible
– query your own formats and types with serializers/deserializers
– extend SQL functionality through user-defined functions
– apply arbitrary non-SQL transformations with the TRANSFORM operator (e.g. embed Python; a sketch follows below)
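
(Not from the talk: the TRANSFORM point above means Hive can stream table rows through an arbitrary script. A minimal hypothetical example in Python, with the invoking HiveQL shown in the comments; table and columns are made up.)

    #!/usr/bin/env python
    # extract_domain.py -- a script for Hive's TRANSFORM operator.
    # Invoked from HiveQL roughly as (hypothetical table/columns):
    #   ADD FILE extract_domain.py;
    #   SELECT TRANSFORM (url, clicks)
    #     USING 'python extract_domain.py' AS (domain, clicks)
    #   FROM click_log;
    # Hive streams tab-separated rows on stdin and reads rows from stdout.
    import sys
    from urllib.parse import urlparse

    for line in sys.stdin:
        url, clicks = line.rstrip("\n").split("\t")
        print(urlparse(url).netloc + "\t" + clicks)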

Hive: Smart Execution Plans for Performance
– Hash-based Aggregations
– Map-Side Joins (sketched below)
– Predicate Pushdown
– Partition Pruning
– ++
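
(Not from the talk: a map-side join avoids the shuffle phase by replicating a small table to every mapper. A minimal hypothetical Hadoop Streaming sketch in Python; the file name and record layouts are made up.)

    #!/usr/bin/env python
    # map_join.py -- map-side (broadcast) join for Hadoop Streaming.
    # The small table users.tsv (user_id<TAB>country) is shipped to every
    # mapper (e.g. with -files users.tsv) and loaded into memory; the
    # large fact table streams in on stdin. No reduce phase is needed
    # (run with -numReduceTasks 0). Hive implements the same idea
    # internally for its map-side joins.
    import sys

    users = {}
    with open("users.tsv") as f:
        for line in f:
            user_id, country = line.rstrip("\n").split("\t")
            users[user_id] = country

    # stdin rows (made-up layout): user_id<TAB>clicks
    for line in sys.stdin:
        user_id, clicks = line.rstrip("\n").split("\t")
        print(users.get(user_id, "UNKNOWN") + "\t" + clicks)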

Interoperability
– JDBC and ODBC interfaces available
– integrations with some traditional SQL tools (e.g. MicroStrategy for reports within Facebook) with some minor modifications
– ++

Hive Information
– subproject of Hadoop

— Data Warehousing @ Hadoop —

Data Flow Architecture at Facebook
web server logs -> Scribe -> filers (Hadoop clusters)
to save cost: Scribe/Hadoop integration
Federated MySQL is also connected to the production Hive/Hadoop cluster
Connected to Oracle RAC and also replicated to an ad hoc Hive cluster

Showed a Picture of Yahoo cluster/datacenter 😀

Dimensions:
4800 cores, 5.5 PB of storage

Statistics per day:
– 4TB of compressed data added per day
– 135TB scanned per day
– 7500 Hive jobs/day
– 80K compute hours per day

Hive Simplifies Hadoop:
– New engineers go through a Hive training session
– 200 people/month use it

Applications:
– reporting (daily/weekly aggregations of impression/click counts)
– measures of user engagement
– MicroStrategy dashboards

Ad hoc analysis
– how many group admins broken down by state/country

Machine learning (assembling training data)
– ad optimization
– e.g. user engagement as function of user attributes

Facebook Hive contributions
– Hive, HDFS features, Scheduler work
– Talks by Dhruba Borthakur and Zheng Shao in the developer track

Q from audience: relation to Cassandra?
A: Cassandra serves live traffic, while Hive/Hadoop handle offline analysis

Q from audience: when to use Pig or Hive?
A: Hive has more SQL support, but Pig is also getting more of that. Hive is very intuitive. If you want interoperability (e.g. with MicroStrategy), there are advantages to using Hive. Pig has some nice primitives and supports a more unstructured data model.


Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce – info@atbrox.com. See our posting for an example of parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce.
