Oct 27

SimpleDB is a service primarily for storing and querying structured data (it can e.g. be used for a product catalog with descriptive features per product, or an academic event service with extracted features such as event dates, locations, organizers and topics). If one wants “heavier” data in SimpleDB, e.g. video or images, a good approach is to store paths to Hadoop DFS or S3 objects in the attributes instead of storing the data directly.

Unstructured Search for SimpleDB

This posting presents an approach for adding (flexible) unstructured search support to SimpleDB (with some preliminary query latency numbers below and very preliminary Python code). The motivation is to:
  1. Support unstructured search with very low maintenance
  2. Combine structured and unstructured search
  3. Figure out the feasibility of unstructured search on top of SimpleDB

The Structure of SimpleDB

SimpleDB is roughly a persistent hashtable of hashtables, where each row (a named item in the outer hashtable) has another hashtable with up to 256 key-value pairs (called attributes). Each attribute value can be up to 1024 bytes, so 256 kilobytes in total in the values per row (note: twice that amount if you also store data as part of the keys, plus 1024 bytes in the item name). Check out Wikipedia for detailed SimpleDB storage characteristics.
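
A minimal sketch of this item/attribute structure, assuming the boto library (the domain, item and attribute names here are illustrative, not from the original code):

import boto

# connect using AWS credentials from the environment or the boto config file
sdb = boto.connect_sdb()
domain = sdb.create_domain('events')            # the outer "hashtable"

# one row: a named item holding key-value attributes (the inner "hashtable")
domain.put_attributes('event42', {
    'title': 'Hadoop World NYC',
    'date': '2009-10-02',
    'location': 'Roosevelt Hotel NYC',
})

print(domain.get_attributes('event42'))         # read the attributes back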

Inverted files

Inverted files are a common way of representing indices for unstructured search. In their basic form they (logically) map each word to a list of the pages or files the word occurs in. When a query arrives, one looks up the query words in the inverted file and finds the pages or files where they occur. (note: if you are curious about inverted file representations, check out the survey Inverted files for text search engines)
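
To make the idea concrete, here is a toy in-memory inverted index in Python (purely illustrative, not part of the SimpleDB code):

from collections import defaultdict

def build_inverted_index(docs):
    # docs: dict mapping doc_id -> text; returns term -> set of doc_ids
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    # AND semantics: return the doc_ids that contain every query term
    result = None
    for term in query.lower().split():
        postings = index.get(term, set())
        result = postings if result is None else result & postings
    return result if result is not None else set()

docs = {'d1': 'structured product catalog data',
        'd2': 'unstructured search on SimpleDB'}
index = build_inverted_index(docs)
print(search(index, 'unstructured search'))    # prints the set containing 'd2'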

One way of representing inverted files on SimpleDB is to map the inverted file onto the attributes, i.e. have one SimpleDB domain where each item is a word (term), and let the attributes store the list of URLs containing that word. Since each URL contains many words, it can be useful to have a separate SimpleDB domain with a mapping from hash of URL to URL, and use the URL hash in the inverted file (this keeps the inverted file smaller). In the draft code we created 250 key-value attributes, where each key was a string from “0” to “249” and each corresponding value contained hashes of URLs (and positions of the term) joined with two different string separators. If there is too little space per item – e.g. for stop words – one could “wrap” the inverted file entry by adding the same term with an incremental postfix (note: if that also gave too little space, one could also wrap across SimpleDB domains).
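
A minimal sketch of such packing, assuming the layout described above; the separator characters, the hash function and all names here are illustrative assumptions, not the exact ones from the draft code:

import hashlib

NUM_ATTRS = 250        # attribute keys "0" .. "249"
MAX_VALUE_LEN = 1024   # SimpleDB limit per attribute value
POSTING_SEP = '|'      # assumed separator between postings in one attribute value
FIELD_SEP = ','        # assumed separator between URL hash and term positions

def url_hash(url):
    # short hash used instead of the full URL to keep the entry small
    return hashlib.md5(url.encode('utf-8')).hexdigest()[:12]

def pack_postings(postings):
    # postings: list of (url, [positions]) pairs for one term
    # returns a dict of SimpleDB attributes, e.g. {"0": "...", "1": "..."}
    attrs, current, idx = {}, '', 0
    for url, positions in postings:
        entry = url_hash(url) + FIELD_SEP + FIELD_SEP.join(str(p) for p in positions)
        candidate = entry if not current else current + POSTING_SEP + entry
        if current and len(candidate) > MAX_VALUE_LEN:
            attrs[str(idx)] = current
            idx += 1
            if idx >= NUM_ATTRS:
                raise ValueError('term too large for one item; wrap with a postfix')
            candidate = entry
        current = candidate
    if current:
        attrs[str(idx)] = current
    return attrs

# e.g. index_domain.put_attributes(term, pack_postings(postings)) with a boto domain as above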

Preliminary query latency results

Warning: the data set used was NLTK's inaugural address collection, so far from the biggest.

Inverted file entry fetch latency distribution (in seconds)

Conclusion: the results from 1000 fetches of inverted file entries are relatively stable, clustered around 0.020s (20 milliseconds), which is promising enough to pursue further (but it is still early to decide, given that only small data sets have been tested so far). Adding a cache such as memcached could also be explored, in order to get the average fetch time even lower.

Preliminary Python code including timing results (this was run on a Fedora large EC2 node somewhere in a US east coast data center).
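
That code is not reproduced here, but a rough sketch of how such fetch latencies could be measured with boto might look as follows (the domain name and the sample terms are assumptions):

import time
import boto

sdb = boto.connect_sdb()
index_domain = sdb.get_domain('inverted_index')   # hypothetical domain name

def timed_fetch(term):
    start = time.time()
    attrs = index_domain.get_attributes(term)
    return time.time() - start, attrs

latencies = []
for term in ['government', 'people', 'freedom']:   # assumed sample terms
    elapsed, _ = timed_fetch(term)
    latencies.append(elapsed)

latencies.sort()
print('min %.3fs  median %.3fs  max %.3fs' % (
    latencies[0], latencies[len(latencies) // 2], latencies[-1]))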

Oct 07

Sometimes it can be useful to compile Python code for Amazon's Elastic Mapreduce into C++ and then into a binary. The motivation for that could be to integrate with (existing) C or C++ code, or to increase performance for CPU-intensive mapper or reducer methods. Here follows a description of how to do that:

  1. Start a small EC2 node with AMI similar to the one Elastic Mapreduce is using (Debian Lenny Linux)
  2. Skim quickly through the Shedskin tutorial
  3. Log into the EC2 node and install the Shedskin Python compiler
  4. Write your Python mapper or reducer program and compile it into C++ with Shedskin (a minimal mapper sketch follows after this list)
    • E.g. the command python ss.py mapper.py would generate the C++ files mapper.hpp and mapper.cpp, a Makefile and an annotated Python file mapper.ss.py.
  5. Optionally update the C++ code generated by Shedskin to use other C or C++ libraries
    • note: with Fortran-to-C you can probably integrate your Python code with existing Fortran code (e.g. numerical/high-performance computing libraries). Similarly for Cobol (e.g. in the financial industry) with OpenCobol (which compiles Cobol into C). Please let us know if you try it or need help with that.
  6. Add -static as the first CCFLAGS parameter in the generated Makefile to make it a static executable
  7. Compile the C++ code into a binary with make and check that you don’t get a dynamic executable with ldd (you want a static executable)
  8. Run strip on the binary to make it smaller
  9. Upload your (ready) binary to a chosen location in Amazon S3
  10. Read Elastic Mapreduce Documentation on how to use the binary to run Elastic Mapreduce jobs.
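
As a starting point for step 4, here is a minimal word-count mapper for Hadoop streaming written in plain Python; whether every construct falls inside the subset Shedskin accepts depends on the Shedskin version, so treat it as a sketch to adapt rather than verified, compilable code:

# mapper.py - hypothetical example, not from the original post
import sys

def main():
    # Hadoop streaming feeds one input record per line on stdin
    line = sys.stdin.readline()
    while line:
        for word in line.strip().split():
            # Hadoop streaming expects tab-separated key/value pairs on stdout
            sys.stdout.write(word + '\t1\n')
        line = sys.stdin.readline()

if __name__ == '__main__':
    main()

After compiling it (python ss.py mapper.py, adding -static to CCFLAGS, make, strip), the resulting binary can be uploaded to S3 and referenced as the mapper in an Elastic Mapreduce streaming job flow, as described in the steps above.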

Note: if you skip the Shedskin-related steps, this approach also works if you are looking for how to use C or C++ mappers or reducers with Elastic Mapreduce.

Note: this approach should probably also work with Cloudera's distribution for Hadoop.


Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce – info@atbrox.com. See our posting for an example of parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce.
Oct 02

Location: Roosevelt Hotel NYC

09:11 – Christophe Bisciglia (Cloudera)

Announcements about the HBase and UI Birds of a Feather (BOF) sessions
Hadoop history overview
happenings during the last year: Hive, Pig, Sqoop (data import) ++
yesterday: Vertica announced mapreduce support for their database system
Walkthrough of Cloudera's distribution for Hadoop
ANNOUNCEMENT: deploy Cloudera's distribution for Hadoop on SoftLayer and Rackspace

09:23 – Jeff Hammerbacher (Cloudera)
started his career at Bear Stearns
– Cloudera is a software company with Apache Hadoop at the core
There is a lot more software to be built:
1) collection
2) processing
3) reporting and analysis

The Apache Hadoop community is the center of innovation for Big Data
– Yahoo pushing the envelope on scalability
– Large clusters for academic research (Yahoo, HP and Intel's Open Cirrus)
– NSF, IBM and Google's CluE program
– SIGMOD best paper award: Pig team from Yahoo
– worldwide: Hadoop World Beijing

Cloudera Desktop
4 applications running on this desktop (inside the browser)
1) HDFS Web Interface
– file browser
2) Hadoop Mapreduce Web Interface (can potentially debug)
– Job Browser (Cluster Detail)
3) Cluster Health
– pulls in all kinds of metrics from a hadoop cluster
4) Job Designer
– makes it easier to use for non-technical users
note: available for free (can be locally modified), but may not be redistributed
window manager based on MooTools

Cloudera Desktop API
– building a reusable API for developing desktop applications
– would like to capture innovation of ecosystem in a single interface
– desktop-api-s

09:40 – Peter Sirota (Amazon, general manager of Amazon Elastic Mapreduce – EMR)

motivation: large-scale data processing involves a lot of muck; wanted to fix that.

Use cases for EMR:
– data mining (log processing, clicks analysis)
– bioinformatics (genome analysis)
– financial simulation (monte carlo)
– file processing (resizing JPEGs, OCR) – a bit unexpected
– web indexing

Customer feedback:
Pros: easy to use and reliable
Challenges: require fluency in mapreduce, and hard to debug

New features:
support for Apache Pig (batch and interactive mode), August 2009
support for Apache Hive 0.4 (batch and interactive mode), TODAY
– extended language to support S3
– specify off-instance-metadata store
– optimized data writes to S3
– reference resources on S3

ANNOUNCEMENT TODAY – Karmasphere Studio for Hadoop – NetBeans IDE
– deploy hadoop jobs to EMR
– monitor progress of EMR job flows
– Amazon S3 file browser

ANNOUNCEMENT TODAY – Support for Cloudera’s Hadoop distribution
– can specify Cloudera’s distribution (and get support from Cloudera)
– in private beta

09:51 – Amazon EMR case study – eHarmony – Carlos – presents the use case for their matchmaking system
data: 20 million users, 320-item questionnaire => big data
results: 2% of US marriages
Using Amazon S3 and Elastic Mapreduce
Hive is interesting for doing analysis

09:58 – Amazon EMR IDE support – Karmasphere IDE for Hadoop
works with all versions of Hadoop
tightly integrated with EMR (e.g. monitoring and files)

10:05 – Eric Baldeschwieler – Yahoo
Largest contributor, tester and user of Hadoop
Hadoop is driving 2% of marriages in the US!
4 tiers of Hadoop clusters:
1) dev. testing and QA (10% of HW)
– continuous integration and testing
2) proof of concepts and ad-hoc work (10% of HW)
– run the latest version, currently 0.20
3) science and research (60% of HW)
– runs more stable versions, currently 0.20
4) production (20% of HW)
– the most stable version of Hadoop, currently 0.18.3

Yahoo has more than 25,000 nodes running Hadoop (4,000 nodes per cluster) and 82 petabytes of data.

Why Hadoop@Yahoo?
– 500M users, billions of “transactions”/day, many petabytes of data
– analysis and data processing key to our business
– need to do this cost effectively
=> Hadoop provides solution to this

Previous job: chief architect for web search at Yahoo
Yahoo frontpage example (use of Hadoop):
– content optimization, search index, ads optimization, spam filters, RSS feeds

Webmap 2008-2009
– 70 hours runtime => 73 hours runtime
– 300TB shuffling => 490TB shuffling
– 200TB output => 280TB output (+55% HW, but more analysis)

Sort benchmark 2008-2009
– 1 terabyte 209 seconds => 62 seconds on 1500 nodes
– 1 petabyte sorted – 16.25 hours, 3700 nodes

Hadoop has Impact on productivity
– research questions answered in days, not months
– moved from research to prod easily

Major factors:
– don’t need to find new HW to experiment
– can work with all your data
– prod. and research on same framework
– no need for R&D to do IT, clusters just work

Search Assist (index for search suggest)
3 years of log-data, 20 steps of mapreduce
before Hadoop: 26 days runtime (SMP box), C++, 2-3 weeks dev. time
after Hadoop: 20 minutes runtime, Python, 2-3 days dev. time

Current Yahoo Development
Hadoop:
– simplifies porting effort (between Hadoop versions), freeze APIs, Avro
– GridMix3, Mumak simulator – for performance tuning
– quality engineering
Pig
– Pig – SQL and Metadata, Zebra – column-oriented storage access layer, Multi-query, lots of other optimizations
Oozie

10:35 – Rod Smith (IBM)

Customer Scenarios
– BBC Digital Democracy project
– Thomson Reuters
– IBM Emerging Technology Projects: M2 (renamed later to M42)
– insight engine for ad-hoc business insights running on top of Hadoop and Pig
– macro-support (e.g. extract patent information)
– collections (probably renamed to worksheets later)
– visualization (tag cloud)
– example 1: evaluate companies with patent information (1.4 million patents)
– using American Express as case study
– counting patent citations
– example 2: patents in litigation
– quote: “In God we trust, everybody else bring data”

11:04 – Ashish Thusoo (Facebook) – Hive data warehousing system

Hadoop
Pros: superior in availability/scalability/manageability, open system, scalable cost
Cons: programmability and metadata; MapReduce is hard to program (users know SQL/bash/Python/Perl); need to publish data in well-known schemas
=> solution: Hive

Hive: Open and Extensible
– query your own formats and types with serializer/deserializer
– extend SQL functionality through user defined functions
– run arbitrary non-SQL code through the TRANSFORM operator (e.g. embed Python – see the sketch below)
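
To illustrate the TRANSFORM point, here is a sketch of the kind of Python script that can be plugged in: Hive streams each input row to the script as tab-separated fields on stdin and reads tab-separated output rows from stdout. The column layout and the counting logic are illustrative assumptions, not from the talk.

import sys

for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    user_id, country = fields[0], fields[1]    # assumed input columns
    # emit (country, 1) so a later GROUP BY in the HiveQL query can count users per country
    sys.stdout.write(country + '\t1\n')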

Hive: Smart Execution Plans for Performance
– Hash-based Aggregations
– Map-Side Joins
– Predicate Pushdown
– Partition Pruning
– ++

Interoperability
– JDBC and ODBC interfaces available
– integrations with some traditional SQL tools (e.g. MicroStrategy for reports within Facebook) with some minor modifications
– ++

Hive Information
– subproject of Hadoop

— Data Warehousing @ Hadoop —

Data Flow Architecture at Facebook
web server logs -> Scribe -> filers (Hadoop clusters)
to save cost: Scribe/Hadoop integration
Federated MySQL also connected to the Production Hive/Hadoop Cluster
Connected to Oracle RAC and also replicated to an ad hoc Hive cluster

Showed a Picture of Yahoo cluster/datacenter 😀

Dimensions:
4,800 cores, 5.5 PB

Statistics per day:
– 4 TB of compressed data per day
– 135TB scanned per day
– 7500 Hive jobs/day
– 80K compute hours per day

Hive Simplifies Hadoop:
– New engineers go through a Hive training session
– 200 people/month use it

Applications:
– reporting (daily/weekly aggregations of impression/click counts)
– measures of user engagement
– MicroStrategy dashboards

Ad hoc analysis
– how many group admins broken down by state/country

Machine learning (assembling training data)
– ad optimization
– e.g. user engagement as function of user attributes

Facebook Hive contributions
– Hive, HDFS features, Scheduler work
– Talks by Dhruba Borthakur and Zheng Shao in the dev track

Q from audience: relation to Cassandra?
A: Cassandra is serving live traffic.

Q from audience: when to use Pig or Hive?
A: Hive has more SQL support, but Pig is also getting more of that. Hive is
very intuitive. If you want interoperability (e.g. with MicroStrategy) there are
advantages to using Hive. Pig has some nice primitives and
supports a more unstructured data model.


Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce – info@atbrox.com. See our posting for an example of parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce.

Sep 07

We are here to help you:

  • Understand if and how the cloud can be cost-efficient in your setting
  • Efficiently analyze large data sets using the cloud
  • Architect, develop and deploy scalable and reliable software for the cloud
  • Adapt and migrate your existing data and software to the cloud

Technologies and methods we (non-exclusively) use:

Our motto is Simplicity, Automation and Scalability

If you are considering using cloud computing, please drop us a line at info (at) atbrox.com
