Feb 28

Yesterday Yahoo announced the Learning to Rank Challenge – a pretty interesting challenge (as the somewhat similar Netflix Prize Challenge also was).

Data and Problem
The data sets contain (in my interpretation) per line:

  1. url – implicitly encoded as line number in the data set file
  2. relevance – low number=high relevance and vice versa
  3. query – represented as an id
  4. features – up to several hundreds

and the problem is to find a function that gives relevance numbers per url per query id.

Initial Observation
In dataset 1 there are ~473k URLs and ~19k queries. At first I thought this meant that there are on average 473/19 ~ 24 relevance numbers for each query (see actual distribution of counts in figure below), i.e. corresponding to search results 1 to 24, but it seems that several URLs for the same query can have the same relevance (e.g. URLx and URLy can both have relevance 2 for queryZ). The paper Learning to Rank with Ties seems potentially relevant for dealing with this.
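A minimal sketch of how one could check this, assuming the data has already been parsed into (query id, relevance) pairs, one per URL line. The parsing itself is left out since the file format is not described here, and the function name and toy data are made up for illustration:

```python
from collections import Counter, defaultdict

def tie_statistics(rows):
    """rows: iterable of (query_id, relevance) pairs, one per URL line."""
    urls_per_query = Counter()
    level_counts = defaultdict(Counter)  # query_id -> relevance -> number of URLs
    for query_id, relevance in rows:
        urls_per_query[query_id] += 1
        level_counts[query_id][relevance] += 1
    # Average number of URLs (relevance numbers) per query
    mean_urls = sum(urls_per_query.values()) / len(urls_per_query)
    # Queries where at least one relevance level is shared by two or more URLs
    tied_queries = sorted(
        q for q, levels in level_counts.items()
        if any(n > 1 for n in levels.values())
    )
    return mean_urls, tied_queries

# Toy data, made up for illustration (not taken from the challenge files)
toy_rows = [("q1", 1), ("q1", 2), ("q1", 2), ("q2", 1), ("q2", 3)]
mean_urls, tied_queries = tie_statistics(toy_rows)
print(mean_urls, tied_queries)
```

On the toy data, q1 has two URLs sharing relevance 2, so it is reported as a query with ties.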

Multiple URLs that share relevance for a unique query can perhaps be due to:

  1. similar/duplicate content between the URLs?
  2. a frequent query (due to sampling of examples?)
  3. uncertainty about which URL to select for a particular relevance and query?
  4. there is a tie, i.e. they are equally relevant

Potential classification approach?
From a classification perspective there are several (perhaps naive?) approaches that could be tried out:

  1. Use relevance levels as classes (nominal regression) and use a multiclass-classifier
  2. Train a classifier as a binary competition within each query, i.e. relevance 1 against 2, 3, .., and relevance n against n+1, .. (this will probably cause some sparsity problems)
  3. Binary competition across queries. This is problematic because a relevance of 4 for one query could be more relevant than a relevance of 1 for another query (and there is no easy way to determine that directly from the data). But if the observation about multiple URLs per relevance level per query (see above) is caused by uncertainty, one could perhaps use 1/(number of URLs per relevance level per query) as a weight to either:
    1. support training across queries, e.g. a URL with relevance 1 for a query where it is the only such URL is better than a URL with relevance 1 for another query that has 37 URLs of that relevance. This approach could perhaps be used somehow with regression? The problem is comparing across different relevance levels, e.g. is a relevance 2 for a query with 1 URL more confident than a relevance 1 for a query with 37 URLs?
    2. use a classifier that supports weighting examples, together with approach 1 or 2.
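To make approaches 2 and 3 a bit more concrete, here is a rough sketch of the 1/(number of URLs per relevance level per query) weight and of generating within-query training pairs. It assumes rows of (query id, url id, relevance) and keeps my reading that a lower number means higher relevance. The function names and toy rows are hypothetical, and this is only one possible way to set it up:

```python
from collections import Counter
from itertools import combinations

def tie_weights(rows):
    """rows: list of (query_id, url_id, relevance).
    Returns {(query_id, url_id): 1 / number of URLs sharing that level in that query}."""
    counts = Counter((q, rel) for q, _, rel in rows)
    return {(q, url): 1.0 / counts[(q, rel)] for q, url, rel in rows}

def within_query_pairs(rows):
    """Yield (query_id, better_url, worse_url) for URLs of the same query
    with different relevance (lower number = more relevant, as assumed above)."""
    by_query = {}
    for q, url, rel in rows:
        by_query.setdefault(q, []).append((url, rel))
    for q, items in by_query.items():
        for (u1, r1), (u2, r2) in combinations(items, 2):
            if r1 < r2:
                yield (q, u1, u2)
            elif r2 < r1:
                yield (q, u2, u1)
            # ties (r1 == r2) produce no training pair

# Toy rows, made up for illustration
rows = [("q1", "a", 1), ("q1", "b", 2), ("q1", "c", 2), ("q2", "d", 1)]
weights = tie_weights(rows)
pairs = list(within_query_pairs(rows))
print(weights)
print(pairs)
```

The weights could then be passed as per-example weights to any classifier that accepts them. Whether that actually helps is exactly the open question above.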

Conclusion
Still have more questions than answers, so next step is the learning to rank bibliography.

Feb 17

Hadoop is a set of open source technologies that support reliable and cost-efficient ways of dealing with large amounts of data. Given the vast amounts of business-critical and legally required data that companies gather (e.g. required due to Sarbanes–Oxley (SOX) or the EU Data Retention Directive), Hadoop becomes increasingly relevant.

Hadoop Technologies

Several Hadoop technologies are inspired by Google’s infrastructure.

1. Processing and Storage

1.1 Processing – Mapreduce
Mapreduce can be used to process and extract knowledge from arbitrary amounts of data, e.g. web data, measurement data or financial transactions – Visa reduced their processing time for transactional statistics from 1 month to 13 minutes with Hadoop. In order to use Mapreduce, developers need to parallelize their problem and program against an API – see here for an example of machine learning with Hadoop. Hadoop’s Mapreduce is inspired by the paper MapReduce: Simplified Data Processing on Large Clusters.
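To illustrate what "parallelizing the problem" means, here is a toy, single-process sketch of the map, shuffle and reduce steps for a word count. It does not use the Hadoop API. The function names are my own, and on a real cluster the map and reduce calls would run in parallel on different machines:

```python
from collections import defaultdict

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)

def run_job(documents):
    mapped = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

counts = run_job(["big data big clusters", "data everywhere"])
print(counts)
```

Each map call looks at only one record and each reduce call at only one key, and that independence is what lets the framework spread the work over many nodes.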

1.2 File Storage – HDFS
HDFS is a scalable and distributed file system. It supports a configurable degree of replication for reliable storage even when running on cheap hardware. HDFS is inspired by the paper The Google File System.

1.3 Database – HBase
HBase is a distributed database that runs on top of HDFS and supports storing billions of rows with millions of columns. HBase can replace traditional databases if they have problems scaling or become too expensive licence-wise, see this presentation about HBase. HBase is inspired by the paper Bigtable: A Distributed Storage System for Structured Data.

2. Data Analysis

Mapreduce can be used to analyze all kinds of data (e.g. text, multimedia, numerical data) and has high flexibility, but for more structured data the following Hadoop technologies can be used:

2.1 Pig
SQL-like language/system running on top of Mapreduce. Pig is developed by Yahoo and inspired by the paper Interpreting the Data: Parallel Analysis with Sawzall

2.2 Hive
Data warehouse running on top of Hadoop, developed by Facebook. The query language is very similar to SQL.

3. Distributed Systems Development

3.1 Avro
Avro is used for efficient serialization of data and communication between services. It is in several ways similar to Google’s Protocol Buffers and Facebook’s Thrift.

3.2 Zookeeper
Coordination between distributed processes. It is inspired by the paper The Chubby lock service for loosely-coupled distributed systems

3.3 Chukwa
Monitoring of distributed systems.

Oct 02

Location: Roosevelt Hotel NYC

09:11 – Christophe Bisciglia (Cloudera)

Announcement about BOFs HBASE and UI Birds of a Feather
Hadoop history overview
happenings during the last year: Hive, Pig, Sqoop (data import) ++
yesterday: Vertica announced mapreduce support for their database system
Walkthrough of Cloudera’s distribution for Hadoop
ANNOUNCEMENT: deploy Cloudera’s dist. for Hadoop on SoftLayer and Rackspace

09:23 – Jeff Hammerbacher (Cloudera)
started his career at Bear Stearns
- Cloudera is a software company with Apache Hadoop at the core
There is a lot more sw to be built:
1) collection,
2) processing
3) report and analysis

The Apache Hadoop community is the center of innovation for Big Data
- Yahoo pushing env. on scalability
- Large clusters for academic research (Yahoo, HP and Intel’s Open Cirrus)
- NSF, IBM and Google’s CLuE
- sigmod best paper award: Pig team from Yahoo
- worldwide – Hadoop world beijing

Cloudera Desktop
4 applications running on this desktop (inside the browser)
1) HDFS Web Interface
- file browser
2) Hadoop Mapreduce Web Interface (can potentially debug)
- Job Browser (Cluster Detail)
3) Cluster Health
- pulls in all kinds of metrics from a hadoop cluster
4) Job Designer
- makes it easier to use for non-tech users
note: available for free (can be locally modified), but cannot be redistributed
window manager based on MooTools

Cloudera Desktop API
- building a reusable API for dev. of desktop apps
- would like to capture innovation of ecosystem in a single interface
- desktop-api-s

0940 – Peter Sirota (Amazon, general manager Amazon Elastic Mapreduce – EMR)

motivation: large scale data processing has a lot of MUCK, wanted to fix that.

Use cases for EMR:
- data mining (log processing, clicks analysis)
- bioinformatics (genome analysis)
- financial simulation (monte carlo)
- file processing (resize jpegs, ocr) – a bit unexpected
- web indexing

Customer feedback:
Pros: easy to use and reliable
Challenges: require fluency in mapreduce, and hard to debug

New features:
support for Apache Pig (batch and interactive mode), August 2009
support for Apache Hive 0.4 (batch and interactive mode), TODAY
- extended language to support S3
- specify off-instance-metadata store
- optimized data writes to S3
- reference resources on S3

ANNOUNCEMENT TODAY – Karmasphere Studio for Hadoop – Netbeans IDE
- deploy hadoop jobs to EMR
- monitor progress of EMR job flows
- amazon S3 file browser

ANNOUNCEMENT TODAY – Support for Cloudera’s Hadoop distribution
- can specify Cloudera’s distribution (and get support from Cloudera)
- in private beta

0951 – Amazon EMR case – eHarmony – Carlos – Will present Use case for matchmaking system
data: 20 million users, 320 item questionnaire => big data
results: 2% of US marriages
Using Amazon, S3 and Elastic Mapreduce
Interesting with HIVE to do analysis

0958 – Amazon EMR IDE Support – Karmasphere IDE for Hadoop
works with all versions of Hadoop
tightly integrated with EMR (e.g. monitoring and files)

1005 – Eric Baldeschwieler – Yahoo
Largest contributor, tester and user of Hadoop
Hadoop is driving 2% of marriages in the US!
4 tiers of Hadoop clusters:
1) dev. testing and QA (10% of HW)
- continuous integration and testing
2) proof of concepts and ad-hoc work (10% of HW)
- run the latest version, currently 0.20
3) science and research (60% of HW)
- runs more stable versions, currently 0.20
4) production (20% of HW)
- the most stable version of Hadoop, currently 0.18.3

Yahoo has more than 25000 nodes with Hadoop (4000 nodes per cluster), 82 Petabytes of data.

Why Hadoop@Yahoo?
- 500M users, billions of “transactions”/day, Many petabytes of data
- analysis and data processing key to our business
- need to do this cost effectively
=> Hadoop provides solution to this

Previous job: chief architect for web search at Yahoo
Yahoo frontpage example (use of Hadoop):
- content optimization, search index, ads optimization, spam filters, rss feeds,

Webmap 2008-2009
- 70 hours runtime => 73 hours runtime
- 300TB shuffling => 490TB shuffling
- 200TB output => 280TB output (+55% HW, but more analysis)

Sort benchmark 2008-2009
- 1 terabyte 209 seconds => 62 seconds on 1500 nodes
- 1 petabyte sorted – 16.25 hours, 3700 nodes

Hadoop has Impact on productivity
- research questions answered in days, not months
- moved from research to prod easily

Major factors:
- don’t need to find new HW to experiment
- can work with all your data
- prod. and research on same framework
- no need for R&D to do IT, clusters just work

Search Assist (index for search suggest)
3 years of log-data, 20 steps of mapreduce
before hadoop: 26 days runtime (SMP box), C++, 2-3 weeks dev.time
after hadoop: 20 minutes runtime, python, 2-3 days dev.time
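
A quick sanity check on the Search Assist runtime numbers above. This is plain arithmetic on the figures from the talk, and the variable names are mine:

```python
MINUTES_PER_DAY = 24 * 60

before_minutes = 26 * MINUTES_PER_DAY  # 26 days runtime on the SMP box
after_minutes = 20                     # 20 minutes runtime on Hadoop

runtime_speedup = before_minutes / after_minutes
print(runtime_speedup)  # 1872.0, i.e. roughly 1900 times faster
```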

Current Yahoo Development
Hadoop:
- simplifies porting effort (between hadoop versions), freeze APIs, Avro
- GridMix3, Mumak simulator – for performance tuning
- quality engineering
Pig
- Pig – SQL and Metadata, Zebra – column-oriented storage access layer, Multi-query, lots of other optimizations
Oozie

1035 Rod Smith, IBM

Customer Scenarios
- BBC Digital Democracy project
- Thomson Reuters
- IBM Emerging Technology Projects: M2 (renamed later to M42)
- insight engine for ad-hoc business insights running ontop of Hadoop and Pig
- macro-support (e.g. extract patent information)
- collections (probably renamed to worksheets later)
- visualization (tag cloud)
- example 1: evaluate companies with patent information(1.4 million patents)
- using American Express as case study
- counting patent citations
- example 2: patents in litigation
- quote: “in god we trust, everybody else bring data”

1104 Ashish Thusoo – Facebook – Hive datawarehousing system

Hadoop
Pros: superior in availability/scalability/manageability, open system, scalable cost
Cons: programmability and metadata, mapreduce hard to program (users know sql/bash/python/perl), need to publish in well-known schemas
=> solution: Hive

Hive: Open and Extensible
- query your own formats and types with serializer/deserializer
- extend SQL functionality through user defined functions
- do any non-SQL TRANSFORM operator (e.g. embed Python)
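
My understanding of the TRANSFORM mechanism is that Hive streams rows to an external script as tab-separated lines on stdin and reads tab-separated lines back from stdout. Below is a minimal sketch of such a script. The two-column layout (user_id, url) and the domain extraction are made up for illustration and were not part of the talk:

```python
import io
import sys

def transform_line(line):
    """Turn one tab-separated input row (user_id, url) into (user_id, domain)."""
    user_id, url = line.rstrip("\n").split("\t")
    domain = url.split("//")[-1].split("/")[0]
    return f"{user_id}\t{domain}"

def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read rows from stdin and write transformed rows to stdout."""
    for line in stdin:
        if line.strip():
            stdout.write(transform_line(line) + "\n")

# Local demo with in-memory streams; when run as a script under Hive,
# one would instead call main() with the default stdin/stdout.
out = io.StringIO()
main(io.StringIO("42\thttp://example.com/page\n"), out)
result = out.getvalue()
print(repr(result))
```

On the Hive side such a script would be referenced from a SELECT TRANSFORM(...) USING '...' AS (...) clause. The table and column names depend on your own schema.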

Hive: Smart Execution Plans for Performance
- Hash-based Aggregations
- Map-Side Joins
- Predicate Pushdown
- Partition Pruning
- ++

Interoperability
- JDBC and ODBC interfaces available
- integrations with some traditional SQL tools (e.g. MicroStrategy for reports within Facebook) with some minor modifications
- ++

Hive Information
- subproject of Hadoop

– Data Warehousing @ Hadoop –

Data Flow Architecture at Facebook
web server logs -> Scribe -> filers (Hadoop clusters)
to save cost: Scribe/Hadoop integration
Federated MySQL also connected to the Production Hive/Hadoop Cluster
Connected to Oracle BAC and also replicated to an AdHoc Hive cluster

Showed a Picture of Yahoo cluster/datacenter :D

Dimensions:
4800 cores, 5.5 PB,

Statistics per day:
- 4TB compr.data/day
- 135TB scanned per day
- 7500 Hive jobs/day
- 80K compute hours per day

Hive Simplifies Hadoop:
- New engineers go through a Hive training session
- 200 people/month use it

Applications:
- reporting (daily/weekly aggregations of impression/click counts)
- measures of user engagement
- MicroStrategy dashboards

Ad hoc analysis
- how many group admins broken down by state/country

Machine learning (assembling training data)
- ad optimization
- e.g. user engagement as function of user attributes

Facebook Hive contributions
- Hive, HDFS features, Scheduler work
- Talks by Dhruba Borthakur and Zheng Shao in the dev. track

Q from audience: relation to Cassandra?
A: Cassandra serving live traffic,

Q from audience: when to use Pig or Hive?
A: Hive has more SQL support, but Pig is also getting more of that. Hive is
very intuitive. If you want interoperability (e.g. MicroStrategy) there are
advantages to using Hive. Pig has some nice primitives and
supports a more unstructured data model
