Hadoop World 2009 – some notes from application session

Location: Roosevelt Hotel, NYC

12:35 Joe Cunningham – Visa – Large-scale transaction analysis
– responsible for Visa Technology Strategy and Innovation
– Visa has been experimenting with Hadoop for 9 months
– noted that many in the audience are probably just learning and starting out with Hadoop

1) VisaNet overview
2) Value-added information products
3) Hadoop@Visa – research results

About Visa:
– 60 Billion market cap
– well-known card products, and also behind the scene information products
– Visa brand has high trust
– For a card-holder a Visa-card means global acceptance
– For a shop owner, a Visa payment approval means you will be paid

VisaNet is the largest, most advanced payment network in the world
28M locations,
130M authorizations/day,
1500 endpoints,
Processes transactions faster than 1s
1.4M ATMs,
Processes in 175 currencies,
Less than 2s unavailability per year (!)
– according to my calculations roughly seven 9s of availability (0.99999994)
16300 financial institutions
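The availability claim is easy to sanity-check with a few lines of arithmetic (assuming a 365-day year):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000
downtime_seconds = 2  # "less than 2s unavailability per year"

availability = 1 - downtime_seconds / SECONDS_PER_YEAR
print(f"{availability:.9f}")  # 0.999999937
```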

Visa Processing Architecture
Security/Access Services -> Message|File|Web
VisaNet Services Integration -> Authorization|Clearing&Settlement
Dispute handling, Risk, Information
Scoring every transaction (used for issuer to approve/decline transaction)

Value added Info products
– Info services
Client: Portfolio Analysis, Visa Incentive Network
Accountholder: transaction alerts, account updater, tailored rewards
– Risk management services
Account monitoring

Visa runs a pipeline of prototypes in a lab facility in SF
Any technology taken into Visa needs to meet its scalability and reliability requirements

Research Lab Setup
– VM System:
Custom Analytic Stacks
Encryption Processing
Relational Database
– Hadoop Systems
Management Stack
Hadoop #1 ~40TB / 42 nodes (2 years of raw transaction data)
Hadoop #2 ~300TB / 28 nodes

Risk Product Use Case
Create critical data model elements, such as keys and transaction statistics, which feed our real-time risk-scoring systems
Input: Transactions – Merchant Category, Country/Zip
Output: Key & Statistics – MCC+ZIP key – stats related to account, transaction type, approval, fraud, IP address, etc.
Research Sample: 500M distinct accounts, 100M transactions per day, 200 bytes per transaction, 2 years – 73B transactions (36TB)
Processing time went from 1 month to 13 minutes! (note: ~3000 times faster)
(Generate synthetic transactions used to test the model)
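Visa's actual pipeline was not shown, but the key-and-statistics job described above has the shape of a classic aggregation MapReduce. A minimal sketch, with all field names and sample records invented for illustration:

```python
from collections import defaultdict

# Invented records standing in for the transaction feed described above:
# (account, merchant category code, zip, amount, approved?)
transactions = [
    ("acct1", "5411", "10001", 42.50, True),
    ("acct2", "5411", "10001", 13.00, False),
    ("acct3", "5812", "94105", 88.25, True),
]

def mapper(txn):
    """Key each transaction by MCC+ZIP, as in the use case above."""
    account, mcc, zip_code, amount, approved = txn
    yield mcc + zip_code, (1, amount, 1 if approved else 0)

def reducer(key, values):
    """Fold the per-key values into simple statistics."""
    count = sum(v[0] for v in values)
    total = sum(v[1] for v in values)
    approved = sum(v[2] for v in values)
    return key, {"count": count, "total": total, "approval_rate": approved / count}

# Simulate the shuffle/sort phase locally:
groups = defaultdict(list)
for txn in transactions:
    for key, value in mapper(txn):
        groups[key].append(value)

stats = dict(reducer(k, vs) for k, vs in groups.items())
print(stats["541110001"])  # {'count': 2, 'total': 55.5, 'approval_rate': 0.5}
```

The emitted key/statistics pairs are what would then feed the real-time risk-scoring systems.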

Financial Enterprise Fit
– key questions under research:
– what will the Hadoop Solution Stack(s) look like?
– File system, Transaction Sample System, Relational Back-end (integration path), Analytics Processing
– Internal vs external cloud
– How do I get data into a cloud in a secure way?
– How does HSM and security integration work in Hadoop?
– What are the missing pieces?

Why Hadoop@Visa?
– analyze volumes of data with response times that are not possible today
– requirement: need to fit with existing solutions

Cross Data Center Log Processing – Stu Hood, Rackspace

(Email and apps division, work on search team)

Use Case Background
– “Rackapps” – hybrid mail hosting; 40% of customers use a mix of Exchange and Rackspace mail

Use Case: Log Types

Use Case: Querying
– was the mail delivered?
– spam – why was it (not) marked as spam
– access – who checked/failed to check mail?
more advanced questions:
– which delivery routes have the highest latency?
– which are the spammiest IPs?
– where in the world do customers log in from?
– billing

Previous Solutions
– 1999-2006 – go to where log files are generated, querying with grep
– 2006-2007 / bulk load to MySQL – worked for a year

Hadoop Solution
– V3 – lucene indexes in Hadoop
– 2007- present
– store 7 days uncompressed
– queries take seconds
– long-term queries with MapReduce (6 months of data available for MR queries)
– all 3 datacenters

Alternatives considered:
– Splunk – good for realtime, but not great for archiving
– Data warehouse package – not realtime, but fantastic for longterm analysis
– Partitioned MySQL – a half-baked solution
=> Hadoop hit the sweet spot

Hadoop Implementation
– collect data using syslog-ng (considering Scribe)
– storage: syslog-ng deposits into Hadoop (Scribe will replace that step)
– 2-4 collector machines per datacenter
– hundreds of source machines
– 20 Solr nodes

Implementation: Indexing/Querying
– indexing – unique processing code per schema
– querying
– “realtime”
– sharded Lucene/Solr instances merge index chunks from Hadoop
– queried via the Solr API
– raw logs
– queried using Hadoop Streaming and unix grep
– MapReduce
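The raw-log path above (Hadoop Streaming plus unix grep) boils down to a map-only filter. A minimal Python sketch of such a streaming mapper, with invented sample lines standing in for stdin:

```python
# Sketch of a grep-style Hadoop Streaming mapper; Rackspace's actual jobs
# used unix grep itself. A real job would read sys.stdin line by line and
# run map-only (i.e. with the Streaming reducer set to NONE).
def grep_mapper(lines, pattern):
    """Map-only filter: pass through log lines containing the pattern."""
    for line in lines:
        if pattern in line:
            yield line

sample_log = [
    "deliver to=bob@example.com status=sent",
    "reject from=203.0.113.9 reason=spam",
    "deliver to=alice@example.com status=sent",
]
print(list(grep_mapper(sample_log, "spam")))  # the single reject line
```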

Implementation: Timeframe
– development – 1.5 people for 3 months
– deployment – using Cloudera's distribution
– roadblocks – bumped into job-size limits

They have run close to 1 million jobs on the cluster, and it has not gone down (other than for external reasons such as maintenance)

Advantages – storage
– all storage in one place
Raw logs: 3 days, in HDFS
Indexes: 7 days
Archived Indexes: 6 months

Advantages – analysis
– Java MapReduce API
– Apache Pig
– ideal for one-off queries
– Hadoop Streaming

Pig Example – whitehouse.gov mail spoofing

Advantages – Scalability, Cost, Community
– scalability – easy to add nodes
– cost – only hardware
– community – Cloudera has been a benefit; deployment is trivial

Data Processing for Financial Services – Peter Krey and Sin Lee, JP Morgan Chase

Innovation & Shared Services, Firmwide Engineering & Architecture

note: certain constraints what can be shared due to regulations

JPMorgan Chase + Open Source
– Qpid (AMQP) – top-level Apache project
– Tyger – Apache + Tomcat + Spring

Hadoop in the Enterprise – Economics Driven
– attractive: economics
– Many big lessons from Web 2.0 community
– Potential for Large Capex and Opex “Dislocation”
– reduce consumption of enterprise premium resources
– grid computing economics brought to data intensive computing
– stagnant data innovation
– Enabling & potentially disruptive platform
– many historical similarities
– java, linux, tomcat, web/internet
– minis to client/server, client/server to web, solaris to linux, ..
– Key question: what can be built on top of Hadoop?
Back to economics driven – very cost-effective

Hadoop in the Enterprise – Choice Driven
– Overuse of relational database containers
– institutional “muscle memory” – there has not been much else to choose from
– an increasingly large percentage of static data is stored in proprietary transactional DBs
– over-normalized schemas: do they still make sense with cheap compute & storage?

– Enterprise Storage “Prisoners”
– Captive to the economics & technology of “a few” vendors
– Developers need more choice
– Too much proprietary, single-source data infrastructure
– increasing need for minimal/no systems + storage admins

Hadoop in the Enterprise – Other Drivers
– Growing developer interest in “Reduced RDBMS” Data technologies
– open source, distributed, non-relational databases
– growing influence of web 2.0 technologies & thinking on the enterprise
– hadoop, cassandra, hbase, hive, couchdb, hadoopDB, .. , others
– memcached for caching

FSI Industry Drivers
– Increased regulatory oversight + reporting = more data needed over longer periods of time
– data volumes tripled from 2007 to 2009
– growing need for less expensive data repository/store
– increased need to support “one off” analysis on large data

Active POC Pipeline
– Growing stream of real projects to gauge hadoop “goodness of fit”
– broad spectrum of use cases
– driven by need to impact/dislocate OPEX+CAPEX
– looking for orders of magnitude
– evaluated on metric based performance, functional and economic measures
– avoid the “data falling on the floor” phenomenon
– tools are really really important, keep tools and programming models simple

Hadoop Positioning
– latency vs. storage-volume curve

Cost comparisons
– SAN vs Hadoop HDFS cost comparison (GB/month)
– Hadoop much cheaper

Hadoop Additions and Must Haves:
– Improved SQL Front-End Tool Interoperability
– Improved Security & ACL enforcement – Kerberos Integration
– Grow Developer Programming Model Skill Sets
– Improve Relational Container Integration & Interop for Data Archival
– Management & Monitoring Tools
– Improved Developer & Debugging Tools
– Reduce Latency via integration with open source data caching
– memcached – others
– Invitation to FSI or Enterprise roundtable

Protein Alignment – Paul Brown, Booz Allen

Biological information
– Body – Cells – Chromosomes – Gene – DNA/RNA

Bioinformatics – The Pain
– too much data

So What? Querying a database of sequences for similar sequences
– one-to-many comparison
– 58000 proteins in PDB
– Protein alignment frequently used in the development of medicines
– Looking for a certain sequence across species, helps indicate function
Implementation in Hadoop
– distribute the database sequences across the nodes
– send the query sequence inside the MapReduce job (or via the distributed cache)
– scales well
– existing algorithms port easily

So What? Comparing sequences in bulk
– many-to-many
– DNA hybridization (reconstruction)
Ran on AWS
– if the whole dataset fits on one computer:
– use the distributed cache, assigning each node a piece of the list
– but if it does not fit on one computer:
– pre-join all possible pairs with one MapReduce job
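The pre-join step can be sketched as emitting every unordered pair of sequence IDs, which a later job then aligns independently (the IDs here are invented):

```python
from itertools import combinations

# Hypothetical sequence IDs; a real job would read them from HDFS.
seq_ids = ["seqA", "seqB", "seqC", "seqD"]

# One pass emits every unordered pair, n*(n-1)/2 of them, so the
# alignment job afterwards can process each pair independently.
pairs = list(combinations(seq_ids, 2))
print(len(pairs))  # 6
```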

So What? Analyzing really big sequences
– one big sequence to many small sequences
– scanning dna for structure
– population genetics
– Hadoop implementation

Demonstration Implementation: Smith-Waterman Alignment
– one of the more computationally intensive matching and alignment techniques
– builds a big matrix (the sequences to compare along the rows and columns, with alignment scores calculated within)
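A straightforward single-machine version of that matrix calculation, using common default scoring parameters as an assumption since the talk's parameters were not given:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Score two sequences with the Smith-Waterman local-alignment matrix:
    one sequence along the rows, one along the columns, each cell holding
    the best local-alignment score ending there (floored at zero)."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

# In the MapReduce version sketched earlier, each mapper would run this
# against its own shard of the database, scoring every entry vs. the query.
print(smith_waterman("ACACACTA", "AGCACACA"))
```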

Amazon implementation
– 250 machines
– EC2
– runs in 10 minutes for a single sequence; runs in 24 hrs for the full NxN comparison
– cost: $40/hr

==> very cool 3D video of Amazon EC2 nodes
– showed a job failing because 10% of the nodes were stuck on something (e.g. very long sequences)

Real-time Business Intelligence, Bradford Stephens

– Scalability and BI
– Costs and Abilities
– Search as BI

Tools: ZooKeeper, HBase, Katta (distributed search on Hadoop) and Bobo (faceted search for Lucene)
– http://sourceforge.net/projects/bobo-browse/
– http://sourceforge.net/projects/katta/develop

100TB of structured and unstructured data – Oracle: ~$100M; Hadoop and Katta: ~$250K

Building data cubes in real time (with faceted search)

Real-time MapReduce on HBase
Search/BI as a platform – “Google my data warehouse”
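A facet, in search terms, is just a per-field-value document count, which is the building block of the real-time data cubes mentioned above. A toy sketch with invented data:

```python
from collections import Counter

# Invented documents; in the talk these would be rows indexed in Katta/Bobo.
docs = [
    {"region": "US", "product": "mail"},
    {"region": "US", "product": "apps"},
    {"region": "EU", "product": "mail"},
]

# A facet counts documents per value of a field; computing one facet per
# field gives the cells of a simple real-time "data cube".
facets = {field: Counter(d[field] for d in docs) for field in ("region", "product")}
print(facets["region"])  # Counter({'US': 2, 'EU': 1})
```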

Counting, Clustering and other data tricks, Derek Gottfried, New York Times

back in 2007 – wanted to try out as many EC2 instances as possible

– freeing up historical archives of NYTimes.com (1851-1922)
(in the public domain)

– 2009 – web analytics
3 big data buckets:
1) registration/demographics
2) articles 1851-today
– a lot of metadata about each article
– unique data; people, places, etc. are extracted for each article => high-precision search
3) usage data/web logs
– biggest piece – piles up

How do we merge the 3 datasets?

Using EC2 – 20 machines
Hadoop 0.20.0
12 TB of data
Straight MR in Java
(mostly java + postprocessing in python)

combining weblog data with demographic data, e.g. Twitter click-backs by age group
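Merging the usage logs with the registration data amounts to a reduce-side join on user ID. A toy sketch (all field names and records invented; the NYT's actual code was not shown):

```python
from collections import defaultdict

# Invented stand-ins for two of the three buckets: usage logs and
# registration demographics, both keyed on a user ID.
logs = [("u1", "article/1851-gold-rush"), ("u2", "article/election")]
demographics = [("u1", {"age_group": "25-34"}), ("u2", {"age_group": "55-64"})]

def mapper(records, tag):
    """Tag each record with its source so the reducer can tell them apart."""
    for key, value in records:
        yield key, (tag, value)

# Simulate the shuffle: group all tagged values by user ID.
groups = defaultdict(list)
for key, value in list(mapper(logs, "log")) + list(mapper(demographics, "demo")):
    groups[key].append(value)

# Reduce: attach each user's demographics to each of their log entries.
joined = []
for user, values in groups.items():
    demo = next(v for tag, v in values if tag == "demo")
    for tag, v in values:
        if tag == "log":
            joined.append((v, demo["age_group"]))

print(sorted(joined))
```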

Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce – info@atbrox.com. See our posting for an example parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce
