Other recommended writeups :
- Hadoop World NYC (Hilary Mason)
- The View from HadoopWorld (Stephen O’Grady)
- Post Hadoop World Thoughts (Deepak Singh)
- Hadoop World, NYC 2009 (Dan Milstein)
- Hadoop World Impressions (Steve Laniel)
—
Location: Roosevelt Hotel, NYC
1235 Joe Cunningham – Visa – Large scale transaction analysis
– responsible for Visa Technology Strategy and Innovation
been playing with Hadoop for 9 months
probably many in audience learning and starting out with Hadoop
Agenda:
1) VisaNet overview
2) Value-added information products
3) Hadoop@Visa – research results
About Visa:
- 60 Billion market cap
- well-known card products, and also behind the scene information products
- Visa brand has high trust
- For a card-holder a Visa-card means global acceptance
- For a shopowner, if you get a Visa payment aproval you will be payed
VisaNet
VisaNet is the largest, most advanced payment network in the world
characteristics:
28M locations,
130M authorizations/day,
1500 endpoints,
Processes transactions faster than 1s
1.4M ATMs,
Processes in 175 currencies,
Less than 2s unavailability per year (!)
- according to my calculations six 9s (0.999999366)
16300 financial institutions
Visa Processing Architecture
Security/Access Services -> Message|File|Web
VisaNet Services Integration -> Authorization|Clearing&Settlement
Dispute handling, Risk, Information
Scoring every transaction (used for issuer to approve/decline transaction)
Value added Info products
- Info services
Client: Portfolio Analysis, Visa Incentive Network
Accountholder: transaction alerts, accoutnt updater, tailored rewards
- Risk management services
Account monitoring
Authentication
Encyption
Hadoop@Visa
Run a pipeline of prototypes in lab facility in SF
Any technology taken into Visa needs to match scalability and reliability requirements
Research Lab Setup
- VM System:
Custom Analytic Stacks
Encryption Processing
Relational Database
- Hadoop Systems
Management Stack
Hadoop #1 ~40TB / 42 nodes (2 years of raw transaction data)
Hadoop #2 ~300TB / 28 nodes
Risk Product Use Case
Create critical data model elements, such as keys and transaction statistics, which feed our real-time risk-scoring systems
Input: Transactions – Merchant Category, Country/Zip
Output: Key & Statistics – MCCZIP Key – stats related to account, trans. type, approval, fraud, IP address etc.
Research Sample: 500M distinct accounts, 100M transactions per day, 200 bytes per transaction, 2 years – 73B transaction (36TB)
Processing time from 1 month to 13 minutes! (note: ~3000 times faster)
(Generate synthetic transactions used to test the model)
Financial Enterprise Fit
- key questions under research:
- what will the Hadoop Solution Stack(s) look like?
- File system, Transaction Sample System, Relational Back-end (integration path), Analytics Processing
- Internal vs external cloud
- How do I get data into a cloud in a secure way.
- How does HSM and security integration work in Hadoop
- What are the missing pieces?
Why Hadoop@Visa?
- analyze volumes of data with response that are not possible today
- requirement: need to fit with existing solutions
Cross Data Center Log Processing – Stu Hood, Rackspace
(Email and apps division, work on search team)
Agenda
Use Case Backgound
- “Rackapps” – Hybrid Mail Hosting, 40% use a mix of exchange and rackspace mail
Use Case: Log Types
Use Case: Querying
- was the mail delivered?
- spam – why was it (not) marked as spam
- access – who checked/failed to check mail?
more advanced questions:
- which delivery routes have the highest latency?
- which are the spammiest IP?
- Where in the world do customers log in from
Elsewhere:
- billing
Previous Solutions
- 1999-2006 – go to where log files are generated, querying with grep
- 2006-2007 / bulk load to MySQL – worked for a year
Hadoop Solution
- V3 – lucene indexes in Hadoop
- 2007- present
- store 7 days uncompressed
- queries take seconds
- long term queries with mapreduce (6M avail for MR queries)
- all 3 datacenters
Alternatives considered:
- Splunk – good for realtime, but not great for archiving
- Data warehouse package – not realtime, but fantastic for longterm analysis
- Partioned MySQL – half-baked solution
=> Hadoop hit the sweet spot
Hadoop Implementation
SW
- collect data using syslog-ng (considering Scribe)
- storage: deposits into Hadoop (scribe will remove that)
HW
- 2-4 collector machines per datacenters
- hundreds of source machines
20 solr nodes
Implementation: Indexing/Querying
- indexing – uniqe processing code for schema
- querying
- “realtime”
- sharded lucene/solr instances merge-index chunk from Hadoop
- using Solr-API
- raw logs
- using Hadoop Streaming and unix grep
- Mapreduce
Implementation: Timeframe
- development – 1.5 people in 3 months
- deployments – using clouderas distribution
- roadblocks – bumped into job-size limits
Have run close to 1 million jobs on our cluster, and it has not gone down (except for other reasons such as maintenance)
Advantages – storage
- all storage in one place
Raw logs: 3 days, in HDFS
Indexes: 7 days
Archived Indexes: 6 months
Advantages – analysis
- Java Mapreduce API
- Apache Pig
- ideal for one-off queries
- Hadoop Streaming
Pig Example – whitehouse.gov mail spoofing
Advantages – Scalability, Cost, Community
- scalability – easy to add nodes
- cost – only hardware
- community – cloudera has been a benefit, deployment is trivial
Data Processing for Financial Services – Peter Krey and Sin Lee, JP Morgan Chase
Innovation & Shared Services, Firmwide Engineering & Architecture
note: certain constraints what can be shared due to regulations
JPMorgen Chase + Open Source
- QPD (AMQP) – top level apache project
- Tyger – Apache + Tomcat + Spring
Hadoop in the Enterprise – Economics Driven
- attractive: economics
- Many big lessons from Web 2.0 community
- Potential for Large Capex and Opex “Dislocation”
- reduce consumption of enterprise premium resources
- grid computing economics brought to data intensive computing
- stagnant data innovation
- Enabling & potentially disruptive platform
- many historical similarities
- java, linux, tomcat, web/internet
- minis to client/server, client/server to web, solaris to linux, ..
- Key question: what can be built on top of Hadoop?
Back to economics driven – very cost-effective
Hadoop in the Enterprise – Choice Driven
- Overuse of relational database containers
- institutional “Muscle memory” – not too much else to choose from
- increasingly large percentage of static data stored in proprietary transactional DBs
- Over-Normalized Schemas: still Makes sense with cheap compute&storage?
- Enterprise Storage “Prisoners”
- Captive to the economics & technology of “a few” vendors
- Developers need more choice
- Too much proprietary, single-source data infrastructure
- increasing need for minimal/no systems + storage admins
Hadoop in the Enterprise – Other Drivers
- Growing developer interest in “Reduced RDBMS” Data technologies
- open source, distributed, non-relational databases
- growing influence of web 2.0 technologies & thinking of enterprise
- hadoop, cassandra, hbase, hive, couchdb, hadoopDB, .. , others
- memcached for caching
FSI Industry Drivers
- Increased regularity oversight + reporting = More data needed over longer period of time
- triple data amounts from 2007 to 2009
- growing need for less expensive data repository/store
- increased need to support “one off” analysis on large data
Active POC Pipeline
- Growing stream of real projects to gauge hadoop “goodness of fit”
- broad spectrum of use cases
- driven by need to impact/dislocate OPEX+CAPEX
- looking for orders of magnitude
- evaluated on metric based performance, functional and economic measures
- avoid the “data falling on the floor phenomena”
- tools are really really important, keep tools and programming models simple
Hadoop Positiong
- Latency x Storage amount curve,
Cost comparisons
- SAN vs Hadoop HDFS cost comparison (GB/month)
- Hadoop much cheaper
Hadoop Additions and Must Haves:
- Improves SQL Front-End Tool Interoperability
- Improved Security & ACL enforcement – Kerberos Integration
- Grow Developer Programming Model Skill Sets
- Improve Relational Container Integration & Interop for Data Archival
- Management & Monitoring Tools
- Improved Developer & Debugging Tools
- Reduce Latency via integration with open source data caching
- memcached – others
- Invitation to FSI or Enterprise roundtable
Protein Alignment – Paul Brown, Booz Allen
Biological information
- Body – Cells – Chromosomes – Gene – DNA/RNA
Bioinformatics – The Pain
- too much data
So What? Querying a database of sequences for similar sequences
- one-to-many comparison
- 58000 proteins in PDB
- Protein alignment frequently used in the development of medicines
- Looking for a certain sequence across species, helps indicate function
Implementation in Hadoop
- distribute database sequence accross each node
- send query seq. inside Mapreduce (or dist.cache)
- scales well
- existing algorithms port easily
So What? Comparing sequences in bulk
- many-to-many
- DNA hybridiation (reconstruction)
Ran on AWS
Hadoop:
- if whole dataset fit into one computer
- Used distributed cache, assign each node a piece of the list
- But if the does not fit on one computer….
- pre-join all possible pairs with one MapReduce
So What? Analyzing really big sequences
- one big sequence to many small sequences
- scanning dna for structure
- population genetics
- hadoop implementatoin
Demonstration Implementation: Smith-Waterman Alignment
- one of the more computationally intensive matching and aligmnent techniques
- big matrix – (sequences to compare on row and column and calculations within)
Amazon implementation
- 250 machines
- E2
- run in 10 minutes for a single sequence. Runs in 24hrs for NxN comparison
- cost $40/hr
==> very cool 3D video of amazon ec2 nodes
- failing job due to 10% of nodes stuck on something (e.g. very long sequences)
Real-time Business Intelligence, Bradford Stephens
Topics
- Scalability and BI
- Costs and Abilities
- Search as BI
Tools: Zookeeper, Hbase, Katta (dist.search on Hadoop) and Bobo (faceted search for lucene)
- http://sourceforge.net/projects/bobo-browse/
- http://sourceforge.net/projects/katta/develop
100TB structured and unstructed data – Oracle 100M$, Hadoop and Katta 250K$
Building data cubes in real time (with faceted search)
Real-time Mapreduce on HBase
Search/BI as a platform – “google my datawarehouse”
Counting, Clustering and other data tricks, Derek Gottfried, New York Times
back in 2007 – would like to try as many EC2 instances as possible
Problem
- freeing up historical archives of NYTimes.com (1851-1922)
(in the public domain)
Currently:
- 2009 – web analytics
3 big data buckets:
1) registration/demographics
2) articles 1851-today
- a lot of metadata about each article
- unique data, extract people, places, .. to each article => high precision search
3) usage data/web logs
- biggest piece – piles up
How do we merge the 3 datasets?
Using EC2 – 20 machines
Hadoop 0.20.0
12 TB of data
Straight MR in Java
(mostly java + postprocessing in python)
combining weblog data with demographic data, e.g. twitter clicks backs by age group
Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce – info@atbrox.com. See our posting for an example parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce













October 5th, 2009 at 23:28
[...] Atbrox has notes from the morning session and the application session [...]