Hadoop World 2009 – some notes from the application session

Other recommended writeups:

Location: Roosevelt Hotel, NYC

1235 Joe Cunningham – Visa – Large scale transaction analysis
– responsible for Visa Technology Strategy and Innovation
– Visa has been playing with Hadoop for 9 months
– probably many in the audience are learning and just starting out with Hadoop

Agenda:
1) VisaNet overview
2) Value-added information products
3) Hadoop@Visa – research results

About Visa:
– $60 billion market cap
– well-known card products, and also behind-the-scenes information products
– the Visa brand carries high trust
– for a cardholder, a Visa card means global acceptance
– for a shop owner, a Visa payment approval means you will get paid

VisaNet
VisaNet is the largest, most advanced payment network in the world
characteristics:
28M locations,
130M authorizations/day,
1500 endpoints,
Processes transactions in under 1s
1.4M ATMs,
Processes in 175 currencies,
Less than 2s unavailability per year (!)
– by my calculation that is roughly seven 9s of availability (≈ 0.99999994); a quick check follows after this list
16300 financial institutions
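
A quick back-of-the-envelope check of that availability figure (my own arithmetic, not from the slides):

# Availability implied by less than 2 seconds of downtime per year
seconds_per_year = 365 * 24 * 3600          # 31,536,000
downtime_seconds = 2.0
availability = 1 - downtime_seconds / seconds_per_year
print("%.9f" % availability)                # 0.999999937 -> roughly seven 9s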

Visa Processing Architecture
Security/Access Services -> Message|File|Web
VisaNet Services Integration -> Authorization|Clearing&Settlement
Dispute handling, Risk, Information
Scoring every transaction (the score is used by the issuer to approve/decline the transaction)

Value added Info products
– Info services
Client: Portfolio Analysis, Visa Incentive Network
Accountholder: transaction alerts, account updater, tailored rewards
– Risk management services
Account monitoring
Authentication
Encryption

Hadoop@Visa
Visa runs a pipeline of prototypes in a lab facility in SF
Any technology taken into Visa needs to meet its scalability and reliability requirements

Research Lab Setup
– VM System:
Custom Analytic Stacks
Encryption Processing
Relational Database
– Hadoop Systems
Management Stack
Hadoop #1 ~40TB / 42 nodes (2 years of raw transaction data)
Hadoop #2 ~300TB / 28 nodes

Risk Product Use Case
Create critical data model elements, such as keys and transaction statistics, which feed our real-time risk-scoring systems
Input: Transactions – Merchant Category, Country/Zip
Output: Key & Statistics – MCCZIP Key – stats related to account, trans. type, approval, fraud, IP address etc.
Research Sample: 500M distinct accounts, 100M transactions per day, 200 bytes per transaction, 2 years – 73B transactions (36TB)
Processing time went from 1 month to 13 minutes! (note: ~3000 times faster)
(Synthetic transactions are generated to test the model)
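
To make the shape of such a job concrete, here is a minimal Hadoop Streaming sketch in Python that builds per-MCC/ZIP keys with a few aggregate statistics. The field layout, column order and statistic names are my own assumptions for illustration – this is not Visa's actual schema or code:

#!/usr/bin/env python
# mcczip_stats.py - illustrative Hadoop Streaming job (hypothetical schema, not Visa's code)
# Assumed tab-separated input: account_id, mcc, zip, amount, approved(0/1), fraud(0/1)
# Invocation sketch:
#   hadoop jar hadoop-streaming.jar -input txns/ -output mcczip_stats/ \
#     -mapper "python mcczip_stats.py map" -reducer "python mcczip_stats.py reduce" -file mcczip_stats.py
import sys

def run_mapper():
    for line in sys.stdin:
        account_id, mcc, zip_code, amount, approved, fraud = line.rstrip("\n").split("\t")
        # key = merchant category code + ZIP, value = per-transaction stats
        print("%s:%s\t%s,%s,%s" % (mcc, zip_code, amount, approved, fraud))

def run_reducer():
    current_key, count, volume, approvals, frauds = None, 0, 0.0, 0, 0
    def emit():
        if current_key is not None:
            print("%s\t%d\t%.2f\t%d\t%d" % (current_key, count, volume, approvals, frauds))
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        amount, approved, fraud = value.split(",")
        if key != current_key:
            emit()
            current_key, count, volume, approvals, frauds = key, 0, 0.0, 0, 0
        count += 1
        volume += float(amount)
        approvals += int(approved)
        frauds += int(fraud)
    emit()

if __name__ == "__main__":
    run_mapper() if sys.argv[1] == "map" else run_reducer()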

Financial Enterprise Fit
– key questions under research:
– what will the Hadoop Solution Stack(s) look like?
– File system, Transaction Sample System, Relational Back-end (integration path), Analytics Processing
– Internal vs external cloud
– How do I get data into a cloud in a secure way?
– How does HSM and security integration work in Hadoop?
– What are the missing pieces?

Why Hadoop@Visa?
– analyze volumes of data with response times that are not possible today
– requirement: need to fit with existing solutions

Cross Data Center Log Processing – Stu Hood, Rackspace

(Email and Apps division; works on the search team)

Agenda
Use Case Background
– “Rackapps” – hybrid mail hosting, 40% use a mix of Exchange and Rackspace mail

Use Case: Log Types

Use Case: Querying
– was the mail delivered?
– spam – why was it (not) marked as spam?
– access – who checked / failed to check mail?
more advanced questions:
– which delivery routes have the highest latency?
– which are the spammiest IPs?
– where in the world do customers log in from?
Elsewhere:
– billing

Previous Solutions
– 1999-2006 – go to where log files are generated, querying with grep
– 2006-2007 / bulk load to MySQL – worked for a year

Hadoop Solution
– V3 – Lucene indexes in Hadoop
– 2007 – present
– store 7 days uncompressed
– queries take seconds
– long-term queries with MapReduce (6 months of data available for MR queries)
– runs in all 3 datacenters

Alternatives considered:
– Splunk – good for realtime, but not great for archiving
– Data warehouse package – not realtime, but fantastic for longterm analysis
– Partitioned MySQL – a half-baked solution
=> Hadoop hit the sweet spot

Hadoop Implementation
SW
– collect data using syslog-ng (considering Scribe)
– storage: collectors deposit the logs into Hadoop (Scribe would remove that extra step)
HW
– 2-4 collector machines per datacenter
– hundreds of source machines
– 20 Solr nodes

Implementation: Indexing/Querying
– indexing – unique processing code per schema
– querying
– “realtime”
– sharded Lucene/Solr instances merge index chunks from Hadoop
– using the Solr API
– raw logs
– using Hadoop Streaming and Unix grep (see the sketch below)
– Mapreduce
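
For the raw-log path, a minimal sketch of such an ad-hoc Hadoop Streaming query in Python (the log path and pattern below are invented for illustration; a plain grep command can serve as the mapper just as well):

#!/usr/bin/env python
# grep_logs.py - map-only Streaming job: filter raw log lines matching a pattern
# Invocation sketch (paths are made up):
#   hadoop jar hadoop-streaming.jar -D mapred.reduce.tasks=0 \
#     -input /logs/raw/2009-10-02 -output /tmp/bounced \
#     -mapper "python grep_logs.py" -file grep_logs.py
import re
import sys

PATTERN = re.compile(r"status=bounced")   # hypothetical pattern: find bounced deliveries

for line in sys.stdin:
    if PATTERN.search(line):
        sys.stdout.write(line)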

Implementation: Timeframe
– development – 1.5 people for 3 months
– deployment – using Cloudera's distribution
– roadblocks – bumped into job-size limits

Have run close to 1 million jobs on the cluster, and it has not gone down (except for planned reasons such as maintenance)

Advantages – storage
– all storage in one place
Raw logs: 3 days, in HDFS
Indexes: 7 days
Archived Indexes: 6 months

Advantages – analysis
– Java Mapreduce API
– Apache Pig
– ideal for one-off queries
– Hadoop Streaming

Pig Example – whitehouse.gov mail spoofing

Advantages – Scalability, Cost, Community
– scalability – easy to add nodes
– cost – only hardware
– community – Cloudera has been a benefit, deployment is trivial

Data Processing for Financial Services – Peter Krey and Sin Lee, JP Morgan Chase

Innovation & Shared Services, Firmwide Engineering & Architecture

note: there are certain constraints on what can be shared due to regulations

JPMorgan Chase + Open Source
– Qpid (AMQP) – top-level Apache project
– Tyger – Apache + Tomcat + Spring

Hadoop in the Enterprise – Economics Driven
– attractive: economics
– Many big lessons from Web 2.0 community
– Potential for Large Capex and Opex “Dislocation”
– reduce consumption of enterprise premium resources
– grid computing economics brought to data intensive computing
– stagnant data innovation
– Enabling & potentially disruptive platform
– many historical similarities
– java, linux, tomcat, web/internet
– minis to client/server, client/server to web, solaris to linux, ..
– Key question: what can be built on top of Hadoop?
Back to economics driven – very cost-effective

Hadoop in the Enterprise – Choice Driven
– Overuse of relational database containers
– institutional “Muscle memory” – not much else to choose from
– an increasingly large percentage of static data is stored in proprietary transactional DBs
– Over-Normalized Schemas: do they still make sense with cheap compute & storage?

– Enterprise Storage “Prisoners”
– Captive to the economics & technology of “a few” vendors
– Developers need more choice
– Too much proprietary, single-source data infrastructure
– increasing need for minimal/no systems + storage admins

Hadoop in the Enterprise – Other Drivers
– Growing developer interest in “Reduced RDBMS” Data technologies
– open source, distributed, non-relational databases
– growing influence of Web 2.0 technologies & thinking on the enterprise
– hadoop, cassandra, hbase, hive, couchdb, hadoopDB, .. , others
– memcached for caching

FSI Industry Drivers
– Increased regulatory oversight + reporting = more data needed over a longer period of time
– triple data amounts from 2007 to 2009
– growing need for less expensive data repository/store
– increased need to support “one off” analysis on large data

Active POC Pipeline
– Growing stream of real projects to gauge hadoop “goodness of fit”
– broad spectrum of use cases
– driven by need to impact/dislocate OPEX+CAPEX
– looking for orders of magnitude
– evaluated on metric based performance, functional and economic measures
– avoid the “data falling on the floor” phenomenon
– tools are really really important, keep tools and programming models simple

Hadoop Positioning
– Latency x Storage amount curve

Cost comparisons
– SAN vs Hadoop HDFS cost comparison (GB/month)
– Hadoop much cheaper

Hadoop Additions and Must Haves:
– Improved SQL Front-End Tool Interoperability
– Improved Security & ACL enforcement – Kerberos Integration
– Grow Developer Programming Model Skill Sets
– Improve Relational Container Integration & Interop for Data Archival
– Management & Monitoring Tools
– Improved Developer & Debugging Tools
– Reduce Latency via integration with open source data caching
– memcached – others
– Invitation to FSI or Enterprise roundtable

Protein Alignment – Paul Brown, Booz Allen

Biological information
– Body – Cells – Chromosomes – Gene – DNA/RNA

Bioinformatics – The Pain
– too much data

So What? Querying a database of sequences for similar sequences
– one-to-many comparison
– 58000 proteins in PDB
– Protein alignment frequently used in the development of medicines
– Looking for a certain sequence across species, helps indicate function
Implementation in Hadoop
– distribute the database sequences across the nodes
– send the query sequence inside the MapReduce job (or via the distributed cache)
– scales well
– existing algorithms port easily

So What? Comparing sequences in bulk
– many-to-many
– DNA hybridization (reconstruction)
Ran on AWS
Hadoop:
– if the whole dataset fits on one computer:
– use the distributed cache, assign each node a piece of the list
– but if the dataset does not fit on one computer…
– pre-join all possible pairs with one MapReduce pass (sketched below)
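
A minimal sketch of what that pre-join step could look like with Hadoop Streaming in Python. The record format ("id<TAB>sequence") and the small ids.txt file shipped with -file are my own assumptions; the idea is just that every unordered pair of sequence ids ends up on one reducer key, where the actual comparison can run:

#!/usr/bin/env python
# allpairs_join.py - sketch: pre-join all sequence pairs in one MapReduce pass
# Assumptions (mine, not from the talk): input records are "id<TAB>sequence", and a
# small file listing all ids (ids.txt) is shipped to every node with "-file ids.txt",
# so each mapper knows the full id list even though the sequences themselves do not
# fit on one machine.
import sys

def run_mapper():
    all_ids = [l.strip() for l in open("ids.txt")]
    for line in sys.stdin:
        seq_id, seq = line.rstrip("\n").split("\t")
        for other in all_ids:
            if other == seq_id:
                continue
            # sort the pair so both members land on the same reducer key
            key = "%s|%s" % tuple(sorted([seq_id, other]))
            print("%s\t%s\t%s" % (key, seq_id, seq))

def compare(a, b):
    # placeholder similarity (longest common prefix); a real job would call an aligner here
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def run_reducer():
    current_key, pair = None, {}
    for line in sys.stdin:
        key, seq_id, seq = line.rstrip("\n").split("\t")
        if key != current_key:
            current_key, pair = key, {}
        pair[seq_id] = seq
        if len(pair) == 2:                       # both sequences of the pair have arrived
            a, b = pair.values()
            print("%s\t%d" % (key, compare(a, b)))

if __name__ == "__main__":
    run_mapper() if sys.argv[1] == "map" else run_reducer()

This blows the intermediate data up by roughly a factor of N, which is why the distributed-cache variant is preferred whenever the full list fits on a single node.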

So What? Analyzing really big sequences
– one big sequence compared against many small sequences
– scanning DNA for structure
– population genetics
– Hadoop implementation

Demonstration Implementation: Smith-Waterman Alignment
– one of the more computationally intensive matching and alignment techniques
– a big matrix (one sequence along the rows, the other along the columns, scores computed cell by cell)
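
For reference, the core of that matrix computation in plain Python – a minimal local-alignment scorer with a simple linear gap penalty (the scoring parameters are arbitrary example values; real aligners use substitution matrices such as BLOSUM and affine gaps):

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Fill the Smith-Waterman matrix and return the best local alignment score."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0,                  # local alignment never drops below zero
                          diag,               # match / mismatch
                          H[i - 1][j] + gap,  # gap in sequence b
                          H[i][j - 1] + gap)  # gap in sequence a
            best = max(best, H[i][j])
    return best

# example: smith_waterman_score("HEAGAWGHEE", "PAWHEAE") returns a small positive score

Each mapper can run a scorer like this for its share of the database sequences against the (distributed-cache) query sequence, which is why the approach scales out so naturally.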

Amazon implementation
– 250 machines
– EC2
– runs in 10 minutes for a single sequence; runs in 24 hrs for an NxN comparison
– cost: $40/hr

==> very cool 3D video of Amazon EC2 nodes
– showed a job failing due to 10% of the nodes being stuck on something (e.g. very long sequences)

Real-time Business Intelligence, Bradford Stephens

Topics
– Scalability and BI
– Costs and Abilities
– Search as BI

Tools: ZooKeeper, HBase, Katta (distributed search on Hadoop) and Bobo (faceted search for Lucene)
– http://sourceforge.net/projects/bobo-browse/
– http://sourceforge.net/projects/katta/develop

100TB of structured and unstructured data – Oracle $100M, Hadoop and Katta $250K

Building data cubes in real time (with faceted search)
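
For intuition only: a faceted count is simply a tally per value of each dimension over the documents matching a query. A toy Python illustration of what Bobo/Katta compute at scale over Lucene indexes (field names invented) – not how they implement it:

from collections import defaultdict

# toy documents standing in for indexed records (fields invented)
docs = [
    {"country": "US", "product": "mail", "status": "delivered"},
    {"country": "US", "product": "mail", "status": "bounced"},
    {"country": "DE", "product": "apps", "status": "delivered"},
]

def facet_counts(matching_docs, facet_fields):
    """Count matching documents per value for each facet field (one 'slice' of a data cube)."""
    counts = dict((field, defaultdict(int)) for field in facet_fields)
    for doc in matching_docs:
        for field in facet_fields:
            counts[field][doc[field]] += 1
    return dict((f, dict(c)) for f, c in counts.items())

print(facet_counts(docs, ["country", "status"]))
# {'country': {'US': 2, 'DE': 1}, 'status': {'delivered': 2, 'bounced': 1}}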

Real-time Mapreduce on HBase
Search/BI as a platform – “google my datawarehouse”

Counting, Clustering and other data tricks, Derek Gottfried, New York Times

back in 2007 – wanted to try out as many EC2 instances as possible

Problem
– freeing up historical archives of NYTimes.com (1851-1922)
(in the public domain)

Currently:
– 2009 – web analytics
3 big data buckets:
1) registration/demographics
2) articles 1851-today
– a lot of metadata about each article
– unique data: people, places, etc. extracted for each article => high-precision search
3) usage data / web logs
– the biggest piece – it keeps piling up

How do we merge the 3 datasets?

Using EC2 – 20 machines
Hadoop 0.20.0
12 TB of data
Straight MR in Java
(mostly java + postprocessing in python)

combining weblog data with demographic data, e.g. Twitter click-backs by age group
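
One standard way to merge datasets like these in plain MapReduce is a reduce-side join on the user id. A minimal Python Streaming sketch of just the reducer (the record layouts, tags and the age-group bucketing are invented for illustration – this is not NYT's actual schema):

#!/usr/bin/env python
# join_logs_demo.py - reducer for a reduce-side join of weblogs with registration data.
# Hypothetical tagged, tab-separated mapper output, grouped by user_id:
#   registration: "user_id<TAB>R<TAB>birth_year"
#   weblogs:      "user_id<TAB>L<TAB>url<TAB>referrer"
import sys

def age_group(birth_year, this_year=2009):
    decade = (this_year - birth_year) // 10 * 10
    return "%d-%d" % (decade, decade + 9)        # e.g. "30-39"

def run_reducer():
    current_user, records = None, []
    def flush():
        reg = [r for r in records if r[0] == "R"]
        if not reg:
            return                               # no registration record for this user
        group = age_group(int(reg[0][1]))
        for rec in records:
            if rec[0] == "L":
                url, referrer = rec[1], rec[2]
                print("%s\t%s\t%s" % (group, url, referrer))
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        user, rest = fields[0], fields[1:]
        if user != current_user:
            flush()
            current_user, records = user, []
        records.append(rest)
    flush()

if __name__ == "__main__":
    run_reducer()

Counting click-backs per age group is then a second, trivial aggregation pass (or can be folded into this reducer).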


Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce – info@atbrox.com. See our posting for an example parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce


Hadoop World 2009 – some notes from the morning session

Location: Roosevelt Hotel NYC

09:11 – Christophe Bisciglia (Cloudera)

Announcement about HBase and UI Birds-of-a-Feather (BOF) sessions
Hadoop history overview
happenings during the last year: Hive, Pig, Sqoop (data import) ++
yesterday: Vertica announced MapReduce support for their database system
Walkthrough of Cloudera's distribution for Hadoop
ANNOUNCEMENT: deploy Cloudera's distribution for Hadoop on SoftLayer and Rackspace

09:23 – Jeff Hammerbacher (Cloudera)
started his career at Bear Stearns
– Cloudera is a software company with Apache Hadoop at the core
There is a lot more software to be built:
1) collection
2) processing
3) reporting and analysis

The Apache Hadoop community is the center of innovation for Big Data
– Yahoo pushing the envelope on scalability
– Large clusters for academic research (Yahoo, HP and Intel's Open Cirrus)
– NSF, IBM and Google's CluE program
– SIGMOD best paper award for the Pig team from Yahoo
– worldwide – Hadoop World Beijing

Cloudera Desktop
4 applications running on this desktop (inside the browser)
1) HDFS Web Interface
– file browser
2) Hadoop Mapreduce Web Interface (can potentially debug)
– Job Browser (Cluster Detail)
3) Cluster Health
– pulls in all kinds of metrics from a hadoop cluster
4) Job Designer
– makes it easier to use for non-tech users
note: available for free (can be modified locally), but may not be redistributed
window manager based on MooTools

Cloudera Desktop API
– building a reusable API for developing desktop applications
– would like to capture the innovation of the ecosystem in a single interface
– desktop APIs

0940 – Peter Sirota (Amazon, general manager Amazon Elastic Mapreduce – EMR)

motivation: large scale data processing has a lot of MUCK, wanted to fix that.

Use cases for EMR:
– data mining (log processing, clicks analysis)
– bioinformatics (genome analysis)
– financial simulation (monte carlo)
– file processing (resize jpegs, ocr) – a bit unexpected
– web indexing

Customer feedback:
Pros: easy to use and reliable
Challenges: requires fluency in MapReduce, and it is hard to debug

New features:
support for Apache Pig (batch and interactive mode), August 2009
support for Apache Hive 0.4 (batch and interactive mode), TODAY
– extended language to support S3
– specify off-instance-metadata store
– optimized data writes to S3
– reference resources on S3

ANNOUNCEMENT TODAY – Karmasphere Studio for Hadoop – NetBeans IDE
– deploy hadoop jobs to EMR
– monitor progress of EMR job flows
– amazon S3 file browser

ANNOUNCEMENT TODAY – Support for Cloudera’s Hadoop distribution
– can specify Cloudera’s distribution (and get support from Cloudera)
– in private beta

0951 – Amazon EMR case – eHarmony – Carlos – will present a use case for their matchmaking system
data: 20 million users, 320-item questionnaire => big data
results: 2% of US marriages
Using Amazon S3 and Elastic MapReduce
Interested in using Hive for analysis

0958 – Amazon EMR IDE Support – Karmasphere IDE for Hadoop
works with all versions of Hadoop
tightly integrated with EMR (e.g. monitoring and files)

1005 – Eric Baldeschwieler – Yahoo
Largest contributor, tester and user of Hadoop
Hadoop is driving 2% of marriages in the US!
4 tiers of Hadoop clusters:
1) dev. testing and QA (10% of HW)
– continuous integration and testing
2) proof of concepts and ad-hoc work (10% of HW)
– run the latest version, currently 0.20
3) science and research (60% of HW)
– runs more stable versions, currently 0.20
4) production (20% of HW)
– the most stable version of Hadoop, currently 0.18.3

Yahoo has more than 25,000 nodes running Hadoop (4,000 nodes per cluster) and 82 petabytes of data.

Why Hadoop@Yahoo?
– 500M users, billions of “transactions”/day, Many petabytes of data
– analysis and data processing key to our business
– need to do this cost effectively
=> Hadoop provides solution to this

Previous job: chief architect for web search at Yahoo
Yahoo frontpage example (use of Hadoop):
– content optimization, search index, ads optimization, spam filters, rss feeds,

Webmap 2008-2009
– 70 hours runtime => 73 hours runtime
– 300TB shuffling => 490TB shuffling
– 200TB output => 280TB output (+55% HW, but more analysis)

Sort benchmark 2008-2009
– 1 terabyte 209 seconds => 62 seconds on 1500 nodes
– 1 petabyte sorted – 16.25 hours, 3700 nodes

Hadoop has Impact on productivity
– research questions answered in days, not months
– moved from research to prod easily

Major factors:
– don’t need to find new HW to experiment
– can work with all your data
– prod. and research on same framework
– no need for R&D to do IT, clusters just work

Search Assist (index for search suggest)
3 years of log-data, 20 steps of mapreduce
before hadoop: 26 days runtime (SMP box), C++, 2-3 weeks dev.time
after hadoop: 20 minutes runtime, python, 2-3 days dev.time

Current Yahoo Development
Hadoop:
– simplifies porting effort (between hadoop versions), freeze APIs, Avro
– GridMix3, Mumak simulator – for performance tuning
– quality engineering
Pig
– Pig – SQL and Metadata, Zebra – column-oriented storage access layer, Multi-query, lots of other optimizations
Oozie

1035 Rod Smith, IBM

Customer Scenarios
– BBC Digital Democracy project
– Thomson Reuters
– IBM Emerging Technology Projects: M2 (renamed later to M42)
– insight engine for ad-hoc business insights running on top of Hadoop and Pig
– macro-support (e.g. extract patent information)
– collections (probably renamed to worksheets later)
– visualization (tag cloud)
– example 1: evaluate companies with patent information (1.4 million patents)
– using American Express as case study
– counting patent citations
– example 2: patents in litigation
– quote: “in god we trust, everybody else bring data”

1104 Ashish Thusoo – Facebook – Hive datawarehousing system

Hadoop
Pros: superior in availability/scalability/manageability, open system, scalable cost
Cons: programmability and metadata; MapReduce is hard to program (users know SQL/bash/Python/Perl); need to publish data in well-known schemas
=> solution: Hive

Hive: Open and Extensible
– query your own formats and types with serializers/deserializers (SerDes)
– extend SQL functionality through user-defined functions (UDFs)
– run arbitrary non-SQL processing via the TRANSFORM operator (e.g. embedded Python; a small illustration follows below)
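
To illustrate that TRANSFORM hook: Hive streams the selected columns to an external script over stdin/stdout, tab-separated. A minimal example with a Python script (the table and column names here are invented):

#!/usr/bin/env python
# parse_useragent.py - used from Hive via TRANSFORM; reads tab-separated rows from
# stdin, writes tab-separated rows to stdout. Hypothetical HiveQL that calls it:
#
#   ADD FILE parse_useragent.py;
#   SELECT TRANSFORM (user_id, useragent)
#     USING 'python parse_useragent.py'
#     AS (user_id, browser)
#   FROM raw_pageviews;
#
import sys

for line in sys.stdin:
    user_id, useragent = line.rstrip("\n").split("\t")
    ua = useragent.lower()
    # crude classification, just to show the stdin/stdout contract
    if "chrome" in ua:
        browser = "chrome"
    elif "firefox" in ua:
        browser = "firefox"
    elif "safari" in ua:
        browser = "safari"
    else:
        browser = "other"
    print("%s\t%s" % (user_id, browser))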

Hive: Smart Execution Plans for Performance
– Hash-based Aggregations
– Map-Side Joins
– Predicate Pushdown
– Partition Pruning
– ++

Interoperability
– JDBC and ODBC interfaces available
– integrations with some traditional SQL tools (e.g. MicroStrategy for reports within Facebook) with some minor modifications
– ++

Hive Information
– subproject of Hadoop

— Data Warehousing @ Hadoop —

Data Flow Architecture at Facebook
web server logs -> Scribe -> filers (Hadoop clusters)
to save cost: Scribe/Hadoop integration
Federated MySQL is also connected to the production Hive/Hadoop cluster
Connected to Oracle RAC and also replicated to an ad-hoc Hive cluster

Showed a Picture of Yahoo cluster/datacenter 😀

Dimensions:
4800 cores, 5.5 PB,

Statistics per day:
– 4TB compressed data/day
– 135TB scanned per day
– 7500 Hive jobs/day
– 80K compute hours per day

Hive Simplifies Hadoop:
– New engineers go through a Hive training session
– 200 people/month use it

Applications:
– reporting (daily/weekly aggregations of impression/click counts)
– measures of user engagement
– MicroStrategy dashboards

Ad hoc analysis
– how many group admins broken down by state/country

Machine learning (assembling training data)
– ad optimization
– e.g. user engagement as function of user attributes

Facebook Hive contributions
– Hive, HDFS features, Scheduler work
– Talks by Dhruba Borthakur and Zheng Shao in the dev track

Q from audience: relation to Cassandra?
A: Cassandra is serving live traffic

Q from audience: when to use Pig or Hive?
A: Hive has more SQL support, but Pig is also getting more of that. Hive is very intuitive. If you want interoperability (e.g. with MicroStrategy), there are advantages to using Hive. Pig has some nice primitives and supports a more unstructured data model.


Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce – info@atbrox.com. See our posting for an example parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce


Mapreduce & Hadoop Algorithms in Academic Papers

The newest and most up-to-date version (May 2010) of this blog post is available at http://mapreducebook.org

An updated and extended version of this blog post can be found here.

Motivation
Learn from academic literature about how the mapreduce parallel model and hadoop implementation is used to solve algorithmic problems.

Disclaimer: this is work in progress (look for updates)

Input Data – Academic Papers
Google Scholar lists 981 papers citing the original MapReduce paper from 2004 – roughly 10 thousand pages of citing material (about the size of a typical encyclopedia)

What types of papers cite the mapreduce paper?

  1. Algorithmic papers
  2. General cloud overview papers
  3. Cloud infrastructure papers
  4. Future work sections in papers (e.g. “we plan to implement this with Hadoop”)

=> Looked at category 1 papers and skipped the rest

Who wrote the papers?

Search/Internet companies/organizations: eBay, Google, Microsoft, Wikipedia, Yahoo and Yandex.
IT companies: Hewlett Packard and Intel
Universities: Carnegie Mellon Univ., TU Dresden, Univ. of Pennsylvania, Univ. of Central Florida, National Univ. of Ireland, Univ. of Missouri, Univ. of Arizona, Univ. of Glasgow, Berkeley Univ. and National Tsing Hua Univ., Univ. of California, Poznan Univ.

Which areas do the papers cover?

Conclusion
Of the papers looked at, most focus on IT-related areas; there is a lot left unwritten in academia about MapReduce and Hadoop applied to algorithms in other business and technology areas.

Opportunities for following up this posting could be to: 1) describe the algorithms in more detail (e.g. input/output formats), 2) try to classify them by patterns (e.g. similar code structure), 3) offer the opportunity to simulate them in the browser (on toy-sized data sets) and 4) provide links to Hadoop implementations of them.


Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce – info@atbrox.com. See our posting for an example parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce


How to get pip/virtualenv/Fabric working on Cygwin

If you are new to virtualenv, Fabric or pip, Alex Clemesha's excellent “Tools of the Modern Python Hacker” is a must-read.

In short: virtualenv lets you switch seamlessly between isolated Python environments, Fabric automates remote deployment, while pip takes care of installing required packages and dependencies. If you have ever had to wrestle with more than one development project at the same time, then virtualenv is one of those tools that, once mastered, you can’t see yourself living without. Fabric and pip are somewhat immature, but still highly useful in their present shapes. It is likely that you will end up learning them anyway. Best of all, these three tools play very nicely together.

Except on Cygwin.

Here at Atbrox, we spend quite a lot of our time on Windows platforms. While Cygwin adds a fair amount of Unix functionality to Windows, configuring certain applications can be difficult. This article describes the steps we go through to get an operational virtualenv, Fabric and pip setup on Windows Vista. It also gives you a brief taste of how virtualenv and Fabric work.

Step 1 – Install Cygwin: If you haven’t already, Cygwin can be installed from this page. Click the “View” button once to get a full list of available packages. Make sure to include at least the following packages (the numbers in the parentheses indicate the versions used at the time of writing):

  • python (2.5.2-1)
  • python-paramiko (1.7.4-1)
  • python-crypto (2.0.1-1)
  • gcc (3.4.4-999)
  • wget (1.11.4-3)
  • openssh (5.1p1-10)

Now would also be a good time to install other common packages such as vim, git, etc.—but you can always go back and install them at a later time.

Note that we are using Cygwin Python rather than the standard Windows Python. I had nothing but trouble trying to get Windows Python to play nicely along with virtualenv and Fabric, so this is a compromise. The downside is that you are stuck with a rather dated and somewhat buggy version of Python. If someone manages to get this setup working with Windows Python, then let me know!

Step 2 – Get paramiko working: The python-paramiko and python-crypto packages are required to get Fabric deployment over SSH working properly. If you are lucky, paramiko should work out of the box. If you don’t get the following error message when importing paramiko then skip the rest of this step:

$ python
Python 2.5.2 (r252:60911, Dec  2 2008, 09:26:14)
[GCC 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import paramiko
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "__init__.py", line 69, in <module>
 File "transport.py", line 32, in <module>
 File "util.py", line 31, in <module>
 File "common.py", line 101, in <module>
 File "rng.py", line 69, in __init__
 File "randpool.py", line 87, in __init__
 File "randpool.py", line 120, in _randomize
IOError: [Errno 0] Error

According to the discussion here, this appears to be a lingering Cygwin bug. The workaround is to change line 120 in /usr/lib/python2.5/site-packages/Crypto/Util/randpool.py from


if num!=2 : raise IOError, (num, msg)

to

if num!=2 and num!=0 : raise IOError, (num, msg)

Paramiko should now import without any complaints.

Step 3 – Install setuptools: Setuptools are required for installing the rest of the required Python packages. Instructions for Cygwin are found on the setuptools pages—but just enter the following and you’ll be all set:

$ wget http://pypi.python.org/packages/2.5/s/setuptools/setuptools-0.6c9-py2.5.egg
$ sh setuptools-0.6c9-py2.5.egg

Step 4 – Install pip, virtualenv and virtualenvwrapper: We haven’t said anything about virtualenvwrapper so far. This extension to virtualenv streamlines working with multiple environments and is well recommended:

$ easy_install pip
$ easy_install virtualenv
$ easy_install virtualenvwrapper
$ mkdir ~/.virtualenvs

That last line creates a working directory for your virtual Python environments. When e.g. working with an environment named myenv, all packages will be installed in ~/.virtualenvs/myenv.

I find it useful to create and activate a default environment called sandbox. This helps prevent package installations to the default Python site-packages. It’s a good strategy in general to avoid polluting the main package directory so that almost all package installations are per project and virtual environment. Run the following commands to create the sandbox environment:

$ export WORKON_HOME=$HOME/.virtualenvs
$ export PIP_VIRTUALENV_BASE=$WORKON_HOME
$ source /usr/bin/virtualenvwrapper_bashrc
$ mkvirtualenv sandbox

mkvirtualenv is a virtualenvwrapper command that creates the given environment. If you get an IOError: [Errno 2] No such file or directory: '/usr/local/bin/python2.5' you will have to add a symbolic link to the Python executable:

$ ln -s /usr/bin/python2.5.exe /usr/bin/python2.5

Note that whenever you execute a shell command, the bash prompt will remind you of the active environment:

$ echo "foo"
foo
(sandbox)

To make the sandbox activation permanent, append the following lines to your ~/.bashrc:

export WORKON_HOME=$HOME/.virtualenvs
export PIP_VIRTUALENV_BASE=$WORKON_HOME
source /usr/bin/virtualenvwrapper_bashrc
workon sandbox

The workon is another virtualenvwrapper extension that switches you to the given environment. To get a full list of available environments, type workon without an argument. Other useful commands are deactivate to step out of the currently active environment, and rmvirtualenv to delete an environment. Refer to the virtualenvwrapper documentation for the whole story.

As a sanity check, try exiting and restarting the Cygwin shell. If you have paid attention so far, you should now automatically end up in the sandbox environment.

Step 5 – Install Fabric: From this point and on, all installed packages, including Fabric, will end up in a virtual environment. Fabric is undergoing a major rewrite right now, so given that its interface is quite unstable it is preferable to have a per-project installation anyway.

First we create a test environment named myproject:

$ mkvirtualenv myproject

We have to make some modifications to the Fabric source code, so we can’t use pip for installing it. Make sure to use version 0.9 or higher, as version 0.1 is already quite outdated:

$ mkdir ~/tmp
$ cd ~/tmp
$ wget http://git.fabfile.org/cgit.cgi/fabric/snapshot/fabric-0.9b1.tar.gz
$ tar xzf fabric-0.9b1.tar.gz
$ cd fabric-0.9b1

Fabric is run using the fab command, but if we try to install it as is, the following error might show up:

$ fab
Traceback (most recent call last):
 File "/home/brox/.virtualenvs/myproject/bin/fab", line 8, in <module>
   load_entry_point('Fabric==0.1.1', 'console_scripts', 'fab')()
 File "/home/brox/.virtualenvs/myproject/lib/python2.5/site-packages/setuptools
-0.6c9-py2.5.egg/pkg_resources.py", line 277, in load_entry_point
 File "/home/brox/.virtualenvs/myproject/lib/python2.5/site-packages/setuptools
-0.6c9-py2.5.egg/pkg_resources.py", line 2180, in load_entry_point
 File "/home/brox/.virtualenvs/myproject/lib/python2.5/site-packages/setuptools
-0.6c9-py2.5.egg/pkg_resources.py", line 1913, in load
 File "/home/brox/.virtualenvs/myproject/lib/python2.5/site-packages/fabric.py"
, line 53, in <module>
   import win32api
ImportError: No module named win32api

At the time of writing there is a small bug in Fabric that is likely to be fixed in the near future. For now you have to manually modify fabric/state.py before you install. Change the line that says

win32 = sys.platform in ['win32', 'cygwin']

to

win32 = sys.platform in ['win32']

This is just to tell Fabric that Cygwin isn’t really Windows and that the win32api module therefore isn’t available. Having made the necessary change, do a regular installation from source:

$ python setup.py install

The following error message about paramiko not being found might pop up; just ignore it:

No local packages or download links found for paramiko==1.7.4
error: Could not find suitable distribution for Requirement.parse('paramiko==1.7.4')

And that’s it! You should now have a fully functional virtualenv/Fabric/pip setup. To verify that Fabric works, create a file called fabfile.py:

from fabric.api import local, run

def local_test():
    local('echo "foo"')

def remote_test():
    run('uname -s')

This file, of course, only scratches the surface of what you can do with Fabric—refer to the latest documentation for more information.

To test the fabfile, type the following:

$ fab local_test
[localhost] run: echo "foo"

Done.

The biggest issue is that of getting Fabric to play along with your SSH installation so that you can deploy on remote servers. (You did install the openssh package, right?). Try the following command, substituting test@atbrox.com with one of your own accounts:

$ fab remote_test
No hosts found. Please specify (single) host string for connection: test@atbrox.com
[test@atbrox.com] run: uname -s
Password:
[test@atbrox.com] out: Linux

Done.
Disconnecting from test@atbrox.com... done.

The next step would be to set up password-less logins, but that is a different story.

Afterthoughts: While Cygwin is a lifesaver, it has some quirks and annoyances that may or may not be an issue depending on your system configuration. For instance, on my setup the following error tends to show up randomly when using Fabric for remote deployment:

sem_init: Resource temporarily unavailable
Traceback (most recent call last):
 File "build/bdist.cygwin-1.5.25-i686/egg/fabric/main.py", line 454, in main
 File "/cygdrive/c/Users/brox/workspace/quote_finder/fabfile.py", line 187, in
deploy
   _prepare_host_global()
 File "/cygdrive/c/Users/brox/workspace/quote_finder/fabfile.py", line 137, in
_prepare_host_global
   if not exists(u'/usr/bin/virtualenvwrapper_bashrc'):
 File "build/bdist.cygwin-1.5.25-i686/egg/fabric/contrib/files.py", line 32, in
exists
 File "/usr/lib/python2.5/contextlib.py", line 33, in __exit__
   self.gen.throw(type, value, traceback)
 File "/usr/lib/python2.5/contextlib.py", line 118, in nested
   yield vars
 File "build/bdist.cygwin-1.5.25-i686/egg/fabric/contrib/files.py", line 32, in
exists
 File "build/bdist.cygwin-1.5.25-i686/egg/fabric/network.py", line 371, in host
_prompting_wrapper
 File "build/bdist.cygwin-1.5.25-i686/egg/fabric/operations.py", line 422, in r
un
 File "channel.py", line 297, in recv_exit_status
 File "/usr/lib/python2.5/threading.py", line 368, in wait
   self.__cond.wait(timeout)
 File "/usr/lib/python2.5/threading.py", line 210, in wait
   waiter = _allocate_lock()
thread.error: can't allocate lock

This is a known problem that is not likely to go away anytime soon, due to an inherent race condition in Cygwin’s implementation of sem_init. Still, having a functional virtualenv/Fabric/pip environment on Windows is all in all pretty convenient.

There is a slew of useful articles out there if you need more information on the tools described in this article. These are my current favorites:


atbrox ready for business

We are here to help you:

  • Understand if and how the cloud can be cost-efficient in your setting
  • Efficiently analyze large data sets using the cloud
  • Architect, develop and deploy scalable and reliable software for the cloud
  • Adapt and migrate your existing data and software to the cloud

Technologies and methods we (non-exclusively) use:

Our motto is Simplicity, Automation and Scalability

If you are considering using cloud computing, please drop us a line to info (at) atbrox.com
