Apr 09

Wrote about mapreduce in search in a presentation for next week.

(more up-to-date pdf version of the presentation)

Best regards,
Amund
Atbrox

Feb 08

Atbrox is a startup providing technology and services for Search and Mapreduce/Hadoop. Our background is from Google, IBM and research.

Update 2010-June-17: Code for this posting is now on github – http://github.com/atbrox/Snabler

This posting gives an example of how to use Mapreduce, Python and Numpy to parallelize a linear machine learning classifier algorithm for Hadoop Streaming. It also discusses various hadoop/mapreduce-specific approaches to potentially improving or extending the example.

1. Background

Classification is an everyday task: it is about selecting one out of several outcomes based on their features, e.g.

  • When recycling garbage you select the bin based on the material, e.g. plastic, metal or organic.
  • When purchasing you select the store based e.g. on its reputation, prior experience, service, inventory and prices.

Computational Classification – Supervised Machine Learning – is quite similar, but requires (relatively) well-formed input data combined with classification algorithms.

1.1 Examples of classification problems

  • Finance/Insurance
    • Classify investment opportunities as good or not e.g. based on industry/company metrics, portfolio diversity and currency risk.
    • Classify credit card transactions as valid or invalid based e.g. on the location of the transaction and of the credit card holder, date, amount, purchased item or service, history of transactions and similar transactions
  • Biology/Medicine
  • Internet
  • Production Systems (e.g. in energy or petrochemical industries)
    • Classify and detect situations (e.g. sweet spots or risk situations) based on realtime and historic data from sensors

1.2 Classification Algorithms

Classification algorithms come in various types (e.g. linear, nonlinear, discriminative, etc.); see my prior postings Pragmatic Classification: The Very Basics and Pragmatic Classification of Classifiers.

Key takeaways about classifiers:

  1. There is no silver bullet classifier algorithm or feature extraction method.
  2. Classification algorithms tend to be computationally hard to train; this encourages a parallel approach, in this case with Hadoop/Mapreduce.

2. Parallel Classification for Hadoop Streaming

The classifier described below belongs to a family of classifiers which have in common that they can mathematically be described as Tikhonov regularization with a square loss function; this family includes Proximal SVM, Ridge Regression, Shrinkage Regression and Regularized Least-Squares Classification. (Note: if you replace the square loss function with a hinge loss function you get Support Vector Machine classification.) The implemented classifier – Proximal SVM – is from the paper Incremental Support Vector Machine Classification, referred to as "the paper" below.

2.1 Training data

The classifier assumes numerical training data, where each class is either -1.0 or +1.0 (negative or positive class), and features are represented as vectors of positive floating point numbers. In the algorithm below:

D - a diagonal matrix with the training classes (-1.0 or +1.0) on the diagonal, e.g. diag([-1.0, 1.0, 1.0, .. ])
A - a matrix with feature vectors, e.g. [[2.9, 3.3, 11.1, 2.4], .. ]
e - a vector filled with ones, e.g. [1.0, 1.0, .., 1.0]
E = [A -e]
mu - a scalar constant used to tune the classifier

2.2 The classifier algorithm

Training the classifier can be done with the right side of equation (13) from the paper:

(omega, gamma) = (I/mu + E.T*E).I*(E.T*D*e)

Classification of an incoming feature vector x can then be done by calculating:

x.T*omega - gamma

which returns a number, and the sign of the number corresponds to the class, i.e. positive or negative.
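
To make the two expressions above concrete, here is a minimal single-machine numpy sketch of training and classifying (the toy data and the mu value are made up for illustration; the real, parallel version follows below):

import numpy

# toy training data: 3 examples, 2 features each (made-up numbers)
A = numpy.matrix([[2.0, 3.0],
                  [1.0, 1.0],
                  [4.0, 5.0]])
classes = [1.0, -1.0, 1.0]
D = numpy.matrix(numpy.diag(classes))
e = numpy.matrix(numpy.ones((A.shape[0], 1)))
E = numpy.matrix(numpy.append(A, -e, axis=1))   # E = [A -e]
mu = 0.1                                        # tuning constant

# training: (omega, gamma) = (I/mu + E.T*E).I*(E.T*D*e)
I = numpy.matrix(numpy.eye(E.shape[1]))
result = (I/mu + E.T*E).I*(E.T*D*e)
omega, gamma = result[:-1], result[-1]

# classification of a new feature vector x: sign of x.T*omega - gamma
x = numpy.matrix([[3.0], [4.0]])
score = x.T*omega - gamma
print "class:", 1.0 if score > 0 else -1.0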

Parallelization of the classifier with Hadoop Streaming and Python

Expression (16) in the paper has a nice property: it supports increments (and decrements). In the paper's example there are 2 increments (and 2 decrements), but by induction there can be as many as you want:

(omega, gamma) = (I/mu + E_1.T*E_1 + .. + E_i.T*E_i).I*
                 (E_1.T*D_1*e + .. + E_i.T*D_i*e)

where

E.T*E = E_1.T*E_1 + .. + E_i.T*E_i

and

E.T*D*e = E_1.T*D_1*e + .. + E_i.T*D_i*e

This means that we can parallelize the calculation of E.T*E and E.T*D*e by having Hadoop mappers calculate each of the elements of the sums, as in the Python map() code below (sent to the reducers as tuples).

(figure: map() and reduce() dataflow – basic case)
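
A quick way to convince yourself of the incremental property is to split the rows of E (and the corresponding blocks of D and e) and check that the partial products sum to the full products; a small numpy sketch with made-up data:

import numpy

A = numpy.matrix(numpy.random.rand(6, 3))        # 6 examples, 3 features
e = numpy.matrix(numpy.ones((6, 1)))
D = numpy.matrix(numpy.diag([1.0, -1.0, 1.0, -1.0, 1.0, 1.0]))
E = numpy.matrix(numpy.append(A, -e, axis=1))

# split the rows into two "mapper" blocks of 3 training examples each
E_1, E_2 = E[:3], E[3:]
D_1, D_2 = D[:3, :3], D[3:, 3:]
e_1, e_2 = e[:3], e[3:]

# the per-block products sum to the full products
assert numpy.allclose(E.T*E, E_1.T*E_1 + E_2.T*E_2)
assert numpy.allclose(E.T*D*e, E_1.T*D_1*e_1 + E_2.T*D_2*e_2)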

2.3 The mapper

import base64
import pickle
import numpy

def map(key, value):
    # input key = class for one training example, e.g. "-1.0"
    classes = [float(item) for item in key.split(",")]   # e.g. [-1.0]
    D = numpy.diag(classes)

    # input value = feature vector for one training example, e.g. "3.0, 7.0, 2.0"
    featurematrix = [float(item) for item in value.split(",")]
    A = numpy.matrix(featurematrix)

    # create matrix E and vector e
    e = numpy.matrix(numpy.ones(len(A)).reshape(len(A), 1))
    E = numpy.matrix(numpy.append(A, -e, axis=1))

    # create a tuple with the values to be used by the reducer
    # and encode it with base64 to avoid potential trouble with '\t' and '\n'
    # used as default separators in Hadoop Streaming
    producedvalue = base64.b64encode(pickle.dumps((E.T*E, E.T*D*e)))

    # note: a single constant key "producedkey" sends everything to one reducer
    # somewhat "atypical" due to the low degree of parallelism on the reducer side
    print "producedkey\t%s" % (producedvalue)

2.4 The reducer

def reduce(key, values, mu=0.1):
    sumETE = None
    sumETDe = None

    # each item from the mapper arrives as a (key, value) pair;
    # the key isn't used, so it is ignored with _ (underscore)
    for _, value in values:
        # unpickle the (E.T*E, E.T*D*e) tuple from the mapper
        ETE, ETDe = pickle.loads(base64.b64decode(value))
        if sumETE is None:
            # initialize the sum with I/mu, which has the correct dimensions
            sumETE = numpy.matrix(numpy.eye(ETE.shape[1])/mu)
        sumETE += ETE

        if sumETDe is None:
            # initialize sumETDe with the correct dimensions
            sumETDe = ETDe
        else:
            sumETDe += ETDe

    # note: omega = result[:-1] and gamma = result[-1],
    # but the entire vector is printed as output
    result = sumETE.I*sumETDe
    print "%s\t%s" % (key, str(result.tolist()))

2.5 Mapper and Reducer Utility Code

Code used to run the map() and reduce() methods, inspired by the iterator/generator approach from this mapreduce tutorial.

import sys
from itertools import groupby
from operator import itemgetter

def read_input(file, separator="\t"):
    for line in file:
        yield line.rstrip().split(separator)

def run_mapper(map, separator="\t"):
    data = read_input(sys.stdin, separator)
    for (key, value) in data:
        map(key, value)

def run_reducer(reduce, separator="\t"):
    data = read_input(sys.stdin, separator)
    for key, values in groupby(data, itemgetter(0)):
        reduce(key, values)
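
To run this with Hadoop Streaming, the map() and reduce() functions still need small entry points that call run_mapper() and run_reducer(); a minimal sketch (the file name and command-line argument are assumptions for illustration, not part of the project code):

# hypothetical entry point, e.g. in a file mapreduce_proximal_svm.py:
# run with "python mapreduce_proximal_svm.py map" as the streaming mapper
# and "python mapreduce_proximal_svm.py reduce" as the streaming reducer
if __name__ == "__main__":
    if sys.argv[1] == "map":
        run_mapper(map)
    else:
        run_reducer(reduce)

The Hadoop Streaming job would then point its -mapper and -reducer options at this script with the respective argument.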

3. Finished?

Assume your running time goes through the roof even with the above parallel approach – what do you do?

3.1 Mapper Increment Size really makes a difference!

Since there is only one reducer in the presented implementation, it is useful to let the mappers do most of the job. The size of the increment matrices E.T*E and E.T*D*e given as input to the reducer is independent of the number of training examples, but dependent on the number of classification features. The workload on the reducer is, however, dependent on the number of matrices received from the mappers (i.e. on the increment size). E.g. if you have 1000 mappers, each processing one billion examples with 100 features each, the reducer would need to sum one trillion 101×101 matrices and one trillion 101×1 vectors if the mappers sent one matrix pair per training example; but if each mapper sent only one pair of E.T*E and E.T*D*e representing all of its billion training examples, the reducer would only need to sum 1000 matrix pairs.
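
A sketch of what such an accumulating mapper could look like, reusing read_input() and the imports from the code above (the structure is an assumption for illustration, not code from the project):

def run_accumulating_mapper(separator="\t"):
    # sum E.T*E and E.T*D*e over every training example seen by this mapper
    # and emit only a single pair when the input is exhausted
    sumETE, sumETDe = None, None
    for key, value in read_input(sys.stdin, separator):
        classes = [float(item) for item in key.split(",")]
        D = numpy.diag(classes)
        A = numpy.matrix([float(item) for item in value.split(",")])
        e = numpy.matrix(numpy.ones((len(A), 1)))
        E = numpy.matrix(numpy.append(A, -e, axis=1))
        if sumETE is None:
            sumETE, sumETDe = E.T*E, E.T*D*e
        else:
            sumETE += E.T*E
            sumETDe += E.T*D*e
    if sumETE is not None:
        print "producedkey\t%s" % base64.b64encode(pickle.dumps((sumETE, sumETDe)))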

3.2 Avoid stressing the reducer

Add more intermediate reducers (combiners) that calculate partial sums of matrices. When there are many small increments (and correspondingly many matrices), it can be useful to add an intermediate step that calculates sums of E.T*E and E.T*D*e in parallel before sending the sums to the final reducer; this means the final reducer gets fewer matrices to sum before calculating the final answer, see figure below.
(figure: flow with intermediate mapreduce step)
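
Since the partial sums have the same form as the mapper output, such an intermediate reducer (combiner) can be a stripped-down version of reduce() above that only sums and re-emits, without adding I/mu and without inverting; a sketch:

def combine(key, values):
    # partial-sum the (E.T*E, E.T*D*e) pairs and re-emit them in the same
    # encoded format, so the final reducer sees fewer, already-summed pairs
    sumETE, sumETDe = None, None
    for _, value in values:
        ETE, ETDe = pickle.loads(base64.b64decode(value))
        if sumETE is None:
            sumETE, sumETDe = ETE, ETDe
        else:
            sumETE += ETE
            sumETDe += ETDe
    print "%s\t%s" % (key, base64.b64encode(pickle.dumps((sumETE, sumETDe))))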

3.3 Parallelize (or replace) the matrix inversion in the reduction step

If someone comes along with a training data set with a very high feature dimension (e.g. in recommender systems, bioinformatics or text classification), the matrix inversion in the reducer can become a real bottleneck, since such algorithms typically are O(n^3) (with a lower bound of Omega(n^2 lg n)), where n is the number of features. A solution can be to use or develop a hadoop/mapreduce-based parallel matrix inversion, e.g. Apache Hama, or to not invert the matrix at all and solve the linear system directly instead.
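
For moderately sized feature dimensions, a cheaper and numerically more stable alternative to explicit inversion is to solve the linear system directly; the last step of the reducer could, for example, be replaced with numpy.linalg.solve:

# instead of explicitly inverting:
#   result = sumETE.I*sumETDe
# solve the linear system (I/mu + E.T*E) * result = E.T*D*e directly:
result = numpy.linalg.solve(sumETE, sumETDe)

This avoids forming the explicit inverse and is typically faster and more accurate for a single right-hand side.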

3.4 Feature Dimensionality Reduction

Another approach for training data with a high feature dimension could be to reduce the feature dimensionality; for more info check out Latent Semantic Indexing (and Analysis), Singular Value Decomposition, or t-Distributed Stochastic Neighbor Embedding.
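
As a rough illustration of the SVD route, the feature matrix A can be projected down to k dimensions before building E; a minimal numpy sketch (k and the data are made up):

import numpy

A = numpy.matrix(numpy.random.rand(100, 50))   # 100 examples, 50 features
k = 10                                         # target dimensionality

# thin SVD: A = U * diag(s) * Vt
U, s, Vt = numpy.linalg.svd(A, full_matrices=False)

# keep the k strongest components and project the training data onto them
A_reduced = A*numpy.matrix(Vt[:k]).T           # now 100 x k
print A_reduced.shape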

3.5 Reduce IO between mappers and reducers with compression

Twitter presented using LZO compression (on the Cloudera blog) to speed up Hadoop. Inspired by this, in the case of high feature dimension, i.e. large E.T*E and E.T*D*e matrices, one could compress the output in the mapper and decompress it in the reducer by replacing the base64 encoding/decoding and pickling above with:

producedvalue = base64.b64encode(lzo.compress(pickle.dumps((E.T*E, E.T*D*e)), level=1))

and

ETE, ETDe = pickle.loads(lzo.decompress(base64.b64decode(value)))

3.6 Do more work with approximately the same computing resources

The D matrix above represents binary classification, with a value of +1 or -1 representing each class. It is quite common to have classification problems with more than 2 classes. Supporting multiple classes is usually done by training several classifiers, either 1-against-all (1 classifier trained per class) or 1-against-1 (1 classifier trained per unique pair of classes), and then running a tournament between them and picking the most confident. In the case of 1-against-all classification the mapper could probably send multiple E.T*D_c*e – with one D_c per class – and keep the same E.T*E; the reducer would then need to calculate (I/mu + E.T*E).I once and independently multiply it with the several E.T*D_c*e sums to create a set of (omega, gamma) classifiers. For 1-against-1 classification it becomes somewhat more complicated, because it involves creating several E matrices, since in the 1-against-1 case only the rows in E where the 2 competing classes occur are relevant.
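
A sketch of how the final reduction step could look for 1-against-all with several classes, reusing a single inversion of (I/mu + E.T*E) for all classes (the function and its inputs are assumptions for illustration, not code from the project):

def train_one_against_all(sumETE, sumETDe_per_class, mu=0.1):
    # sumETE: the summed E.T*E matrix (without I/mu added yet)
    # sumETDe_per_class: dict mapping class label -> summed E.T*D_c*e vector
    lhs = numpy.matrix(numpy.eye(sumETE.shape[1])/mu) + sumETE
    inv = lhs.I                     # invert once, reuse for every class
    classifiers = {}
    for label, sumETDe in sumETDe_per_class.items():
        result = inv*sumETDe
        classifiers[label] = (result[:-1], result[-1])   # (omega_c, gamma_c)
    return classifiers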

4. Code

(Early) Python code for the algorithm presented above can be found at http://github.com/atbrox/Snabler (open source with Apache License). Please let me know if you want to contribute to the project, e.g. with implementations of mapreduce and hadoop algorithms from academic papers.

5. More resources about machine learning with Hadoop/Mapreduce?

Atbrox on LinkedIn

Best regards,

Amund Tveit, co-founder of Atbrox


Nov 11

Elastic Mapreduce's default behavior is to read from and store to S3. When you need to access other AWS services, e.g. SQS queues or the database services SimpleDB and RDS (MySQL), the best approach from Python is to use Boto. To get Boto to work with Elastic Mapreduce you need to dynamically load Boto on each mapper and reducer. Cloudera's Jeff Hammerbacher outlined how to do that using Hadoop Distributed Cache, and Peter Skomoroch suggested how to load Boto to access Elastic Blockstore (EBS); this posting is based on those ideas and gives a detailed description of how to do it.

How to combine Elastic Mapreduce with other AWS Services

This posting shows how to load boto in an Elastic Mapreduce mapper and gives a simple example of how to use SimpleDB from the same mapper. For accessing other AWS services, e.g. SQS, from Elastic Mapreduce, check out the Boto documentation (it is quite easy once the boto + emr integration is in place).

Other tools used (prerequisites):

Step 1 – getting and preparing the Boto library

wget http://boto.googlecode.com/files/boto-1.8d.tar.gz
# note: using virtualenv can be useful if you want to
# keep your local Python installation clean
tar -zxvf boto-1.8d.tar.gz ; cd boto-1.8d ; python setup.py install
cd /usr/local/lib/python2.6/dist-packages/boto-1.8d-py2.6.egg
zip -r boto.mod boto

Step 2 – mapper that loads boto.mod and uses it to access SimpleDB

# this was tested by adding the code underneath to the mapper
# s3://elasticmapreduce/samples/wordcount/wordSplitter.py
import sys

# get the boto library
sys.path.append(".")
import zipimport
importer = zipimport.zipimporter('boto.mod')
boto = importer.load_module('boto')

# access simpledb
sdb = boto.connect_sdb("YourAWSKey", "YourSecretAWSKey")
sdb_domain = sdb.create_domain("mymapreducedomain") # or sdb.get_domain()
# ..
# write words to simpledb, inside the mapper's existing loop over input lines
  for word in pattern.findall(line):
      item = sdb_domain.create_item(word)
      item["reversedword"] = word[::-1]
      item.save()
      # ...

Step 3 – json config file – bototest.json – for Elastic Mapreduce Ruby Client

[
  {
    "Name": "Step 1: testing boto with elastic mapreduce",
    "ActionOnFailure": "<action_on_failure>",
    "HadoopJarStep": {
      "Jar": "/home/hadoop/contrib/streaming/hadoop-0.18-streaming.jar",
      "Args": [
        "-input", "s3n://elasticmapreduce/samples/wordcount/input",
        "-output", "s3n://yours3bucket/result",
        "-mapper", "s3://yours3bucket/botoWordSplitter.py",
        "-cacheFile", "s3n://yours3bucket/boto.mod#boto.mod"
      ]
    }
  }
]

Step 4 – Copy necessary files to s3

s3cmd put boto.mod s3://yours3bucket
s3cmd put botoWordSplitter.py s3://yours3bucket

Step 5 – And run your Elastic Mapreduce job

 elastic-mapreduce --create \
                   --stream \
                   --json bototest.json \
                   --param "<action_on_failure>=TERMINATE_JOB_FLOW"

Conclusion
This showed how to dynamically load boto and use it to access one other AWS service – SimpleDB – from Elastic Mapreduce. Boto supports most AWS services, so the same integration approach should also work for other AWS services, e.g. SQS (Queuing Service), RDS (MySQL Service) and EC2; check out the Boto API documentation or Controlling the Cloud with Python for details.
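
As an illustration, a mapper that has loaded boto with the zipimport trick from step 2 could talk to SQS roughly like this (a minimal, untested sketch; the queue name and message body are made up, and the calls assume the boto SQS API of that generation):

# assumes boto has been loaded via zipimport as in step 2
sqs = boto.connect_sqs("YourAWSKey", "YourSecretAWSKey")
queue = sqs.create_queue("mymapreducequeue")   # or get_queue()

# send one message per word instead of (or in addition to) writing to SimpleDB
message = queue.new_message("some word from the mapper")
queue.write(message)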

Note: a very similar integration approach should work for most Python libraries, also those that use/wrap C/C++ code (e.g. machine learning libraries such as PyML and others), but then it might be necessary to do step 1 on Debian AMIs similar to the ones Elastic Mapreduce uses; check out a previous posting for more info about such AMIs.


Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce – info@atbrox.com. See our posting for an example of parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce.

Oct 01

The newest and most up-to-date version (May 2010) of this blog post is available at http://mapreducebook.org

An updated and extended version of this blog post can be found here.

Motivation
Learn from the academic literature how the mapreduce parallel model and the hadoop implementation are used to solve algorithmic problems.

Disclaimer: this is work in progress (look for updates)

Input Data – Academic Papers
Google Scholar lists 981 papers citing the original Mapreduce paper from 2004 – approximately 10 thousand pages of citing papers (~ the size of a typical encyclopedia).

What types of papers cite the mapreduce paper?

  1. Algorithmic papers
  2. General cloud overview papers
  3. Cloud infrastructure papers
  4. Future work sections in papers (e.g. “we plan to implement this with Hadoop”)

=> Looked at category 1 papers and skipped the rest

Who wrote the papers?

Search/Internet companies/organizations: eBay, Google, Microsoft, Wikipedia, Yahoo and Yandex.
IT companies: Hewlett Packard and Intel
Universities: Carnegie Mellon Univ., TU Dresden, Univ. of Pennsylvania, Univ. of Central Florida, National Univ. of Ireland, Univ. of Missouri, Univ. of Arizona, Univ. of Glasgow, Berkeley Univ. and National Tsing Hua Univ., Univ. of California, Poznan Univ.

Which areas do the papers cover?

Conclusion
Of the papers looked at, most are focused on IT-related areas; there is a lot left unwritten in academia about mapreduce and hadoop applied to algorithms in other business and technology areas.

Opportunities for following up this posting could be to: 1) describe the algorithms in more detail (e.g. input/output formats), 2) try to classify them by patterns (e.g. similar code structure), 3) offer the opportunity to simulate them in the browser (on toy-sized data sets), and 4) provide links to Hadoop implementations of them.


Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce – info@atbrox.com. See our posting for an example of parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce.

Sep 21

If you are new to virtualenv, Fabric or pip, Alex Clemesha’s excellent “Tools of the Modern Python Hacker” is a must-read.

In short: virtualenv lets you switch seamlessly between isolated Python environments, Fabric automates remote deployment, while pip takes care of installing required packages and dependencies. If you have ever had to wrestle with more than one development project at the same time, then virtualenv is one of those tools that, once mastered, you can’t see yourself living without. Fabric and pip are somewhat immature, but still highly useful in their present shapes. It is likely that you will end up learning them anyway. Best of all, these three tools play very nicely together.

Except on Cygwin.

Here at Atbrox, we spend quite a lot of our time on Windows platforms. While Cygwin adds a fair amount of unix functionality to Windows, configuring certain applications can be difficult. This article describes the steps we go through to get an operational virtualenv, Fabric and pip setup on Windows Vista. It also gives you a brief taster of how virtualenv and Fabric work.

Step 1 – Install Cygwin: If you haven’t already, Cygwin can be installed from this page. Click the “View” button once to get a full list of available packages. Make sure to include at least the following packages (the numbers in the parentheses indicate the versions used at the time of writing):

  • python (2.5.2-1)
  • python-paramiko (1.7.4-1)
  • python-crypto (2.0.1-1)
  • gcc (3.4.4-999)
  • wget (1.11.4-3)
  • openssh (5.1p1-10)

Now would also be a good time to install other common packages such as vim, git, etc.—but you can always go back and install them at a later time.

Note that we are using Cygwin Python rather than the standard Windows Python. I had nothing but trouble trying to get Windows Python to play nicely along with virtualenv and Fabric, so this is a compromise. The downside is that you are stuck with a rather dated and somewhat buggy version of Python. If someone manages to get this setup working with Windows Python, then let me know!

Step 2 – Get paramiko working: The python-paramiko and python-crypto packages are required to get Fabric deployment over SSH working properly. If you are lucky, paramiko should work out of the box. If you don’t get the following error message when importing paramiko then skip the rest of this step:

$ python
Python 2.5.2 (r252:60911, Dec  2 2008, 09:26:14)
[GCC 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import paramiko
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "__init__.py", line 69, in <module>
 File "transport.py", line 32, in <module>
 File "util.py", line 31, in <module>
 File "common.py", line 101, in <module>
 File "rng.py", line 69, in __init__
 File "randpool.py", line 87, in __init__
 File "randpool.py", line 120, in _randomize
IOError: [Errno 0] Error

According to the discussion here, this appears to be a lingering Cygwin bug. The workaround is to change line 120 in /usr/lib/python2.5/site-packages/Crypto/Util/randpool.py from


if num!=2 : raise IOError, (num, msg)

to

if num!=2 and num!=0 : raise IOError, (num, msg)

Paramiko should now import without any complaints.

Step 3 – Install setuptools: Setuptools are required for installing the rest of the required Python packages. Instructions for Cygwin are found on the setuptools pages—but just enter the following and you’ll be all set:

$ wget http://pypi.python.org/packages/2.5/s/setuptools/setuptools-0.6c9-py2.5.egg
$ sh setuptools-0.6c9-py2.5.egg

Step 4 – Install pip, virtualenv and virtualenvwrapper: We haven’t said anything about virtualenvwrapper so far. This extension to virtualenv streamlines working with multiple environments and is well recommended:

$ easy_install pip
$ easy_install virtualenv
$ easy_install virtualenvwrapper
$ mkdir ~/.virtualenvs

That last line creates a working directory for your virtual Python environments. When e.g. working with an environment named myenv, all packages will be installed in ~/.virtualenvs/myenv.

I find it useful to create and activate a default environment called sandbox. This helps prevent package installations to the default Python site-packages. It’s a good strategy in general to avoid polluting the main package directory so that almost all package installations are per project and virtual environment. Run the following commands to create the sandbox environment:

$ export WORKON_HOME=$HOME/.virtualenvs
$ export PIP_VIRTUALENV_BASE=$WORKON_HOME
$ source /usr/bin/virtualenvwrapper_bashrc
$ mkvirtualenv sandbox

mkvirtualenv is a virtualenvwrapper command that creates the given environment. If you get an IOError: [Errno 2] No such file or directory: '/usr/local/bin/python2.5' you will have to add a symbolic link to the Python executable:

$ ln -s /usr/bin/python2.5.exe /usr/bin/python2.5

Note that whenever you execute a shell command, the bash prompt will remind you of the active environment:

$ echo "foo"
foo
(sandbox)

To make the sandbox activation permanent, append the following lines to your ~/.bashrc:

export WORKON_HOME=$HOME/.virtualenvs
export PIP_VIRTUALENV_BASE=$WORKON_HOME
source /usr/bin/virtualenvwrapper_bashrc
workon sandbox

workon is another virtualenvwrapper command that switches you to the given environment. To get a full list of available environments, type workon without an argument. Other useful commands are deactivate to step out of the currently active environment, and rmvirtualenv to delete an environment. Refer to the virtualenvwrapper documentation for the whole story.

As a sanity check, try exiting and restarting the Cygwin shell. If you have paid attention so far, you should now automatically end up in the sandbox environment.

Step 5 – Install Fabric: From this point on, all installed packages, including Fabric, will end up in a virtual environment. Fabric is undergoing a major rewrite right now, so given that its interface is quite unstable it is preferable to have a per-project installation anyway.

First we create a test environment named myproject:

$ mkvirtualenv myproject

We have to make some modifications to the Fabric source code, so we can’t use pip for installing it. Make sure to use version 0.9 or higher, as version 0.1 is already quite outdated:

$ mkdir ~/tmp
$ cd ~/tmp
$ wget http://git.fabfile.org/cgit.cgi/fabric/snapshot/fabric-0.9b1.tar.gz
$ tar xzf fabric-0.9b1.tar.gz
$ cd fabric-0.9b1

Fabric is run using the fab command, but if we install it as is and try to run it, the following error might show up:

$ fab
Traceback (most recent call last):
 File "/home/brox/.virtualenvs/myproject/bin/fab", line 8, in <module>
   load_entry_point('Fabric==0.1.1', 'console_scripts', 'fab')()
 File "/home/brox/.virtualenvs/myproject/lib/python2.5/site-packages/setuptools-0.6c9-py2.5.egg/pkg_resources.py", line 277, in load_entry_point
 File "/home/brox/.virtualenvs/myproject/lib/python2.5/site-packages/setuptools-0.6c9-py2.5.egg/pkg_resources.py", line 2180, in load_entry_point
 File "/home/brox/.virtualenvs/myproject/lib/python2.5/site-packages/setuptools-0.6c9-py2.5.egg/pkg_resources.py", line 1913, in load
 File "/home/brox/.virtualenvs/myproject/lib/python2.5/site-packages/fabric.py", line 53, in <module>
   import win32api
ImportError: No module named win32api

At the time of writing there is a small bug in Fabric that is likely to be fixed in the near future. For now you have to manually modify the file fabric/state.py before you install. Change the line that says

win32 = sys.platform in ['win32', 'cygwin']

to

win32 = sys.platform in ['win32']

This is just to tell Fabric that Cygwin isn’t really Windows and that the win32api module therefore isn’t available. Having made the necessary change, do a regular installation from source:

$ python setup.py install

The following error message about paramiko not being found might pop up; just ignore it:

No local packages or download links found for paramiko==1.7.4
error: Could not find suitable distribution for Requirement.parse('paramiko==1.7.4')

And that’s it! You should now have a fully functional virtualenv/Fabric/pip setup. To verify that Fabric works, create a file called fabfile.py:

from fabric.api import local, run

def local_test():
    local('echo "foo"')

def remote_test():
    run('uname -s')

This file, of course, only scratches the surface of what you can do with Fabric—refer to the latest documentation for more information.

To test the fabfile, type the following:

$ fab local_test
[localhost] run: echo "foo"

Done.

The biggest issue is that of getting Fabric to play along with your SSH installation so that you can deploy on remote servers. (You did install the openssh package, right?). Try the following command, substituting test@atbrox.com with one of your own accounts:

$ fab remote_test
No hosts found. Please specify (single) host string for connection: test@atbrox.com
[test@atbrox.com] run: uname -s
Password:
[test@atbrox.com] out: Linux

Done.
Disconnecting from test@atbrox.com... done.

The next step would be to set up password-less logins, but that is a different story.

Afterthoughts: While Cygwin is a lifesaver, it has some quirks and annoyances that may or may not be an issue depending on your system configuration. For instance, on my setup the following error tends to show up randomly when using Fabric for remote deployment:

sem_init: Resource temporarily unavailable
Traceback (most recent call last):
 File "build/bdist.cygwin-1.5.25-i686/egg/fabric/main.py", line 454, in main
 File "/cygdrive/c/Users/brox/workspace/quote_finder/fabfile.py", line 187, in deploy
   _prepare_host_global()
 File "/cygdrive/c/Users/brox/workspace/quote_finder/fabfile.py", line 137, in _prepare_host_global
   if not exists(u'/usr/bin/virtualenvwrapper_bashrc'):
 File "build/bdist.cygwin-1.5.25-i686/egg/fabric/contrib/files.py", line 32, in exists
 File "/usr/lib/python2.5/contextlib.py", line 33, in __exit__
   self.gen.throw(type, value, traceback)
 File "/usr/lib/python2.5/contextlib.py", line 118, in nested
   yield vars
 File "build/bdist.cygwin-1.5.25-i686/egg/fabric/contrib/files.py", line 32, in exists
 File "build/bdist.cygwin-1.5.25-i686/egg/fabric/network.py", line 371, in host_prompting_wrapper
 File "build/bdist.cygwin-1.5.25-i686/egg/fabric/operations.py", line 422, in run
 File "channel.py", line 297, in recv_exit_status
 File "/usr/lib/python2.5/threading.py", line 368, in wait
   self.__cond.wait(timeout)
 File "/usr/lib/python2.5/threading.py", line 210, in wait
   waiter = _allocate_lock()
thread.error: can't allocate lock

This is a known problem that is not likely to go away anytime soon, due to an inherent race condition in Cygwin’s implementation of sem_init. Still, having a functional virtualenv/Fabric/pip environment on Windows is all in all pretty convenient.

There is a slew of useful articles out there if you need more information on the tools described in this article. These are my current favorites:

