Nov 11

By default, Elastic Mapreduce reads its input from and stores its output to S3. When you need to access other AWS services, e.g. SQS queues or the database services SimpleDB and RDS (MySQL), the best approach from Python is to use Boto. To get Boto to work with Elastic Mapreduce you need to load it dynamically in each mapper and reducer. Cloudera’s Jeff Hammerbacher outlined how to do that using the Hadoop Distributed Cache, and Peter Skomoroch suggested how to load Boto to access Elastic Block Store (EBS). This posting builds on those ideas and describes in detail how to do it.

How to combine Elastic Mapreduce with other AWS Services

This posting shows how to load boto in an Elastic Mapreduce mapper and gives a simple example of how to use SimpleDB from the same mapper. For accessing other AWS services from Elastic Mapreduce, e.g. SQS, check out the Boto documentation (it is quite easy once the boto + EMR integration is in place).

Other tools used (prerequisites):

Step 1 – getting and preparing the Boto library

wget http://boto.googlecode.com/files/boto-1.8d.tar.gz
# note: using virtualenv can be useful if you want to
# keep your local Python installation clean
tar -zxvf boto-1.8d.tar.gz ; cd boto-1.8d ; python setup.py install
cd /usr/local/lib/python2.6/dist-packages/boto-1.8d-py2.6.egg
zip -r boto.mod boto
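
If the zip command is not at hand, the same archive can be built from Python. The following is a minimal sketch and not part of the original setup: the helper name build_module_archive is made up for illustration, it assumes Python 2.7 or later, and it assumes the package directory (e.g. boto) sits directly under the given parent directory.

```python
import os
import zipfile


def build_module_archive(parent_dir, package, archive_path):
    """Zip parent_dir/package into archive_path, like `zip -r boto.mod boto`.

    Hypothetical helper for illustration; returns the stored entry names.
    """
    package_root = os.path.join(parent_dir, package)
    names = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for dirpath, _dirnames, filenames in os.walk(package_root):
            for filename in sorted(filenames):
                full_path = os.path.join(dirpath, filename)
                # Store paths relative to parent_dir so that the package
                # directory is the top-level entry inside the archive.
                relative = os.path.relpath(full_path, parent_dir)
                arcname = relative.replace(os.sep, "/")
                archive.write(full_path, arcname)
                names.append(arcname)
    return names
```

For the layout used in step 1 this would be called as build_module_archive("/usr/local/lib/python2.6/dist-packages/boto-1.8d-py2.6.egg", "boto", "boto.mod").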

Step 2 – mapper that loads boto.mod and uses it to access SimpleDB

# this was tested by adding code underneath to the mapper
# s3://elasticmapreduce/samples/wordcount/wordSplitter.py

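# get boto library (wordSplitter.py already imports sys)
# -cacheFile makes boto.mod available in the task's working directory
sys.path.append(".")
import zipimport
importer = zipimport.zipimporter('boto.mod')
boto = importer.load_module('boto')

# access simpledb
sdb = boto.connect_sdb("YourAWSKey", "YourSecretAWSKey")
# create_domain() is a method of the connection, not of the boto module
sdb_domain = sdb.create_domain("mymapreducedomain") # or sdb.get_domain()
# ..
# write words to simpledb (inside the mapper's loop over input lines)
  for word in pattern.findall(line):
      item = sdb_domain.create_item(word)
      item["reversedword"] = word[::-1]
      item.save()
      # ...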
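
The loading mechanism can be tried locally without Hadoop or Boto. The sketch below is a stand-in under stated assumptions, not the code from the mapper: instead of calling zipimporter.load_module() (deprecated in recent Python versions) it puts the archive itself on sys.path, which makes the regular import system read the package through zipimport. It assumes Python 3, and the function name import_from_archive is hypothetical.

```python
import importlib
import sys


def import_from_archive(archive_path, module_name):
    """Import module_name from a zip archive such as boto.mod.

    Hypothetical helper: placing the archive on sys.path lets the
    regular import machinery read the package through zipimport.
    The archive does not need a .zip extension.
    """
    if archive_path not in sys.path:
        sys.path.insert(0, archive_path)
    # Drop cached finder state so a freshly created archive is seen.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)
```

In a mapper this would be used as boto = import_from_archive("boto.mod", "boto"), after which the SimpleDB calls stay the same.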

Step 3 – json config file – bototest.json – for Elastic Mapreduce Ruby Client

[
  {
    "Name": "Step 1: testing boto with elastic mapreduce",
    "ActionOnFailure": "<action_on_failure>",
    "HadoopJarStep": {
      "Jar": "/home/hadoop/contrib/streaming/hadoop-0.18-streaming.jar",
      "Args": [
        "-input", "s3n://elasticmapreduce/samples/wordcount/input",
        "-output", "s3n://yours3bucket/result",
        "-mapper", "s3://yours3bucket/botoWordSplitter.py",
        "-cacheFile", "s3n://yours3bucket/boto.mod#boto.mod"
      ]
    }
  }
]
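
Hand-edited JSON is easy to get subtly wrong (a single trailing comma is enough to make it invalid), so one option is to generate the step file from Python. This is only a sketch: the function name build_step_config and its parameters are made up here, while the keys and streaming arguments are copied from bototest.json.

```python
import json


def build_step_config(bucket, mapper="botoWordSplitter.py", module_archive="boto.mod"):
    """Return the Elastic Mapreduce step list as a JSON string.

    Hypothetical helper; keys and arguments mirror bototest.json.
    """
    step = {
        "Name": "Step 1: testing boto with elastic mapreduce",
        # Placeholder that --param substitutes on the command line.
        "ActionOnFailure": "<action_on_failure>",
        "HadoopJarStep": {
            "Jar": "/home/hadoop/contrib/streaming/hadoop-0.18-streaming.jar",
            "Args": [
                "-input", "s3n://elasticmapreduce/samples/wordcount/input",
                "-output", "s3n://%s/result" % bucket,
                "-mapper", "s3://%s/%s" % (bucket, mapper),
                # name after '#' is the link created in the working directory
                "-cacheFile", "s3n://%s/%s#%s" % (bucket, module_archive, module_archive),
            ],
        },
    }
    return json.dumps([step], indent=2)
```

Writing the result of build_step_config("yours3bucket") to bototest.json gives a file that is guaranteed to parse.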

Step 4 – Copy necessary files to s3

s3cmd put boto.mod s3://yours3bucket
s3cmd put botoWordSplitter.py s3://yours3bucket

Step 5 – And run your Elastic Mapreduce job

elastic-mapreduce --create \
                  --stream \
                  --json bototest.json \
                  --param "<action_on_failure>=TERMINATE_JOB_FLOW"

Conclusion
This posting showed how to dynamically load boto and use it to access one other AWS service – SimpleDB – from Elastic Mapreduce. Boto supports most AWS services, so the same integration approach should also work for other AWS services, e.g. SQS (Queuing Service), RDS (MySQL Service) and EC2. Check out the Boto API documentation or Controlling the Cloud with Python for details.

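Note: a very similar integration approach should work for most Python libraries, including those that use or wrap C/C++ code (e.g. machine learning libraries such as PyML and others). In that case it might be necessary to do step 1 on a Debian AMI similar to the one Elastic Mapreduce is using; check out a previous posting for more info about such AMIs. Keep in mind that zipimport cannot load compiled extension modules directly from an archive, so such libraries have to be unpacked on the node before they are imported.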


Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce – info@atbrox.com. See our posting for an example of parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce.

4 Responses to “How to combine Elastic Mapreduce/Hadoop with other Amazon Web Services”

  1. How to combine Elastic Mapreduce/Hadoop with other Amazon Web Services › ec2base Says:

    [...] http://atbrox.com/2009/11/11/how-to-combine-elastic-mapreducehadoop-with-other-amazon-web-services/ [...]

  2. Ben Darfler Says:

    Thanks for this post, I’m waiting for a test run to complete on EMR using your zipimport trick. I’m curious though, do you know why trying to use boto with the EMR -archives flag doesn’t work? I have simplejson working fine this way but it seems to fail for boto.

  3. nic Says:

    This article is very helpful, but it may be a little outdated. In April 2010 Amazon started to allow Elastic MapReduce users to execute a bootstrap script on EMR instances. This will solve the problems of most people by letting you download the files you need and even install libraries via apt or Python’s easy_install.

    Take a look at http://aws.typepad.com/aws/2010/04/new-elastic-mapreduce-feature-bootstrap-actions.html

  4. nic Says:

    More updated info also in this same blog: http://atbrox.com/2010/10/01/programmatic-deployment-to-elastic-mapreduce-with-boto-and-bootstrap-action/
