Oct 07

Sometimes it can be useful to compile Python code for Amazon’s Elastic Mapreduce into C++ and then into a binary. The motivation could be to integrate with existing C or C++ code, or to increase performance for CPU-intensive mapper or reducer methods. The steps below describe how to do that:

  1. Start a small EC2 node with an AMI similar to the one Elastic Mapreduce is using (Debian Lenny Linux)
  2. Skim quickly through the Shedskin tutorial
  3. Log into the EC2 node and install the Shedskin Python compiler
  4. Write your Python mapper or reducer program and compile it into C++ with Shedskin
    • E.g. the command python ss.py mapper.py would generate the C++ files mapper.hpp and mapper.cpp, a Makefile, and an annotated Python file mapper.ss.py.
  5. Optionally update the C++ code generated by Shedskin to use other C or C++ libraries
    • Note: with Fortran-to-C you can probably integrate your Python code with existing Fortran code (e.g. numerical/high performance computing libraries). The same likely applies to Cobol (e.g. in the financial industry) with OpenCobol, which compiles Cobol into C. Please let us know if you try this or need help with it.
  6. Add -static as the first CCFLAGS parameter in the generated Makefile to make it a static executable
  7. Compile the C++ code into a binary with make, then run ldd on the binary to check that it is not a dynamic executable (you want a static executable)
  8. Run strip on the binary to make it smaller
  9. Upload your (ready) binary to a chosen location in Amazon S3
  10. Read the Elastic Mapreduce documentation on how to use the binary to run Elastic Mapreduce jobs
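
As a rough illustration of step 4, here is a minimal sketch of what mapper.py could look like: a word-count mapper. It assumes the job is run in Hadoop Streaming style, where the mapper reads lines from standard input and writes tab-separated key/value records to standard output. The function names (map_line, format_pair, run_mapper) are made up for this example, and the code is kept to plain functions, lists and strings because a restricted-subset compiler tends to prefer that. Whether your Shedskin version compiles it unchanged is an assumption you should check.

```python
import sys


def map_line(line):
    """Split one input line into lowercase words, each paired with a count of 1."""
    pairs = []
    for word in line.strip().lower().split():
        pairs.append((word, 1))
    return pairs


def format_pair(word, count):
    """Render a key/value pair as a tab-separated record (assumed streaming format)."""
    return word + "\t" + str(count)


def run_mapper(lines, out):
    """Consume lines from any iterable and write one record per word to out."""
    for line in lines:
        for word, count in map_line(line):
            out.write(format_pair(word, count) + "\n")


# To use this file as an actual mapper, end it with a call such as:
#     run_mapper(sys.stdin, sys.stdout)
```

Keeping the per-line logic in map_line makes it easy to test in ordinary Python before compiling, so problems in the mapper logic show up before you get to the C++ build on the EC2 node.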

Note: if you skip the Shedskin-related steps, this approach also works if you are looking for how to use C or C++ mappers or reducers with Elastic Mapreduce.

Note: this approach should probably also work with Cloudera’s distribution for Hadoop.


Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox at info@atbrox.com if you need help with development or parallelization of algorithms for Hadoop/Mapreduce. See our posting for an example of parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce.

7 Responses to “How to use C++ Compiled Python for Amazon’s Elastic Mapreduce (Hadoop)”

  1. Alejandro Barrero Plazas (coyr) 's status on Wednesday, 07-Oct-09 15:04:29 UTC - Identi.ca Says:

    […] http://atbrox.com/2009/10/07/how-to-use-c-compiled-python-for-amazons-elastic-mapreduce-hadoop/ a few seconds ago from seesmic […]

  2. How to use C++ Compiled Python for Amazon’s Elastic Mapreduce (Hadoop) › ec2base Says:

    […] http://atbrox.com/2009/10/07/how-to-use-c-compiled-python-for-amazons-elastic-mapreduce-hadoop/ […]

  3. शंतनू महाजन (shantanoo) 's status on Thursday, 08-Oct-09 15:58:08 UTC - Identi.ca Says:

    […] http://atbrox.com/2009/10/07/how-to-use-c-compiled-python-for-amazons-elastic-mapreduce-hadoop/ a few seconds ago from Gravity […]

  4. शंतनू महाजन (shantanoo) 's status on Thursday, 08-Oct-09 16:01:45 UTC - Identi.ca Says:

    […] http://atbrox.com/2009/10/07/how-to-use-c-compiled-python-for-amazons-elastic-mapreduce-hadoop/ about 5 minutes ago from Gravity […]

  5. Andy Says:

    Small correction: you probably wanted to mention S3Fox, which is an S3 management tool, not Elastic Fox, which is the tool for EC2. You can also upload files using CloudBerry Explorer freeware.

  6. Amund Says:

    Andy, thanks for the correction. Will fix it right away. (I primarily use S3CMD myself)

    Amund

  7. Recent Experiences Being on Hacker News First Page | Amund Tveit's Blog Says:

    […] page of Hacker News. The other 6 where: #1 (HN#1), #2 (HN#2), #3 (HN#3), #4 (HN#4), #5 (HN#5) and #6 […]
