Sometimes it can be useful to compile Python code for Amazon’s Elastic Mapreduce into C++ and then into a binary. The motivation could be to integrate with existing C or C++ code, or to increase performance for CPU-intensive mapper or reducer methods. Here follows a description of how to do that:
- Start a small EC2 node with an AMI similar to the one Elastic Mapreduce is using (Debian Lenny Linux)
- note: We used Alestic's ami-ff46a796
- Skim quickly through the Shedskin tutorial
- Log into the EC2 node and install the Shedskin Python compiler
- Shedskin requires a few libraries: 1) the Boehm-Demers-Weiser garbage collector for C++, 2) PCRE (Perl Compatible Regular Expressions). See the Shedskin tutorial for detailed install instructions.
- note: The Alestic Debian AMI is fairly slim, so we had to add some more software to make Shedskin work, e.g. GDB
- Write your Python mapper or reducer program and compile it into C++ with Shedskin
- E.g. the command python ss.py mapper.py would generate the C++ files mapper.hpp and mapper.cpp, a Makefile and an annotated Python file mapper.ss.py.
- Optionally update the C++ code generated by Shedskin to use other C or C++ libraries
- note: with Fortran-to-C you can probably integrate your Python code with existing Fortran code (e.g. numerical/high performance computing libraries). The same goes for Cobol (e.g. in the financial industry) with OpenCobol (which compiles Cobol into C). Please let us know if you try this or need help with it.
- Add -static as the first CCFLAGS parameter in the generated Makefile to make it a static executable
- Compile the C++ code into a binary with make, then run ldd on the binary to check that it is not a dynamic executable (you want a static executable)
- Run strip on the binary to make it smaller
- Upload your (ready) binary to a chosen location in Amazon S3
- e.g. via the command line with S3CMD, with a UI using S3Fox or Cloudberry S3 Explorer, or programmatically with Boto.
- Read the Elastic Mapreduce documentation on how to use the binary to run Elastic Mapreduce jobs.
- note: Peter Skomoroch has written a good tutorial for Elastic Mapreduce
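
To make the "write your Python mapper" step more concrete, here is a minimal sketch of a streaming-style word-count mapper. It is not the mapper we used; the names map_line and main are hypothetical, and it assumes the job feeds text lines on standard input and expects tab-separated key/value pairs on standard output. It sticks to plain lists, tuples and strings, since Shedskin only handles a restricted, statically typeable subset of Python.

```python
import sys


def map_line(line):
    """Turn one input line into a list of (word, 1) pairs."""
    pairs = []
    for word in line.strip().split():
        pairs.append((word.lower(), 1))
    return pairs


def main():
    # Read lines from stdin and emit "word<TAB>count" lines on stdout.
    for line in sys.stdin:
        for word, count in map_line(line):
            sys.stdout.write("%s\t%d\n" % (word, count))


if __name__ == "__main__":
    main()
```

Saved as mapper.py, this can be tried locally by piping a text file into it before compiling it with Shedskin.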
Note: if you skip the Shedskin-related steps, this approach also works if you are looking for how to use C or C++ mappers or reducers with Elastic Mapreduce.
Note: this approach should probably also work with Cloudera’s distribution for Hadoop.
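
The Makefile edit from the steps above (adding -static as the first CCFLAGS parameter) can also be scripted. The following is a rough sketch only: the function name add_static_flag is hypothetical, and it assumes the generated Makefile contains a line that starts with exactly "CCFLAGS=". Check your own Makefile before relying on it.

```python
def add_static_flag(makefile_text):
    """Return makefile_text with -static as the first CCFLAGS value.

    Assumption: the flags live on a line starting with "CCFLAGS=".
    Lines that already contain -static are left unchanged.
    """
    patched = []
    for line in makefile_text.splitlines():
        if line.startswith("CCFLAGS="):
            flags = line[len("CCFLAGS="):].split()
            if "-static" not in flags:
                line = "CCFLAGS=" + " ".join(["-static"] + flags)
        patched.append(line)
    return "\n".join(patched) + "\n"
```

To use it, read the generated Makefile into a string, pass it through add_static_flag, and write the result back before running make. Running ldd on the resulting binary is still the way to confirm that it really is static.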
Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or
contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce – info@atbrox.com. See our posting for an example parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce