Sometimes it can be useful to compile Python code for Amazon’s Elastic Mapreduce into C++ and then into a binary. The motivation could be to integrate with existing C or C++ code, or to increase performance for CPU-intensive mapper or reducer methods. Here follows a description of how to do that:
- Start a small EC2 node with an AMI similar to the one Elastic Mapreduce is using (Debian Lenny Linux)
- Skim quickly through the Shedskin tutorial
- Log into the EC2 node and install the Shedskin Python compiler
- Shedskin requires a few libraries: 1) the Boehm-Demers-Weiser garbage collector for C++, 2) PCRE (Perl Compatible Regular Expressions). See the Shedskin tutorial for detailed install instructions.
- note: The Alestic Debian AMI is fairly slim, so we had to add some more software to make Shedskin work, e.g. GDB
- Write your Python mapper or reducer program and compile it into C++ with Shedskin
- E.g. the command "python ss.py mapper.py" would generate the C++ files mapper.hpp and mapper.cpp, a Makefile and an annotated Python file mapper.ss.py.
- Optionally update the C++ code generated by Shedskin to use other C or C++ libraries
- note: with Fortran-to-C you can probably integrate your Python code with existing Fortran code (e.g. numerical/high performance computing libraries). The same goes for Cobol (e.g. in the financial industry) with OpenCobol (compiling Cobol into C). Please let us know if you try this or need help with it.
- Add -static as the first CCFLAGS parameter in the generated Makefile to make it a static executable
- Compile the C++ code into a binary with make and check that you don’t get a dynamic executable with ldd (you want a static executable)
- Run strip on the binary to make it smaller
- Upload your (ready) binary to a chosen location in Amazon S3
- Read Elastic Mapreduce Documentation on how to use the binary to run Elastic Mapreduce jobs.
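To make the "write your Python mapper" step more concrete, here is a minimal sketch of a word-count mapper. It assumes the job is run in streaming style, where the mapper reads input lines on stdin and writes tab-separated key/value pairs to stdout. The function name map_line and the word-count task are only illustrative choices, not something taken from the Shedskin or Elastic Mapreduce documentation. The code sticks to plain strings and lists, since Shedskin only handles a statically typeable subset of Python.

```python
import sys


def map_line(line):
    """Return one 'word<TAB>1' record per whitespace-separated token.

    map_line is a hypothetical helper name chosen for this sketch.
    """
    records = []
    for word in line.strip().split():
        # Keep every value a plain string so the types stay easy to infer.
        records.append(word.lower() + "\t1")
    return records


def main():
    # Assumption: input records arrive on stdin, one per line, and
    # key/value pairs are expected on stdout, separated by a tab.
    for line in sys.stdin:
        for record in map_line(line):
            print(record)


if __name__ == "__main__":
    main()
```

Before compiling, the same file can be tried as ordinary Python by piping a small text file into it, which makes it easier to tell mapper bugs apart from compilation problems.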
Note: if you skip the shedskin-related steps this approach would also work if you are looking for how to use C or C++ mappers or reducers with Elastic Mapreduce.
Note: this approach should probably also work with Cloudera’s distribution for Hadoop.
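The Makefile and ldd steps in the list above can also be scripted. The sketch below is only one possible way to do it, with some assumptions: that the generated Makefile has a line starting with "CCFLAGS=", and that ldd reports a static binary with the text "not a dynamic executable" or "statically linked". The function names add_static_flag, looks_static and check_binary are made up for this example.

```python
import subprocess


def add_static_flag(makefile_text):
    """Put -static first in the CCFLAGS line, unless it is already there.

    Assumption: the Makefile defines its flags on a line that starts
    with 'CCFLAGS='. Other lines are passed through unchanged.
    """
    patched = []
    for line in makefile_text.splitlines():
        if line.startswith("CCFLAGS="):
            flags = line[len("CCFLAGS="):].split()
            if "-static" not in flags:
                flags.insert(0, "-static")
            line = "CCFLAGS=" + " ".join(flags)
        patched.append(line)
    return "\n".join(patched) + "\n"


def looks_static(ldd_output):
    """Guess from ldd's message whether the binary is statically linked."""
    text = ldd_output.lower()
    return "not a dynamic executable" in text or "statically linked" in text


def check_binary(path):
    """Run ldd on path and report whether the binary looks static."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    # ldd may print its message on stdout or stderr, so inspect both.
    return looks_static(result.stdout + result.stderr)
```

A typical use would be to read the Makefile, write back the result of add_static_flag, run make, and then call check_binary on the resulting mapper binary before stripping and uploading it.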