Nov 14
We developed a tool for scalable language processing for our customer Lingit using Amazon’s Elastic Mapreduce.
More details: http://aws.amazon.com/solutions/case-studies/atbrox/
Contact us if you need help with Hadoop/Elastic Mapreduce.
Tagged with: data processing • elastic mapreduce • hadoop • language processing • nlp
Oct 07
Sometimes it can be useful to compile Python code for Amazon’s Elastic Mapreduce into C++ and then into a binary. The motivation could be to integrate with existing C or C++ code, or to increase performance for CPU-intensive mappers or reducers. The steps are as follows:
- Start a small EC2 node with an AMI similar to the one Elastic Mapreduce uses (Debian Lenny Linux)
- Skim quickly through the Shedskin tutorial
- Log into the EC2 node and install the Shedskin Python compiler
- Write your Python mapper or reducer program and compile it into C++ with Shedskin
- E.g. the command python ss.py mapper.py would generate the C++ files mapper.hpp and mapper.cpp, a Makefile, and an annotated Python file mapper.ss.py.
- Optionally update the C++ code generated by Shedskin to use other C or C++ libraries
- Note: with Fortran-to-C you can probably integrate your Python code with existing Fortran code (e.g. numerical/high performance computing libraries). The same probably applies to Cobol (e.g. in the financial industry) with OpenCobol (which compiles Cobol into C). Please let us know if you try this or need help with it.
- Add -static as the first CCFLAGS parameter in the generated Makefile to make it a static executable
- Compile the C++ code into a binary with make, then run ldd on the binary to check that it is a static executable rather than a dynamic one
- Run strip on the binary to make it smaller
- Upload your (ready) binary to a chosen location in Amazon S3
- Read the Elastic Mapreduce documentation on how to use the binary to run Elastic Mapreduce jobs.
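To make the "write your Python mapper" step concrete, here is a minimal sketch of a word-count mapper in the streaming style (lines in on stdin, tab-separated key/value records out on stdout). The file name mapper.py matches the example above; the function name map_line and the word-count task are our own illustrative assumptions. It only uses plain strings, lists and the sys module, which we assume falls inside the restricted Python subset Shedskin can compile, but we have not verified it against a particular Shedskin version.

```python
import sys


def map_line(line):
    # Hypothetical helper: turn one input line into "word<TAB>1" records.
    records = []
    for word in line.strip().split():
        records.append(word.lower() + "\t1")
    return records


def main():
    # Streaming mappers read input lines from stdin and
    # write key/value records to stdout, one per line.
    for line in sys.stdin:
        for record in map_line(line):
            sys.stdout.write(record + "\n")


if __name__ == "__main__":
    main()
```

You can try it locally before compiling, e.g. by piping a text file into python mapper.py and checking that the output is one word and a count of 1 per line.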
Note: if you skip the Shedskin-related steps, this approach also works for using C or C++ mappers or reducers with Elastic Mapreduce.
Note: this approach should probably also work with Cloudera’s distribution for Hadoop.
Tagged with: aws • c++ • elastic mapreduce • hadoop • mapreduce • python • shedskin