Nov 14
We developed a tool for scalable language processing for our customer Lingit using Amazon’s Elastic Mapreduce.
More details: http://aws.amazon.com/solutions/case-studies/atbrox/
Contact us if you need help with Hadoop/Elastic Mapreduce.
Tagged with: amazon • aws • data processing • elastic mapreduce • hadoop • language processing • nlp
Oct 07
Sometimes it can be useful to compile Python code for Amazon’s Elastic Mapreduce into C++ and then into a binary. The motivation could be to integrate with existing C or C++ code, or to increase performance for CPU-intensive mappers or reducers. Here is a description of how to do that:
- Start a small EC2 node with AMI similar to the one Elastic Mapreduce is using (Debian Lenny Linux)
- Skim quickly through the Shedskin tutorial
- Log into the EC2 node and install the Shedskin Python compiler
- Write your Python mapper or reducer program and compile it into C++ with Shedskin
- E.g. the command python ss.py mapper.py would generate the C++ files mapper.hpp and mapper.cpp, a Makefile and an annotated Python file mapper.ss.py.
- Optionally update the C++ code generated by Shedskin to use other C or C++ libraries
- Note: with Fortran-to-C you can probably integrate your Python code with existing Fortran code (e.g. numerical/high performance computing libraries). The same goes for Cobol (e.g. in the financial industry) with OpenCobol, which compiles Cobol into C. Please let us know if you try this or need help with it.
- Add -static as the first CCFLAGS parameter in the generated Makefile to make it a static executable
- Compile the C++ code into a binary with make, then run ldd on the binary to check that it is not a dynamic executable (you want a static executable)
- Run strip on the binary to make it smaller
- Upload your (ready) binary to a chosen location in Amazon S3
- Read the Elastic Mapreduce documentation on how to use the binary to run Elastic Mapreduce jobs.
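As an illustration of the kind of mapper.py the steps above start from, here is a minimal word-count sketch. It assumes the usual streaming convention of reading records from stdin and writing tab-separated key/value pairs to stdout, and it sticks to plain lists and strings on the assumption that this keeps it within the Python subset Shedskin can compile; we have not checked it against a specific Shedskin version. The names map_line and format_pair are made up for this example.

```python
import sys


def map_line(line):
    """Split one input line into lowercase words."""
    return [word.lower() for word in line.split()]


def format_pair(word):
    """Format a word as a tab-separated key/value record with count 1."""
    return word + "\t1"


def main():
    # Assumption: input arrives on stdin, one record per line, and the
    # framework expects one "key<TAB>value" record per output line.
    for line in sys.stdin:
        for word in map_line(line):
            print(format_pair(word))


if __name__ == "__main__":
    main()
```

Before compiling, the script can be tried locally by piping a text file into it, which makes it easy to compare the output of the Python version with the output of the compiled binary later.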
Note: if you skip the Shedskin-related steps, this approach also works if you are looking for how to use C or C++ mappers or reducers with Elastic Mapreduce.
Note: this approach should probably also work with Cloudera’s distribution for Hadoop.
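The step of adding -static to the generated Makefile can be scripted if you rebuild often. The sketch below assumes the Makefile contains a line that starts with exactly CCFLAGS= (check your generated Makefile, since the layout may differ between Shedskin versions); the function name add_static_flag is hypothetical.

```python
def add_static_flag(makefile_text):
    """Return the Makefile text with -static as the first CCFLAGS value.

    Assumption: the flags are set on a line starting with "CCFLAGS=".
    Lines that already contain -static are left unchanged.
    """
    prefix = "CCFLAGS="
    patched = []
    for line in makefile_text.split("\n"):
        if line.startswith(prefix):
            flags = line[len(prefix):]
            if "-static" not in flags.split():
                line = prefix + "-static " + flags
        patched.append(line)
    return "\n".join(patched)
```

To use it, read the generated Makefile into a string, pass it through add_static_flag, and write the result back before running make. The ldd check on the resulting binary is still worth doing by hand.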
Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox at info@atbrox.com if you need help with development or parallelization of algorithms for Hadoop/Mapreduce. See our posting for an example of parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce.
Tagged with: amazon • aws • c++ • elastic mapreduce • hadoop • mapreduce • python • shedskin
Sep 07
We are here to help you:
- Understand if and how the cloud can be cost-efficient in your setting
- Efficiently analyze large data sets using the cloud
- Architect, develop and deploy scalable and reliable software for the cloud
- Adapt and migrate your existing data and software to the cloud
Technologies and methods we (non-exclusively) use:
Our motto is Simplicity, Automation and Scalability.
If you are considering using cloud computing, please drop us a line to info (at) atbrox.com
Tagged with: amazon • automation • aws • cloud • data analysis • elastic mapreduce • hadoop • mapreduce • scalability • simplicity