We developed a tool for scalable language processing for our customer Lingit using Amazon’s Elastic Mapreduce.
More details: http://aws.amazon.com/solutions/case-studies/atbrox/
Contact us if you need help with Hadoop/Elastic Mapreduce.
Elastic Mapreduce's default behavior is to read from and store to S3. When you need to access other AWS services, e.g. SQS queues or the database services SimpleDB and RDS (MySQL), the best approach from Python is to use Boto. To get Boto to work with Elastic Mapreduce you need to dynamically load it on each mapper and reducer. Cloudera's Jeff Hammerbacher outlined how to do that using Hadoop Distributed Cache, and Peter Skomoroch suggested how to load Boto to access Elastic Blockstore (EBS); this posting is based on those ideas and gives a detailed description of how to do it.
How to combine Elastic Mapreduce with other AWS Services
This posting shows how to load Boto in an Elastic Mapreduce mapper and gives a simple example of how to use SimpleDB from the same mapper. For accessing other AWS services, e.g. SQS, from Elastic Mapreduce, check out the Boto documentation (it is quite easy once the Boto + EMR integration is in place).
Other tools used (prerequisites):
Step 1 – getting and preparing the Boto library
wget http://boto.googlecode.com/files/boto-1.8d.tar.gz
# note: using virtualenv can be useful if you want to
# keep your local Python installation clean
tar -zxvf boto-1.8d.tar.gz ; cd boto-1.8d ; python setup.py install
cd /usr/local/lib/python2.6/dist-packages/boto-1.8d-py2.6.egg
zip -r boto.mod boto
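Before uploading boto.mod to S3 it can be worth checking locally that a module really can be loaded from a zip this way. Here is a small self-contained sketch of the zipimport mechanism the mapper in step 2 relies on; it uses a dummy package, mylib, instead of Boto, so the names are illustrative only:

```python
import os
import tempfile
import zipfile
import zipimport

# build a tiny package and zip it the same way as boto.mod
workdir = tempfile.mkdtemp()
pkgdir = os.path.join(workdir, "mylib")
os.mkdir(pkgdir)
with open(os.path.join(pkgdir, "__init__.py"), "w") as f:
    f.write("VERSION = '1.8d'\n")

zippath = os.path.join(workdir, "mylib.mod")
zf = zipfile.ZipFile(zippath, "w")
zf.write(os.path.join(pkgdir, "__init__.py"), "mylib/__init__.py")
zf.close()

# the same load_module call the mapper uses for boto.mod
importer = zipimport.zipimporter(zippath)
mylib = importer.load_module("mylib")
print(mylib.VERSION)
```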
Step 2 – mapper that loads boto.mod and uses it to access SimpleDB
# this was tested by adding code underneath to the mapper
# s3://elasticmapreduce/samples/wordcount/wordSplitter.py

# get boto library
sys.path.append(".")
import zipimport
importer = zipimport.zipimporter('boto.mod')
boto = importer.load_module('boto')

# access simpledb
sdb = boto.connect_sdb("YourAWSKey", "YourSecretAWSKey")
sdb_domain = sdb.create_domain("mymapreducedomain") # or get_domain()
# ..
# write words to simpledb
for word in pattern.findall(line):
    item = sdb_domain.create_item(word)
    item["reversedword"] = word[::-1]
    item.save()
# ...
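The SimpleDB write loop above can be dry-run without AWS credentials by letting a plain dict stand in for the SimpleDB domain; the real mapper uses boto's create_item()/save() instead. A hedged sketch (the word pattern is an assumption; wordSplitter.py may use a different one):

```python
import re

# assumption: a simple word-matching pattern like wordSplitter.py's
pattern = re.compile(r"[A-Za-z]+")

def map_line(line, domain):
    # one "item" per word; attribute "reversedword" holds the reversed word,
    # mirroring the create_item()/save() calls in the real mapper
    for word in pattern.findall(line):
        item = domain.setdefault(word, {})
        item["reversedword"] = word[::-1]
    return domain

fake_domain = {}
map_line("boto meets elastic mapreduce", fake_domain)
print(fake_domain["elastic"]["reversedword"])  # citsale
```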
Step 3 – json config file – bototest.json – for Elastic Mapreduce Ruby Client
[
  {
    "Name": "Step 1: testing boto with elastic mapreduce",
    "ActionOnFailure": "<action_on_failure>",
    "HadoopJarStep": {
      "Jar": "/home/hadoop/contrib/streaming/hadoop-0.18-streaming.jar",
      "Args": [
        "-input",     "s3n://elasticmapreduce/samples/wordcount/input",
        "-output",    "s3n://yours3bucket/result",
        "-mapper",    "s3://yours3bucket/botoWordSplitter.py",
        "-cacheFile", "s3n://yours3bucket/boto.mod#boto.mod"
      ]
    }
  }
]
Step 4 – Copy necessary files to s3
s3cmd put boto.mod s3://yours3bucket
s3cmd put botoWordSplitter.py s3://yours3bucket
Step 5 – And run your Elastic Mapreduce job
elastic-mapreduce --create \
  --stream \
  --json bototest.json \
  --param "<action_on_failure>=TERMINATE_JOB_FLOW"
Conclusion
This posting showed how to dynamically load Boto and use it to access one other AWS service – SimpleDB – from Elastic Mapreduce. Boto supports most AWS services, so the same integration approach should also work for other AWS services, e.g. SQS (Queuing Service), RDS (MySQL Service) and EC2; check out the Boto API documentation or Controlling the Cloud with Python for details.
Note: a very similar integration approach should work for most Python libraries, also those that use/wrap C/C++ code (e.g. machine learning libraries such as PyML and others), but then step 1 might need to be done on Debian AMIs similar to those Elastic Mapreduce uses; check out a previous posting for more info about such AMIs.
Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce – info@atbrox.com. See our posting on parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce for an example.
80legs is a company specializing in the crawling and preprocessing part of search: you upload your seed URLs (where to start crawling), configure your crawl job (depth, domain restrictions etc.), and run existing or custom analysis code (uploaded Java jar files) on the fetched pages. When you upload seed files, 80legs does some filtering before starting to crawl (e.g. removing seed URLs which are not well-formed), and also handles domain throttling and robots.txt (and perhaps other things).
Computational model: since you can run custom code per page, it can be seen as the mapper part of a MapReduce (Hadoop) job (one map() call per page); for reduce-type processing (over several pages) you need to move your data elsewhere (e.g. to EC2 in the cloud). Side note: another domain with “reduce-less” mapreduce is quantum computing, check out Michael Nielsen’s Quantum Computing for Everyone.
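The map-only model can be sketched like this (process_page and the link-counting analysis are made-up stand-ins for illustration, not 80legs' actual API):

```python
def process_page(url, html):
    # custom per-page analysis: here, counting links as a stand-in example;
    # in 80legs this role is played by your uploaded analysis code
    return (url, html.count("<a "))

pages = [
    ("http://example.com/a", '<p>hi</p><a href="/x">x</a>'),
    ("http://example.com/b", '<a href="/y">y</a><a href="/z">z</a>'),
]

# the "mapper": one call per fetched page, no shared state between calls;
# any aggregation over pages (the reduce step) must happen elsewhere
mapped = [process_page(url, html) for url, html in pages]
print(mapped)  # [('http://example.com/a', 1), ('http://example.com/b', 2)]
```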
Testing 80legs
Note: we have so far only tried the built-in functionality, not custom code.
1) URL extraction
Job description: We used a seed of approximately 1,000 URLs and crawled and analyzed ~2.8 million pages within those domains. The regexp configuration was used (we only provided the URL matching regexp).
Result: Approximately 1 billion URLs were found, and results came in 106 zip files (each ~14MB packed and ~100MB unpacked), in addition to zip files of the URLs that were crawled.
Note: Based on a few smaller similar jobs it looks like the parallelism of 80legs is somewhat dependent on the number of domains in the crawl, and perhaps also on their ordering. If you have a set of URLs where each domain has more than one URL, it can be useful to randomize your seed URL file before uploading and running the crawl job, e.g. by using rl or coreutils’ shuf.
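The randomization suggested above can also be done in a few lines of Python instead of rl/shuf, e.g.:

```python
import random

# example seed list where each domain has several URLs in a row
seeds = [
    "http://a.example/1", "http://a.example/2", "http://a.example/3",
    "http://b.example/1", "http://b.example/2", "http://b.example/3",
]

random.seed(42)  # deterministic only for this example
shuffled = seeds[:]
random.shuffle(shuffled)  # spreads same-domain URLs across the file
print(shuffled)
```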
2) Fetching pages
Job description: We built a set of ~80k URLs that we wanted to fetch as HTML (using their sample application 80App Get Raw HTML) for further processing. The URLs were split into 4 jobs of ~20k URLs each.
Result: Each job took roughly one hour (they all ran in parallel, so the total time spent was 1 hour). We ended up with 5 zip files per job, each zip file holding ~25MB of data (~100MB unpacked), i.e. ~4*5*100MB = 2GB of raw HTML when unpacked across all jobs.
Conclusion
80legs is an interesting service that has already proved useful for us, and we will continue to use it in combination with AWS and EC2. Custom code still needs to be built (e.g. related to ajax crawling).
Best regards,
Amund Tveit, co-founder of Atbrox
SimpleDB is a service primarily for storing and querying structured data (it can e.g. be used for a product catalog with descriptive features per product, or an academic event service with extracted features such as event dates, locations, organizers and topics). If one wants “heavier data” in SimpleDB, e.g. video or images, a good approach would be to store paths to Hadoop DFS or S3 objects in the attributes instead of storing the data directly.
Unstructured Search for SimpleDB
The Structure of SimpleDB
SimpleDB is roughly a persistent hashtable of hashtables, where each row (a named item in the outer hashtable) has another hashtable with up to 256 key-value pairs (called attributes). Each attribute value can be 1024 bytes, so 256 kilobytes in total of values per row (note: twice that amount if you also store data in the keys, plus 1024 bytes in the item name). Check out Wikipedia for detailed SimpleDB storage characteristics.
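As a mental model, the structure and limits can be sketched as a dict of dicts with checks (a toy model for illustration, not boto's API):

```python
# SimpleDB limits as described above
MAX_ATTRIBUTES = 256       # key-value pairs per item
MAX_VALUE_BYTES = 1024     # bytes per attribute value

def put_attribute(domain, item_name, key, value):
    # domain is the outer hashtable; each item is an inner hashtable
    item = domain.setdefault(item_name, {})
    if len(value.encode("utf-8")) > MAX_VALUE_BYTES:
        raise ValueError("attribute value over 1024 bytes")
    if key not in item and len(item) >= MAX_ATTRIBUTES:
        raise ValueError("item already has 256 attributes")
    item[key] = value

domain = {}
put_attribute(domain, "product-42", "color", "blue")
print(domain)  # {'product-42': {'color': 'blue'}}
```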
Inverted files
Inverted files are a common way of representing indices for unstructured search. In their basic form they (logically) map a word to a list of the pages or files the word occurs in. When a query arrives, one looks up each query word in the inverted file and finds the pages or files where the words occur. (Note: if you are curious about inverted file representations, check out the survey Inverted files for text search engines.)
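In code, a basic inverted file and query lookup might look like this (a minimal in-memory sketch, not the SimpleDB-backed version discussed below):

```python
def build_inverted_file(pages):
    # each word maps to the set of pages it occurs on
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index

def query(index, terms):
    # a query intersects the page sets for its words
    sets = [index.get(t, set()) for t in terms.lower().split()]
    return set.intersection(*sets) if sets else set()

pages = {
    "p1": "simple db search",
    "p2": "inverted file search",
}
idx = build_inverted_file(pages)
print(sorted(query(idx, "search")))           # ['p1', 'p2']
print(sorted(query(idx, "inverted search")))  # ['p2']
```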
One way of representing inverted files on SimpleDB is to map the inverted file onto the attributes, i.e. have one SimpleDB domain where each item is a word (term) and the item’s attributes store the list of URLs containing that word. Since each URL contains many words, it can be useful to have a separate SimpleDB domain containing a mapping from hash of URL to URL, and use the URL hash in the inverted file (this keeps the inverted file smaller). In the draft code we created 250 key-value attributes, where each key was a string from “0” to “249” and each corresponding value contained URL hashes (and positions of the term) joined with two different string separators. If there is too little space per item – e.g. for stop words – one could “wrap” the inverted file entry by adding the same term with an incremental postfix (note: if that also gave too little space, one could wrap over SimpleDB domains as well).
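The packing scheme can be sketched roughly like this; the separator, attribute count and byte limit follow the description above, but the function names and the single-separator format are illustrative stand-ins for the actual draft code:

```python
import hashlib

SEPARATOR = "|"        # assumption: the draft code used two separators
MAX_VALUE_BYTES = 1024 # SimpleDB's per-attribute value limit
NUM_ATTRIBUTES = 250   # attribute keys "0".."249" as described above

def pack_posting_list(url_hashes):
    # join hashes with a separator, spilling into the next numbered
    # attribute whenever the 1024-byte value limit would be exceeded
    attributes = {}
    slot, current = 0, ""
    for h in url_hashes:
        candidate = current + SEPARATOR + h if current else h
        if len(candidate) > MAX_VALUE_BYTES:
            attributes[str(slot)] = current
            slot, candidate = slot + 1, h
            if slot >= NUM_ATTRIBUTES:
                raise ValueError("item full - wrap with a postfixed term")
        current = candidate
    if current:
        attributes[str(slot)] = current
    return attributes

urls = ["http://example.com/%d" % i for i in range(100)]
hashes = [hashlib.md5(u.encode()).hexdigest() for u in urls]
packed = pack_posting_list(hashes)
print(len(packed))  # 4 attribute slots for 100 md5 hashes
```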
Preliminary query latency results
Warning: the data set used was NLTK’s inaugural collection, so it is far from the biggest.
Conclusion: the results from 1000 fetches of inverted file entries are relatively stably clustered around 0.020s (20 milliseconds), which is promising enough to pursue further (but it is still early to decide, given only tests on small data sets so far). Balancing with e.g. memcached could also be explored, in order to get the average fetch time even lower.
Preliminary Python code including timing results follows (this was run on a Fedora large EC2 instance in a US east coast data center).
Sometimes it can be useful to compile Python code for Amazon’s Elastic Mapreduce into C++ and then into a binary. The motivation could be to integrate with (existing) C or C++ code, or to increase performance for CPU-intensive mapper or reducer methods. Here follows a description of how to do that:
Note: if you skip the Shedskin-related steps, this approach also works if you are looking for a way to use C or C++ mappers or reducers with Elastic Mapreduce.
Note: this approach should probably work also with Cloudera’s distribution for Hadoop.
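As a rough illustration, a streaming mapper kept within the restricted Python subset that a Python-to-C++ compiler like Shedskin handles (plain loops and string methods, no dynamic features) could look like this; map_lines and the sample input are made up for the example:

```python
def map_lines(lines):
    # emit one "word<TAB>1" pair per word, the Hadoop streaming convention;
    # only plain loops and str methods, so it stays compilable to C++
    out = []
    for line in lines:
        for word in line.split():
            out.append(word + "\t1")
    return out

# in the real mapper the input would come from sys.stdin
for pair in map_lines(["hello elastic mapreduce", "hello again"]):
    print(pair)
```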
Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox if you need help with development or parallelization of algorithms for Hadoop/Mapreduce – info@atbrox.com.