80legs is a company specializing in the crawling and preprocessing part of search: you upload your seed URLs (where crawling starts), configure your crawl job (depth, domain restrictions, etc.), and optionally run existing or custom analysis code (uploaded as Java jar files) on the fetched pages. When you upload seed files, 80legs filters them before crawling starts (e.g. dropping seed URLs that are not well-formed), and it also handles domain throttling and robots.txt (and possibly more).
Computational model: Since you can run custom code per page, 80legs can be seen as the mapper part of a MapReduce (Hadoop) job (one map() call per page); for reduce-type processing (aggregating over several pages) you need to move your data elsewhere (e.g. to EC2 in the cloud). Side note: another domain with “reduce-less” MapReduce is quantum computing; check out Michael Nielsen’s Quantum Computing for Everyone.
Note: So far we have only tried the built-in functionality, not custom code.
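To make the “mapper-only” model concrete, here is a minimal sketch of a per-page callback. This is purely hypothetical and not the actual 80legs 80App API (which we have not used with custom code); it only illustrates the idea of one map()-style call per fetched page, with any cross-page aggregation done elsewhere, e.g. on EC2:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only -- NOT the actual 80legs 80App API.
// Illustrates the mapper-only model: one callback per fetched page,
// emitting key/value pairs; reduce-style aggregation over pages
// has to happen outside 80legs.
public class PerPageMapperSketch {

    /** One emitted result, analogous to a map() output pair. */
    static class Emit {
        final String key;
        final String value;
        Emit(String key, String value) { this.key = key; this.value = value; }
    }

    /** Called once per fetched page, like map() in MapReduce. */
    static List<Emit> processPage(String url, String html) {
        List<Emit> out = new ArrayList<>();
        // Example analysis: emit the page size keyed by URL.
        out.add(new Emit(url, Integer.toString(html.length())));
        return out;
    }

    public static void main(String[] args) {
        for (Emit e : processPage("http://example.com/", "<html>...</html>")) {
            System.out.println(e.key + "\t" + e.value);
        }
    }
}
```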
1) URL extraction
Job description: We used a seed of approximately 1,000 URLs and crawled and analyzed ~2.8 million pages within those domains. We used the regexp configuration, providing only the URL-matching regexp.
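For illustration, a URL-matching regexp of the kind used in such a configuration could look like the sketch below. The pattern and the fact that it is expressed in Java syntax are assumptions for illustration; it is not the exact regexp we configured in the 80legs job:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlRegexpExample {
    // Assumed example pattern matching absolute http/https URLs;
    // not the exact regexp used in our 80legs configuration.
    private static final Pattern URL_PATTERN =
            Pattern.compile("https?://[\\w.-]+(?::\\d+)?(?:/[^\\s\"'<>]*)?");

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/page?id=1\">link</a>";
        Matcher m = URL_PATTERN.matcher(html);
        while (m.find()) {
            System.out.println(m.group()); // prints each matched URL
        }
    }
}
```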
Result: Approximately 1 billion URLs were found. The results arrived in 106 zip files (each ~14MB packed, ~100MB unpacked), in addition to zip files listing the URLs that were crawled.
Note: Based on a few smaller similar jobs, it looks like the parallelism of 80legs depends somewhat on the number of domains in the crawl, and perhaps also on their ordering. If your seed set contains more than one URL per domain, it can be useful to randomize the seed URL file before uploading and running the crawl job, e.g. with rl or coreutils’ shuf (or in code, as sketched below).
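If you prefer doing the randomization in code rather than with rl or shuf, a small Java sketch (the file names are placeholders):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

public class ShuffleSeeds {
    // Roughly equivalent to `shuf seeds.txt > seeds_shuffled.txt`;
    // file names are placeholders.
    public static void main(String[] args) throws IOException {
        List<String> urls = Files.readAllLines(Paths.get("seeds.txt"));
        Collections.shuffle(urls);
        Files.write(Paths.get("seeds_shuffled.txt"), urls);
    }
}
```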
2) Fetching pages
Job description: We built a set of ~80k URLs that we wanted to fetch as HTML (using the sample application called 80App Get Raw HTML) for further processing. The URLs were split into 4 jobs of ~20k URLs each.
Result: Each job took roughly one hour, and since all jobs ran in parallel the total wall-clock time was about 1 hour. We ended up with 5 zip files per job, each with ~25MB of data (~100MB unpacked), i.e. ~4*5*100MB = 2GB of raw HTML for all jobs when unpacked.
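Splitting a URL list into equal-sized jobs like the ones above is straightforward to script; here is a sketch, where the chunk size and file names are assumptions for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class SplitIntoJobs {
    public static void main(String[] args) throws IOException {
        final int chunkSize = 20_000; // ~20k URLs per job (assumed)
        // "urls.txt" is a placeholder for the full URL list.
        List<String> urls = Files.readAllLines(Paths.get("urls.txt"));
        for (int i = 0, job = 1; i < urls.size(); i += chunkSize, job++) {
            List<String> chunk = urls.subList(i, Math.min(i + chunkSize, urls.size()));
            Files.write(Paths.get("job_" + job + ".txt"), chunk);
        }
    }
}
```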
80legs is an interesting service that has already proved useful for us, and we will continue to use it in combination with AWS and EC2. Custom code still needs to be built (e.g. for ajax crawling).
Amund Tveit, co-founder of Atbrox