May 24
Atbrox is a startup company providing technology and services for Search and Mapreduce/Hadoop. Our background is from Google, IBM and research.
Update 2010-July-13: The word "towards" can be removed from the title of this posting as of today: Amazon just launched cluster compute instances with 10 Gigabit network bandwidth between nodes (and presents a run that enters the Top 500 list at 146th place; I estimate the run cost ~$20k).
The Top 500 list is for supercomputers what the Fortune 500 is for companies. About 80% of the systems on the list are built by either Hewlett Packard or IBM; other major supercomputing vendors on the list include Dell, Sun (Oracle), Cray and SGI. The parallel Linpack benchmark result is used as the ranking function for list position (a derived list, the Green 500, also includes power efficiency in the ranking).
Trends towards Cloud Supercomputing
To our knowledge the entire Top 500 list is currently based on physical supercomputer installations and no cloud computing configurations (i.e. virtual configurations lasting long enough to calculate the Linpack benchmark); that will probably change within a few years. There are, however, already trends towards cloud-based supercomputing (in particular within consumer internet services and pharmaceutical computations). Here are some concrete examples:
- Zynga (online casual games, e.g. Farmville and Mafia Wars)
Zynga uses 12000 Amazon EC2 nodes (ref: Manager of Cloud Operations at Zynga)
- Animoto (online video production service)
Animoto scaled from 40 to 4000 EC2 nodes in 3 days (ref: CTO, Animoto)
- Myspace (social network)
Myspace simulated 1 million simultaneous users using 800 large EC2 nodes (3200 cores) (ref: highscalability.com)
- New York Times
New York Times used hundreds of EC2 nodes to process their archives in 36 hours (ref: The New York Times Archives + Amazon Web Services = TimesMachine)
- Reddit (news service)
Reddit uses 218 EC2 nodes (ref: I run reddit’s servers)
Examples with (rough) estimates
- Justin.tv (video service)
In October 2009 Justin.tv users watched 50 million hours of video, and their cost (reported earlier) was about 1 penny per user-video-hour. A very rough estimate would be monthly costs of 50M * $0.01 = $500k, i.e. 12 * $500k = $6M annually. Assuming that half their costs are computational, this would be about $3M/(24*365*0.085) ~ 4029 EC2 nodes running 24×7 through the year. But since they are a video site, bandwidth is probably a significant fraction of the cost, so I cut the rough estimate in half to around 2000 EC2 nodes.
(ref: Watching TV Together, Miles Apart and Justin.tv wins funding, opens platform)
- Newsweek
Newsweek saves up to $500,000 per year by moving to the cloud. Assuming they cut their spending in half by using the cloud, that would correspond to $500,000/(24h/day*365d/y*0.085$/h) ~ 670 EC2 nodes running 24×7 through the year (probably a little less due to storage and bandwidth costs).
(ref: Newsweek.com Explores Amazon Cloud Computing)
- Recovery.gov
Recovery.gov saves up to $420,000 per year by moving to the cloud. Assuming they cut their spending in half by using the cloud, that would correspond to $420,000/(24h/day*365d/y*0.085$/h) ~ 560 EC2 nodes running 24×7 through the year (probably a little less due to storage and bandwidth costs). (ref: Feds embrace cloud computing; move Recovery.gov to Amazon EC2)
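The three rough estimates above all use the same arithmetic: divide an assumed annual compute spend by the cost of running one EC2 node for a full year at $0.085 per hour. A minimal sketch of that calculation follows; the function and variable names are made up for illustration, and the inputs are the figures quoted in the estimates above, not measured values.

```python
# Back-of-the-envelope estimate: how many EC2 nodes running 24x7
# a given annual compute budget would pay for.
HOURS_PER_YEAR = 24 * 365        # 8760 hours
EC2_HOURLY_RATE_USD = 0.085      # hourly node price used in the estimates above


def equivalent_ec2_nodes(annual_compute_spend_usd, hourly_rate_usd=EC2_HOURLY_RATE_USD):
    """Number of always-on nodes that the annual spend corresponds to."""
    return annual_compute_spend_usd / (HOURS_PER_YEAR * hourly_rate_usd)


# Justin.tv: 50M user-video-hours per month at about 1 cent each,
# with half of the yearly cost assumed to be computational.
justin_tv_monthly_cost = 50_000_000 * 0.01
justin_tv_annual_cost = 12 * justin_tv_monthly_cost
justin_tv_nodes = equivalent_ec2_nodes(justin_tv_annual_cost / 2)

# Newsweek and Recovery.gov: cloud spend assumed equal to the reported savings.
newsweek_nodes = equivalent_ec2_nodes(500_000)
recovery_gov_nodes = equivalent_ec2_nodes(420_000)

for name, nodes in [("Justin.tv", justin_tv_nodes),
                    ("Newsweek", newsweek_nodes),
                    ("Recovery.gov", recovery_gov_nodes)]:
    print(f"{name}: roughly {nodes:.0f} EC2 nodes")
```

The results are only as good as the assumptions (a single flat hourly rate, no storage or bandwidth costs), which is why the Justin.tv figure is halved again in the text.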
Other examples of Cloud Supercomputing
- Pharmaceutical companies Eli Lilly, Johnson & Johnson and Genentech
Offloading computations to the cloud (ref: Biotech HPC in the Cloud and The new computing pioneers)
- Pathwork Diagnostics
Using EC2 for cancer diagnostics (ref: Of Unknown Origin: Diagnosing Cancer in the Cloud)
Best regards,
Amund Tveit
Tagged with: amazon • animoto • cray • dell • ec2 • eli lilly • genentech • hadoop • ibm • johnson&johnson • justin.tv • mapreduce • microsoft • mpi • oracle • rackspace • sun • supercomputing • zynga
May 08
Atbrox is a startup company providing technology and services for Search and Mapreduce/Hadoop. Our background is from Google, IBM and research. Contact us if you need help with algorithms for mapreduce.
This posting is the May 2010 update to the similar posting from February 2010, with 30 new papers compared to the prior posting; new ones are marked with *.
Motivation
Learn from academic literature about how the mapreduce parallel model and its hadoop implementation are used to solve algorithmic problems.
Which areas do the papers cover?
Who wrote the above papers?
Companies: China Mobile, eBay, Google, Hewlett Packard and Intel, Microsoft, Wikipedia, Yahoo and Yandex.
Government Institutions and Universities: US National Security Agency (NSA), Carnegie Mellon University, TU Dresden, University of Pennsylvania, University of Central Florida, National University of Ireland, University of Missouri, University of Arizona, University of Glasgow, Berkeley University and National Tsing Hua University, University of California, Poznan University, Florida International University, Zhejiang University, Texas A&M University, University of California at Irvine, University of Illinois, Chinese Academy of Sciences, Vrije Universiteit, Engenharia University, State University of New York, Palacky University, University of Texas at Dallas

Best regards,
Amund Tveit (Atbrox co-founder)
Tagged with: google • hadoop • machinelearning • mapreduce • yahoo
Feb 28
Yahoo recently announced the Learning to Rank Challenge – a pretty interesting web search challenge (as the somewhat similar Netflix Prize Challenge also was).
Data and Problem
The data sets contain (by my interpretation), per line:
- url – implicitly encoded as line number in the data set file
- relevance – low number=high relevance and vice versa
- query – represented as an id
- features – up to several hundreds
and the problem is to find a function that gives relevance numbers per url per query id.
Initial Observation
In dataset 1 there are ~473k URLs and ~19k queries. At first I thought this meant that there are on average 473/19 ~ 24 relevance numbers for each query (see actual distribution of counts in figure below), i.e. corresponding to search results 1 to 24, but it seems that there are several URLs per unique query that have the same relevance (e.g. URLx and URLy can both have relevance 2 for queryZ). The paper Learning to Rank with Ties seems potentially relevant for dealing with this.

Multiple URLs that share a relevance for a unique query can perhaps be due to:
- similar/duplicate content between the URLs?
- a frequent query (due to sampling of examples?)
- uncertainty about which URL to select for a particular relevance and query?
- there is a tie, i.e. they are equally relevant
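The tie observation above can be checked by counting URLs per query and per (query, relevance) pair. The sketch below assumes the data has already been parsed into (query id, url id, relevance) tuples; that tuple layout, the toy records and the function names are made up for illustration and are not the challenge's actual file format.

```python
from collections import Counter

# Hypothetical toy records: (query_id, url_id, relevance).
records = [
    ("q1", "u1", 1), ("q1", "u2", 2), ("q1", "u3", 2),
    ("q2", "u4", 1), ("q2", "u5", 1), ("q2", "u6", 1), ("q2", "u7", 3),
]


def urls_per_query(records):
    """How many URLs (lines) each query has."""
    return Counter(query for query, _, _ in records)


def urls_per_level(records):
    """How many URLs share each (query, relevance) pair."""
    return Counter((query, relevance) for query, _, relevance in records)


def tied_levels(records):
    """Only the (query, relevance) pairs shared by more than one URL."""
    return {key: n for key, n in urls_per_level(records).items() if n > 1}


per_query = urls_per_query(records)
average = sum(per_query.values()) / len(per_query)
print("average URLs per query:", average)
print("tied relevance levels:", tied_levels(records))
```

On the real data the same counts would show how much of the ~24 URLs per query is spread over distinct relevance levels and how much is ties.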
Potential classification approach?
From a classification perspective there are several (perhaps naive?) approaches that could be tried out:
- Use relevance levels as classes (nominal regression) and use a multiclass-classifier
- Train classifier as binary competition within query, i.e. relevance 1 against 2, 3, .., and relevance n against n+1, .. (probably get some sparsity problems due to this)
- Binary competition across queries. This is problematic because a relevance of 4 for one query could be more relevant than a relevance of 1 for another query (and there is no easy way to determine that directly from the data). But if the observation about multiple URLs per relevance level per query (see above) is caused by uncertainty, one could perhaps use 1/(number of URLs per relevance level per query) as a weight to either:
- support training across queries, e.g. a URL that is the only one with relevance 1 for its query is a better example than a URL from another query where 37 URLs share relevance 1; this approach could perhaps be used with regression. The problem is comparing across different relevance levels, e.g. is a relevance 2 for a query with 1 URL more confident than a relevance 1 for a query with 37 URLs?
- use a classifier that supports weighting examples, together with the first or second approach above.
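The weight suggested above, 1/(number of URLs per relevance level per query), is simple to compute once the ties are counted. This is a minimal sketch under the same assumption of (query id, url id, relevance) tuples; the toy records and the name `tie_weights` are hypothetical, and whether such weights actually help is untested here.

```python
from collections import Counter

# Hypothetical toy records: (query_id, url_id, relevance).
records = [
    ("q1", "u1", 1), ("q1", "u2", 2), ("q1", "u3", 2),
    ("q2", "u4", 1), ("q2", "u5", 1), ("q2", "u6", 1),
]


def tie_weights(records):
    """Map (query_id, url_id) to 1 / (URLs sharing that query's relevance level)."""
    level_counts = Counter((query, relevance) for query, _, relevance in records)
    return {
        (query, url): 1.0 / level_counts[(query, relevance)]
        for query, url, relevance in records
    }


weights = tie_weights(records)
for (query, url), weight in sorted(weights.items()):
    print(query, url, round(weight, 3))
```

A URL that is alone at its relevance level gets weight 1, while each of three tied URLs gets 1/3; the resulting values could then be handed to any learner that accepts per-example weights.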
More about ranking with machine learning?
Check out the learning to rank bibliography.
Best regards,
Amund Tveit
Tagged with: classification • machine learning • netflix • ranking • regression • relevance • search • yahoo