Predicting Startup Performance with Syllables

1. How to name your startup?
It seems to me that names with few syllables (typically two) usually perform much better than others (with several exceptions, though). My personal theory is that it comes down to rhythm and pronunciation energy.

2. Examples of company names by syllable count:
1 syllable: Bing, Ask, Xing, Ning, Dell, Skype, Slide, Yelp, Ford
2 syllables: Google, Yahoo, Ebay, Paypal, Facebook, Quora, Zynga, Youtube, Baidu, Blogger, WordPress, Twitter, LinkedIn, Craigslist, Flickr, Apple, CNet, Tumblr, TwitPic, Reddit, Netflix, SourceForge, Techcrunch, Hulu, bit.ly, Scribd*, Tesla, Samsung, DropBox,  AdGrok, Brushes, FanVibe, Gantto, GazeHawk, HipMunk, OhLife, TeeVox
3 syllables: Amazon, LiveJournal, GoDaddy, Mozilla, Mashable, Toyota, Microsoft
4 syllables: Hewlett Packard, Mitsubishi, StumbleUpon
5 syllables: Wikipedia

3. Impact on naming for investors?
I believe naming is so important that even (super) angels, venture capitalists and other investors should weigh it heavily when investing in startups (just take a look at your existing portfolio through syllable glasses). This belief is backed by the Alexa top 500 list[1]. See also section 5.

4. How to find a name?
My recommendation is to run a mapreduce job where each mapper creates a huge number of random words or permutations of characters, and to write a scoring function in the reducer that scores up words based on syllable count, vowel density and pronounceability, and scores down existing names or domain names (seed it with a list of brand names) and other unwanted words.
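As a rough, local illustration of that generate-and-score idea (not an actual mapreduce job), here is a minimal Python sketch; the seed brand list, the vowel-group syllable heuristic and the weights are all assumptions:

import random
import re
import string

EXISTING_BRANDS = {"google", "yahoo", "facebook"}  # hypothetical seed list of names to score down

def random_candidate(length=6):
    # "mapper" side: generate one random candidate name
    return "".join(random.choice(string.ascii_lowercase) for _ in range(length))

def syllable_count(word):
    # rough syllable estimate: count groups of consecutive vowels
    return max(1, len(re.findall(r"[aeiouy]+", word)))

def score(word):
    # "reducer" side: reward two syllables and vowel density, punish existing names
    s = 3.0 if syllable_count(word) == 2 else 0.0
    s += sum(c in "aeiouy" for c in word) / float(len(word))  # vowel density
    if word in EXISTING_BRANDS:
        s -= 10.0
    return s

candidates = set(random_candidate() for _ in range(100000))
print(sorted(candidates, key=score, reverse=True)[:10])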

5. Predictions about the last batch of Y Combinator companies[2]
a) The predicted 8 best investments from the last YC Demo Day (i.e. the companies with two-syllable names):
AdGrok – online marketing for the  masses
Brushes – premiere illustrations with iPad
FanVibe – sports social network
Gantto – project management service
GazeHawk – eye tracking for everyone
Hipmunk – flight search
OhLife – personal journal
Teevox – turns mobile devices into remotes for the Internet

6. Conclusion
It will be interesting to see how the 8 Y Combinator startups above – viewed as a fund – perform, e.g. relative to the famous new angel funds[3,4]. Perhaps a syllable-based index fund is worth a thought?

[1] http://www.alexa.com/topsites
[2] http://techcrunch.com/2010/08/24/y-combinator-demo-day-2/
[3] 500 Startups
[4] Felicis Ventures
(* unsure about pronunciation of Scribd)

Atbrox on LinkedIn

Best regards,

Amund Tveit, co-founder of Atbrox


Recommended Mapreduce Workshop

If you are interested in Hadoop or Mapreduce, I would like to recommend participating in or submitting a paper to the First International Workshop on Theory and Practice of Mapreduce (MAPRED'2010), held in conjunction with the 2nd IEEE International Conference on Cloud Computing Technology and Science.

(I just joined the workshop as a program committee member)

Best regards,

Amund Tveit (co-founder of Atbrox)


Word Count with MapReduce on a GPU – A Python Example

Atbrox is a startup company providing technology and services for Search and Mapreduce/Hadoop. Our background is from Google, IBM and research.

The GPU – Graphics Processing Unit, e.g. the NVIDIA Tesla – is fascinating hardware, in particular because of its extreme parallelism (hundreds of cores) and memory bandwidth (tens of gigabytes per second). The main languages for programming GPUs are the C-based OpenCL and Nvidia’s Cuda; in addition there are wrappers for those in many languages. For the following example we use Andreas Klöckner’s PyCuda for Python.

Word Count with PyCuda and MapReduce

One of the classic mapreduce examples is word frequency counting (i.e. how often each individual word occurs), but let us start with an even simpler example – word count, i.e. how many words are there in a (potentially big) string?

In python the default approach would perhaps be to do:

wordcount = len(bigstring.split())

But assuming that you didn’t have split() or that split() was too slow, what would you do?

How to calculate word count?
If you have the string mystring = "this is a string" you could iterate through it and count the number of spaces, e.g. with

sum([1 for c in mystring if c == ' '])

(notice the off-by-one error: the number of spaces is one less than the number of words), and perhaps split it up and parallelize it somehow. However, if there are several spaces in a row this algorithm will over-count, and it doesn’t use the GPU horsepower.
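A quick illustration of that failure mode, with a hypothetical input containing a double space:

mystring = "this  is a string"  # note the double space - still only 4 words
print(sum([1 for c in mystring if c == ' ']) + 1)  # prints 5, which is wrong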

The MapReduce approach
Assuming you still have mystring = "this is a string", try to align the string almost with itself: take one string with all characters of mystring except the last – "this is a strin" == mystring[:-1] (called prefix from here) – and another string with all characters of mystring except the first – "his is a string" == mystring[1:] (called suffix from here) – and align the two like this:

this is a strin # prefix
his is a string # suffix

Counting all occurrences where the character in the upper string (prefix) is a space and the corresponding character in the lower string (suffix) is not gives the correct word count (with the same off-by-one as above, which can be fixed by checking that the first character is non-whitespace). This way of counting also handles multiple spaces in a row (which the approach above does not). It can be expressed in Python with map() and reduce() as:

mystring = "this is a string"
prefix = mystring[:-1]
suffix = mystring[1:]
mapoutput = map(lambda x,y: (x == ' ')*(y != ' '), prefix, suffix)
reduceoutput = reduce(lambda x,y: x+y, mapoutput)
wordcount = reduceoutput + (mystring[0] != ' ') # fix the off-by-one error
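
As a quick sanity check (my addition; in Python 3 reduce lives in functools), the same prefix/suffix counting gives the right answer on a string with a repeated space:

from functools import reduce  # needed in Python 3, harmless in Python 2.6+
mystring = "this  is a string"  # double space, still 4 words
prefix, suffix = mystring[:-1], mystring[1:]
mapoutput = map(lambda x, y: (x == ' ') * (y != ' '), prefix, suffix)
print(reduce(lambda x, y: x + y, mapoutput) + (mystring[0] != ' '))  # prints 4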

Mapreduce with PyCuda

PyCuda supports using Python and the numpy library with Cuda, and it also has a library for mapreduce-type calls on data structures loaded to the GPU (typically arrays). Below is my complete code for calculating word count with PyCuda. I used the complete works of Shakespeare as the test dataset (downloaded as plain text) and replicated it a hundred times, giving 493820800 bytes (~1/2 gigabyte) in total, which I uploaded to our Nvidia Tesla C1060 GPU and ran word count on (the results were compared with the unix command line wc and len(dataset.split()) for smaller datasets).

import pycuda.autoinit
import numpy
from pycuda import gpuarray, reduction
import time

def createCudaWordCountKernel():
    initvalue = "0"
    mapper = "(a[i] == 32)*(b[i] != 32)" # 32 is the ascii code for space
    reducer = "a+b"
    cudafunctionarguments = "char* a, char* b"
    wordcountkernel = reduction.ReductionKernel(numpy.float32, neutral = initvalue, 
                                            reduce_expr=reducer, map_expr = mapper,
                                            arguments = cudafunctionarguments)
    return wordcountkernel

def createBigDataset(filename):
    print "reading data"
    dataset = open(filename).read()
    print "creating a big dataset"
    words = " ".join(dataset.split()) # in order to get rid of \t and \n
    chars = [ord(x) for x in words]
    bigdataset = []
    for k in range(100):
        bigdataset += chars
    print "dataset size = ", len(bigdataset)
    print "creating numpy array of dataset"
    bignumpyarray = numpy.array( bigdataset, dtype=numpy.uint8)
    return bignumpyarray

def wordCount(wordcountkernel, bignumpyarray):
    print "uploading array to gpu"
    gpudataset = gpuarray.to_gpu(bignumpyarray)
    datasetsize = len(bignumpyarray)
    start = time.time()
    wordcount = wordcountkernel(gpudataset[:-1], gpudataset[1:]).get() # prefix and suffix views, as in the pure-Python version
    stop = time.time()
    seconds = (stop-start)
    estimatepersecond = (datasetsize/seconds)/(1024*1024*1024)
    print "word count took ", seconds*1000, " milliseconds"
    print "estimated throughput ", estimatepersecond, " Gigabytes/s"
    return wordcount

if __name__ == "__main__":
    bignumpyarray = createBigDataset("dataset.txt")
    wordcountkernel = createCudaWordCountKernel()
    wordcount = wordCount(wordcountkernel, bignumpyarray)
    print "word count = ", wordcount

Results

python wordcount_pycuda.py 
reading data
creating a big dataset, about 1/2 GB of Shakespeare text
dataset size =  493820800
creating numpy array of dataset
uploading array to gpu
word count took  38.4578704834  milliseconds
estimated throughput  11.9587084015  Gigabytes/s (95.67 Gigabit/s)
word count =  89988104.0

Improvement Opportunities?
There are plenty of improvement opportunities, in particular fixing the creation of the numpy array – bignumpyarray = numpy.array( bigdataset, dtype=numpy.uint8) – which took almost all of the total time.
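
One possible fix, sketched below and not timed against the run above, is to avoid building a Python list of integers and instead let numpy construct and replicate the byte array directly (assuming the input is plain ASCII text):

import numpy

def createBigDatasetFast(filename, copies=100):
    # read the file and collapse \t and \n to single spaces, as before
    words = " ".join(open(filename).read().split())
    # view the string directly as bytes instead of building a Python list of ints
    chars = numpy.frombuffer(words.encode("ascii", "replace"), dtype=numpy.uint8)
    # replicate the data without a Python-level loop
    return numpy.tile(chars, copies)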

It is also interesting to notice that this approach doesn’t gain from using combiners as in Hadoop/Mapreduce (a combiner is basically a reducer that sits on the tail of the mapper and creates partial results when the reducer method is associative and commutative; for all practical purposes it can be compared to an afterburner on a jet engine).
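
To make the combiner idea concrete, here is a small self-contained Python sketch (my own illustration, not Hadoop’s API) of a word-frequency pipeline where the combiner pre-aggregates each mapper’s output before the shuffle:

from collections import Counter

def mapper(chunk):
    # emit (word, 1) for every word in this mapper's chunk of text
    return [(word, 1) for word in chunk.split()]

def combiner(pairs):
    # partial sums on the mapper side; safe because addition is associative and commutative
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())

def reducer(all_pairs):
    totals = Counter()
    for word, count in all_pairs:
        totals[word] += count
    return totals

chunks = ["to be or not to be", "to be is to do"]
combined = [pair for chunk in chunks for pair in combiner(mapper(chunk))]
print(reducer(combined))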

Atbrox on LinkedIn

Best regards,

Amund Tveit (Atbrox co-founder)


Statistics about Hadoop and Mapreduce Algorithm Papers

Below are statistics on which 20 papers (out of about 80) were most read in our 3 previous postings about mapreduce and hadoop algorithms (the postings have been read approximately 5000 times). The list is ordered by decreasing reading frequency, i.e. the most popular paper is at spot 1.

  1. MapReduce-Based Pattern Finding Algorithm Applied in Motif Detection for Prescription Compatibility Network
    authors: Yang Liu, Xiaohong Jiang, Huajun Chen , Jun Ma and Xiangyu Zhang – Zhejiang University

  2. Data-intensive text processing with Mapreduce
    authors: Jimmy Lin and Chris Dyer – University of Maryland

  3. Large-Scale Behavioral Targeting
    authors: Ye Chen (eBay), Dmitry Pavlov (Yandex Labs) and John F. Canny (University of California, Berkeley)

  4. Improving Ad Relevance in Sponsored Search
    authors: Dustin Hillard, Stefan Schroedl, Eren Manavoglu, Hema Raghavan and Chris Leggetter (Yahoo Labs)

  5. Experiences on Processing Spatial Data with MapReduce
    authors: Ariel Cary, Zhengguo Sun, Vagelis Hristidis and Naphtali Rishe – Florida International University

  6. Extracting user profiles from large scale data
    authors: Michal Shmueli-Scheuer, Haggai Roitman, David Carmel, Yosi Mass and David Konopnicki – IBM Research, Haifa

  7. Predicting the Click-Through Rate for Rare/New Ads
    authors: Kushal Dave and Vasudeva Varma – IIIT Hyderabad

  8. Parallel K-Means Clustering Based on MapReduce
    authors: Weizhong Zhao, Huifang Ma and Qing He – Chinese Academy of Sciences

  9. Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
    authors: Mohammad Farhan Husain, Pankil Doshi, Latifur Khan and Bhavani Thuraisingham – University of Texas at Dallas

  10. Map-Reduce Meets Wider Varieties of Applications
    authors: Shimin Chen and Steven W. Schlosser – Intel Research

  11. LogMaster: Mining Event Correlations in Logs of Large-scale Cluster Systems
    authors: Wei Zhou, Jianfeng Zhan, Dan Meng (Chinese Academy of Sciences), Dongyan Xu (Purdue University) and Zhihong Zhang (China Mobile Research)

  12. Efficient Clustering of Web-Derived Data Sets
    authors: Luís Sarmento, Eugenio Oliveira (University of Porto), Alexander P. Kehlenbeck (Google), Lyle Ungar (University of Pennsylvania)

  13. A novel approach to multiple sequence alignment using hadoop data grids
    authors: G. Sudha Sadasivam and G. Baktavatchalam – PSG College of Technology

  14. Web-Scale Distributional Similarity and Entity Set Expansion
    authors: Patrick Pantel, Eric Crestan, Ana-Maria Popescu, Vishnu Vyas (Yahoo Labs) and Arkady Borkovsky (Yandex Labs)

  15. Grammar based statistical MT on Hadoop
    authors: Ashish Venugopal and Andreas Zollmann (Carnegie Mellon University)

  16. Distributed Algorithms for Topic Models
    authors: David Newman, Arthur Asuncion, Padhraic Smyth and Max Welling – University of California, Irvine

  17. Parallel algorithms for mining large-scale rich-media data
    authors: Edward Y. Chang, Hongjie Bai and Kaihua Zhu – Google Research

  18. Learning Influence Probabilities In Social Networks
    authors: Amit Goyal, Laks V. S. Lakshmanan (University of British Columbia) and Francesco Bonchi (Yahoo! Research)

  19. MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees
    authors: Suzanne J Matthews and Tiffani L Williams – Texas A&M University

  20. User-Based Collaborative-Filtering Recommendation Algorithms on Hadoop
    authors: Zhi-Dan Zhao and Ming-sheng Shang

Atbrox on LinkedIn

Best regards,

Amund Tveit (Atbrox co-founder)


Towards Cloud Supercomputing

Atbrox is a startup company providing technology and services for Search and Mapreduce/Hadoop. Our background is from Google, IBM and research.

Update 2010-Nov-15: Amazon cluster compute instances enter 231st place on the Top 500 supercomputing list.

Update 2010-Jul-13: We can remove "towards" from the title of this posting today: Amazon just launched cluster compute instances with 10 Gigabit network bandwidth between nodes (and presented a run that enters the Top 500 list at 146th place; I estimate the run to have cost ~$20k).

The Top 500 list is for supercomputers what the Fortune 500 is for companies. About 80% of the list consists of supercomputers built by either Hewlett Packard or IBM; other major supercomputing vendors on the list include Dell, Sun (Oracle), Cray and SGI. The parallel Linpack benchmark result is used as the ranking function for the list position (a derived list – the Green 500 – also includes power efficiency in the ranking).

Trends towards Cloud Supercomputing
To our knowledge the entire Top 500 list is currently based on physical supercomputer installations and no cloud computing configurations (i.e. virtual configurations lasting long enough to run the Linpack benchmark); that will probably change within a few years. There are, however, trends towards cloud-based supercomputing already (in particular within consumer internet services and pharmaceutical computations). Here are some concrete examples:

  1. Zynga (online casual games, e.g. Farmville and Mafia Wars)
    Zynga uses 12000 Amazon EC2 nodes (ref: Manager of Cloud Operations at Zynga)
  2. Animoto (online video production service)
    Animoto scaled from 40 to 4000 EC2 nodes in 3 days (ref: CTO, Animoto)
  3. Myspace (social network)
    Myspace simulated 1 million simultaneous users using 800 large EC2 nodes (3200 cores) (ref: highscalability.com)
  4. New York Times
    New York Times used hundreds of EC2 nodes to process their archives in 36 hours (ref: The New York Times Archives + Amazon Web Services = TimesMachine)
  5. Reddit (news service)
    Reddit uses 218 EC2 nodes (ref: I run reddit’s servers)

Examples with (rough) estimates

  1. Justin.tv (video service)
    In October 2009 Justin.tv users watched 50 million hours of video, and their cost (reported earlier) was about 1 cent per user-video-hour. A very rough estimate gives monthly costs of 50M * $0.01 = $500k, i.e. 12 * $500k = $6M annually. Assuming that half their costs are computational, this would be about $3M/(24*365*$0.085) ~ 4029 EC2 nodes running 24×7 through the year, but since they are a video site bandwidth is probably a significant fraction of the cost, so the rough estimate is cut in half to around 2000 EC2 nodes (see the calculation sketch after this list).
    (ref: Watching TV Together, Miles Apart and Justin.tv wins funding, opens platform)
  2. Newsweek
    Newsweek saves up to $500,000 per year by moving to the cloud. Assuming they cut their spending in half by using the cloud, that would correspond to $500,000/(24h/day*365d/y*$0.085/h) ~ 670 EC2 nodes running 24×7 through the year (probably a little less due to storage and bandwidth costs).
    (ref: Newsweek.com Explores Amazon Cloud Computing)
  3. Recovery.gov
    Recovery.gov saves up to $420,000 per year by moving to the cloud. Assuming they cut their spending in half by using the cloud, that would correspond to $420,000/(24h/day*365d/y*$0.085/h) ~ 560 EC2 nodes running 24×7 through the year (probably a little less due to storage and bandwidth costs). (ref: Feds embrace cloud computing; move Recovery.gov to Amazon EC2)
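
The back-of-the-envelope arithmetic behind these estimates, as a small Python sketch (the $0.085/hour instance rate and the halving assumptions are taken from the text above):

EC2_HOURLY_RATE = 0.085        # US$ per instance-hour, as assumed in the estimates above
HOURS_PER_YEAR = 24 * 365

def nodes_running_year_round(annual_compute_spend_usd):
    # number of EC2 nodes that could run 24x7 for a year at the given spend
    return annual_compute_spend_usd / (HOURS_PER_YEAR * EC2_HOURLY_RATE)

# Justin.tv: 50M user-video-hours/month at ~$0.01 each ~ $500k/month ~ $6M/year, half of it compute
print(nodes_running_year_round(6000000 / 2.0))  # ~4029 nodes (halved again for bandwidth -> ~2000)
# Newsweek: up to $500k/year saved, assumed equal to the remaining cloud spend
print(nodes_running_year_round(500000))         # ~670 nodes
# Recovery.gov: up to $420k/year saved
print(nodes_running_year_round(420000))         # ~560 nodes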

Other examples of Cloud Supercomputing

  1. Pharmaceutical companies Eli Lilly, Johnson & Johnson and Genentech
    Offloading computations to the cloud (ref: Biotech HPC in the Cloud and The new computing pioneers)
  2. Pathwork Diagnostics
    Using EC2 for cancer diagnostics (ref: Of Unknown Origin: Diagnosing Cancer in the Cloud)

Atbrox on LinkedIn

Best regards,

Amund Tveit, co-founder of Atbrox
