Atbrox is startup company providing technology and services for Search and Mapreduce/Hadoop. Our background is from Google, IBM and research.
GPU – Graphical Processing Unit like the NVIDIA Tesla – is fascinating hardware, in particular regarding extreme parallelism (hundreds of cores) and memory bandwidth (tens of Gigabytes/second). The main programming languages for programming GPUs are C-based OpenCL and Nvidia’s Cuda, in addition there are wrappers to those in many languages, for the following example we use Andreas Klöckner’s PyCuda for Python.
Word Count with PyCuda and MapReduce
One of the classic mapreduce examples is word frequency count (i.e. individual word frequencies), but let us start with an even simpler example – word count, i.e. how many words are there in a (potentially big) string?
In python the default approach would perhaps be to do:
wordcount = len(bigstring.split())
But assuming that you didn’t have split() or that split() was too slow, what would you do?
How to calculate word count?
If you have the string
mystring = "this is a string" you could iterate through it and count the number of spaces, e.g. with
sum([1 for c in mystring if c == ' '])
(notice the one-off error), and perhaps split it up and parallelize it somehow. However, if there are several spaces in a row in the string this algorithm will fail, and it doesn’t use the GPU horsepower.
The MapReduce approach
Assuming you still have
mystring = "this is a string", try to align the string almost with itself, i.e. have one string being all characters in
mystring except the last –
"this is a strin" == mystring[:-1] (called prefix from here), and another string with all characters in
mystring except the first –
"his is a string" == mystring[1:] (called suffix from here), and align those two like this:
this is a strin # prefix his is a string # suffix
you can see that counting all occurences of when the character in the upper string (prefix) is whitespace and the corresponding character in the lower string (suffix) is non-white will give the correct count of words (with the same one-off as above that can be fixed by checking that first character is non-whitespace). This way of counting also deals with multiple spaces in a row (as the above one doesn’t). This can be expressed in Python with
mystring = "this is a string" prefix = mystring[:-1] suffix = mystring[1:] mapoutput = map(lambda x,y: (x == ' ')*(y != ' '), prefix, suffix) reduceoutput = reduce(lambda x,y: x+y, mapoutput) sum = reduceoutput + (mystring != ' ') # fix one off-error
Mapreduce with PyCuda
PyCuda supports using python and numpy library with Cuda, and it also has library to support mapreduce type calls on data structures loaded to the GPU (typically arrays), under is my complete code for calculating word count with PyCuda, I used the complete works by Shakespeare as test dataset (downloaded as Plain text) and replicated it hundred times so in total 493820800 bytes (~1/2 Gigabyte) that I uploaded to our Nvidia Tesla C1060 GPU and run word count on (the results were compared with unix command line wc and len(dataset.split()) for smaller datasets).
import pycuda.autoinit import numpy from pycuda import gpuarray, reduction import time def createCudaWordCountKernel(): initvalue = "0" mapper = "(a[i] == 32)*(b[i] != 32)" # 32 is ascii code for whitespace reducer = "a+b" cudafunctionarguments = "char* a, char* b" wordcountkernel = reduction.ReductionKernel(numpy.float32, neutral = initvalue, reduce_expr=reducer, map_expr = mapper, arguments = cudafunctionarguments) return wordcountkernel def createBigDataset(filename): print "reading data" dataset = file(filename).read() print "creating a big dataset" words = " ".join(dataset.split()) # in order to get rid of \t and \n chars = [ord(x) for x in words] bigdataset =  for k in range(100): bigdataset += chars print "dataset size = ", len(bigdataset) print "creating numpy array of dataset" bignumpyarray = numpy.array( bigdataset, dtype=numpy.uint8) return bignumpyarray def wordCount(wordcountkernel, bignumpyarray): print "uploading array to gpu" gpudataset = gpuarray.to_gpu(bignumpyarray) datasetsize = len(bignumpyarray) start = time.time() wordcount = wordcountkernel(gpudataset[:-1],gpudataset[1:]).get() stop = time.time() seconds = (stop-start) estimatepersecond = (datasetsize/seconds)/(1024*1024*1024) print "word count took ", seconds*1000, " milliseconds" print "estimated throughput ", estimatepersecond, " Gigabytes/s" return wordcount if __name__ == "__main__": bignumpyarray = createBigDataset("dataset.txt") wordcountkernel = createCudaWordCountKernel() wordcount = wordCount(wordcountkernel, bignumpyarray)
python wordcount_pycuda.py reading data creating a big dataset, about 1/2 GB of Shakespeare text dataset size = 493820800 creating numpy array of dataset uploading array to gpu word count took 38.4578704834 milliseconds estimated throughput 11.9587084015 Gigabytes/s (95.67 Gigabit/s) word count = 89988104.0
There are plenty of improvement opportunities, in particular fixing the creation of numpy array –
bignumpyarray = numpy.array( bigdataset, dtype=numpy.uint8) – which took almost all of the total time.
It is also interesting to notice that this approach doesn’t gain from using combiners like in Hadoop/Mapreduce (a combiner is basically a reducer that sits on the tail of the mapper and creates partial results in the case of associative and commutative reducer methods, it can for all practical purposes be compared to an afterburner on a jet motor).
Amund Tveit (Atbrox co-founder)
Could you add the performance comparision with dataset.split() and wc -w? I can’t find it in this article.
Hello! Thank you for very interesting post.
If the creation of numpy array takes almost all of the total time one could use built-in function numpy.fromfile(filename, dtype=numpy.int8) – it’s very fast.
Would be interested in benchmark numbers, I am wondering if the cuda version is any faster than the python builtin map/reduce code. Although I understand this blog post is not about absolute performance…
sinx and Giorgi: I’m mainly interested in ram – gpu cooperation, so I probably won’t add benchmarks involving disk (but feel free to create follow-up posts with benchmarks)
Pingback: Map Reduce Now. Right Now. | Infochimps Blog
Pingback: [imported]cudaでmapreduce Feb 24th, 2011 7:39pm « nashibao blog
Pingback: Data-Parallel Programming with Metal and Swift for iPhone/iPad GPU | Memkite
Pingback: Data-Parallel Programming with Metal and Swift for iPhone/iPad GPU | Amund Tveit's Blog
Out of curiosity, do you have any advice about the integration of PyCUDA with Hadoop? (The Hadoop Streaming tool would have to be involved).