
Large-scale in-memory key-value stores are broadly useful, e.g. for loading and serving tsv data created by Hadoop/MapReduce jobs. In-memory key-value stores have low latency, and modern machines have plenty of memory (e.g. EC2 instances with 70GB of RAM). If you look closely, many NoSQL stores already depend on huge amounts of RAM to perform well, so moving to pure in-memory storage is a natural evolution.
Scratching the itch
Python is currently enjoying a “new spring”, with many startups using it as a key language (e.g. Dropbox, Instagram, Path, and Quora, to name a few prominent ones), but they have probably also discovered that loading a lot of data into Python dictionaries is no fun. This is also the finding of this large-scale hashtable benchmark. The winner of that benchmark with respect to memory efficiency was Google’s open-source sparsehash project (written in C++), and atbr is basically a thin SWIG wrapper around it. Atbr also supports relatively efficient loading of tsv key-value files (tab-separated files), since loading MapReduce output data quickly is one of our main use cases.
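For reference, the plain-dict approach that atbr is meant to improve on (memory-wise) looks roughly like the sketch below; the filename and the one-pair-per-line tab-separated format are illustrative. Much of the memory overhead here comes from every key and value becoming a separate Python string object, on top of the dict itself.

# Baseline for comparison: load a tab-separated key/value file into a plain
# Python dict. Filename and format are illustrative.
plain_store = {}
with open("keyvaluedata.tsv") as f:
    for line in f:
        key, value = line.rstrip("\n").split("\t", 1)
        plain_store[key] = value

print len(plain_store)          # number of key-value pairs
print plain_store.get("key1")   # value for "key1", or None if missing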
prerequisites:
a) install google sparsehash (and densehash)
wget http://sparsehash.googlecode.com/files/sparsehash-2.0.2.tar.gz
tar -zxvf sparsehash-2.0.2.tar.gz
cd sparsehash-2.0.2
./configure && make && make install
b) install swig
c) compile atbr
make # creates _atbr.so and atbr.py ready to be used from python
python-api example
import atbr
# Create storage
mystore = atbr.Atbr()
# Load data
mystore.load("keyvaluedata.tsv")
# Number of key value pairs
print mystore.size()
# Get value corresponding to key
print mystore.get("key1")
# Return true if a key exists
print mystore.exists("key1")
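The load() call expects a tab-separated file with one key-value pair per line (the exact format is defined by atbr's loader; the file below is just a minimal illustration so the example above can be run end to end):

# Create a tiny tab-separated key/value file for the example above.
# Contents are illustrative: one "key<TAB>value" pair per line.
with open("keyvaluedata.tsv", "w") as f:
    f.write("key1\tvalue1\n")
    f.write("key2\tvalue2\n")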
benchmark (loading)
Input for the benchmark was output from a small Hadoop (MapReduce) job that generated key-value pairs where both the key and the value were JSON. The benchmark was run on an Ubuntu-based Thinkpad X200 with an SSD drive.
$ ls -al medium.tsv
-rw-r--r-- 1 amund amund 117362571 2012-04-25 15:36 medium.tsv
$ wc medium.tsv
  212969  5835001 117362571 medium.tsv
$ python
>>> import atbr
>>> a = atbr.Atbr()
>>> a.load("medium.tsv")
Inserting took - 1.178468 seconds
Num new key-value pairs = 212969
Speed: 180716.807959 key-value pairs per second
Throughput: 94.803214 MB per second
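As a rough sanity check, the reported rates follow from the line count, file size and load time above (assuming the MB figure means 2**20 bytes): the pairs-per-second figure matches exactly, and the throughput figure lands within about 0.2 MB/s of the reported value.

# Rough arithmetic check of the reported benchmark numbers.
pairs = 212969            # lines in medium.tsv (from wc)
size_bytes = 117362571    # file size in bytes (from ls)
seconds = 1.178468        # load time reported by atbr

print pairs / seconds               # ~180716.8 key-value pairs per second
print size_bytes / seconds / 2**20  # ~95.0 MB per second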
Possible road ahead?
1) Integrate with Tornado to get a WebSocket and HTTP API (see the sketch below).
2) After 1), add support for sharding, e.g. using Apache ZooKeeper to control the shards.
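A minimal sketch of what item 1 could look like, using Tornado's standard web framework: the URL layout, port and input file are made up for illustration, and only the atbr calls shown in the API example above (load, get, exists) are used. This is not part of atbr.

# Sketch of a possible Tornado front-end for an atbr store (illustrative only).
# GET /get/<key> returns the value for <key>, or 404 if the key is missing.
import tornado.ioloop
import tornado.web

import atbr

store = atbr.Atbr()
store.load("keyvaluedata.tsv")  # illustrative input file

class GetHandler(tornado.web.RequestHandler):
    def get(self, key):
        if store.exists(key):
            self.write(store.get(key))
        else:
            self.set_status(404)

application = tornado.web.Application([
    (r"/get/(.*)", GetHandler),
])

if __name__ == "__main__":
    application.listen(8888)
    tornado.ioloop.IOLoop.instance().start()

Such a process could then be queried with e.g. curl http://localhost:8888/get/key1, and item 2) would amount to running several of these behind a router, with ZooKeeper controlling the shards.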
Where can I find the code?
https://github.com/atbrox/atbr
Best regards,
Amund Tveit
Atbrox