Apr 25


Large-scale in-memory key-value stores are universally useful, e.g. to load and serve tsv data created by Hadoop/MapReduce jobs: in-memory key-value stores have low latency, and modern boxes have plenty of memory (e.g. EC2 instances with 70GB RAM). If you look closely, many of the NoSQL stores depend heavily on large amounts of RAM to perform well, so going to pure in-memory storage is only a natural evolution.

Scratching the itch
Python is currently undergoing a "new spring", with many startups using it as a key language (e.g. Dropbox, Instagram, Path and Quora, to name a few prominent ones), but they have probably also discovered that loading a lot of data into Python dictionaries is no fun; this is also the finding of this large-scale hashtable benchmark. The winner of that benchmark with respect to memory efficiency was Google's open-source sparsehash project, and atbr is basically a thin swig wrapper around Google's (memory-efficient) open-source sparsehash (written in C++). Atbr also supports relatively efficient loading of tsv key-value files (tab-separated files), since loading mapreduce output data quickly is one of our main use cases.
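To make the input format concrete, here is a small, hedged sketch of the kind of tab-separated file atbr's load() is meant for: one key and one value per line, separated by a tab. The filename and sample pairs below are made up for illustration.

# write a tiny tab-separated key/value file (sample data is made up)
with open("sample.tsv", "w") as f:
    f.write("key1\tvalue1\n")
    f.write("key2\tvalue2\n")

import atbr
store = atbr.Atbr()
store.load("sample.tsv")  # bulk-load the tsv file
print store.get("key1")   # prints "value1"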


how to install

a) install google sparsehash (and densehash)

wget http://sparsehash.googlecode.com/files/sparsehash-2.0.2.tar.gz
tar -zxvf sparsehash-2.0.2.tar.gz
cd sparsehash-2.0.2
./configure && make && make install

b) install swig (e.g. sudo apt-get install swig on Ubuntu/Debian)

c) compile atbr

make # creates _atbr.so and atbr.py ready to be used from python

python-api example

import atbr

# Create storage
mystore = atbr.Atbr()

# Load data from a tab-separated key/value file (the filename is just an example)
mystore.load("keyvaluedata.tsv")

# Number of key value pairs
print mystore.size()

# Get value corresponding to key
print mystore.get("key1")

# Return true if a key exists
print mystore.exists("key1")

benchmark (loading)
Input for the benchmark was the output of a small Hadoop (mapreduce) job that generated key-value pairs where both the key and the value were JSON. The benchmark was run on an Ubuntu-based Thinkpad X200 with an SSD drive.

 $ ls -al medium.tsv
 -rw-r--r-- 1 amund amund 117362571 2012-04-25 15:36 medium.tsv
 $ wc medium.tsv
 212969   5835001 117362571 medium.tsv
 $ python
 >>> import atbr
 >>> a = atbr.Atbr()
 >>> a.load("medium.tsv")
 Inserting took - 1.178468 seconds
 Num new key-value pairs = 212969
 Speed: 180716.807959 key-value pairs per second
 Throughput: 94.803214 MB per second

Possible road ahead?
1) integrate with Tornado, to get a websocket and HTTP API (a rough sketch of this is shown below)
2) after 1), add support for sharding, e.g. using Apache ZooKeeper to control the shards.
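As a rough, hedged sketch of what 1) could look like (this is not part of atbr today; the handler, URL layout and data file name are assumptions), a minimal Tornado HTTP front-end might be:

import atbr
import tornado.ioloop
import tornado.web

# load the store once at startup (hypothetical tsv file)
store = atbr.Atbr()
store.load("data.tsv")

class GetHandler(tornado.web.RequestHandler):
    def get(self, key):
        # return the value over HTTP, or 404 if the key is unknown
        if store.exists(key):
            self.write(store.get(key))
        else:
            self.set_status(404)

application = tornado.web.Application([
    (r"/get/(.*)", GetHandler),
])

if __name__ == "__main__":
    application.listen(8888)
    tornado.ioloop.IOLoop.instance().start()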

Where can I find the code?


Best regards,

Amund Tveit
