atbr – large-scale in-memory hashtables (in Python)

atbr logo

Large-scale in-memory key-value stores are universally useful (e.g. to load and serve tsv-data created by hadoop/mapreduce jobs), in-memory key-value stores have low latency, and modern boxes have lots of memory (e.g. EC2 intances with 70GB RAM). If you look closely many of the nosql-stores are heavily dependent on huge amounts of RAM to perform nicely so going to pure in-memory storage is only a natural evolution.

Scratching the itch
Python is currently undergoing a “new spring” with many startups using it as a key language (e.g. Dropbox, Instagram, Path, Quora to name a few prominent ones), but they have also probably discovered that loading a lot of data into python dictionaries is no fun, this is also the finding by this large-scale hashtable benchmark. The winner of that benchmark wrt memory efficiency was Google’s opensource sparsehash project, and atbr is basically a thin swig-wrapper around Google’s (memory efficient) opensource sparsehash (written in C++). Atbr also supports relatively efficient loading of tsv key value files (tab separated files) since loading mapreduce output data quickly is one of our main use cases.


a) install google sparsehash (and densehash)

tar -zxvf sparsehash-2.0.2.tar.gz
cd sparsehash-2.0.2
./configure && make && make install

b) install swig

c) compile atbr

make # creates and ready to be used from python

python-api example

import atbr

# Create storage
mystore = atbr.Atbr()

# Load data

# Number of key value pairs
print mystore.size()

# Get value corresponding to key
print mystore.get("key1")

# Return true if a key exists
print mystore.exists("key1")

benchmark (loading)
Input for the bencmark was output from a small Hadoop (mapreduce) job that generated key, value pairs where both the key and value were json. The benchmark was done an Ubuntu-based Thinkpad x200 with SSD drive.

 $ ls -al medium.tsv
 -rw-r--r-- 1 amund amund 117362571 2012-04-25 15:36 medium.tsv
 $ wc medium.tsv
 212969   5835001 117362571 medium.tsv
 $ python
 >>> import atbr
 >>> a = atbr.Atbr()
 >>> a.load("medium.tsv")
 Inserting took - 1.178468 seconds
 Num new key-value pairs = 212969
 Speed: 180716.807959 key-value pairs per second
 Throughput: 94.803214 MB per second

Possible road ahead?
1) integrate with tornado, to get websocket and http API
2) after 1) – add support for sharding, e.g. using Apache Zookeeper to control the shards.

Where can I find the code?

Best regards,

Amund Tveit

This entry was posted in cloud computing and tagged , , , . Bookmark the permalink.

5 Responses to atbr – large-scale in-memory hashtables (in Python)

  1. Pingback: atbr now has Apache Thrift support