atbr – large-scale in-memory hashtables (in Python)


Large-scale in-memory key-value stores are universally useful (e.g. for loading and serving tsv data created by Hadoop/MapReduce jobs), in-memory key-value stores have low latency, and modern machines have lots of memory (e.g. EC2 instances with 70GB RAM). If you look closely, many of the NoSQL stores depend heavily on huge amounts of RAM to perform well, so going to pure in-memory storage is only a natural evolution.

Scratching the itch
Python is currently undergoing a “new spring”, with many startups using it as a key language (e.g. Dropbox, Instagram, Path and Quora, to name a few prominent ones), but they have probably also discovered that loading a lot of data into Python dictionaries is no fun; this is also the finding of this large-scale hashtable benchmark. The winner of that benchmark with respect to memory efficiency was Google’s open-source sparsehash project, and atbr is basically a thin SWIG wrapper around Google’s (memory-efficient) sparsehash (written in C++). Atbr also supports relatively efficient loading of tsv key-value files (tab-separated files), since loading MapReduce output data quickly is one of our main use cases; an example input file is shown below.
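For reference, the expected input is a plain tab-separated file with one key-value pair per line. The keys and values here are made up for illustration (in our own jobs both key and value are typically json):

key1	value1
key2	value2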

prerequisites:

a) install google sparsehash (and densehash)

wget http://sparsehash.googlecode.com/files/sparsehash-2.0.2.tar.gz
tar -zxvf sparsehash-2.0.2.tar.gz
cd sparsehash-2.0.2
./configure && make && make install

b) install swig
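On Debian/Ubuntu, swig is typically available from the package manager, e.g.

sudo apt-get install swig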

c) compile atbr

make # creates _atbr.so and atbr.py ready to be used from python

python-api example

import atbr

# Create storage
mystore = atbr.Atbr()

# Load data
mystore.load("keyvaluedata.tsv")

# Number of key value pairs
print mystore.size()

# Get value corresponding to key
print mystore.get("key1")

# Return true if a key exists
print mystore.exists("key1")
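The comments below also mention a put method for adding key-value pairs from Python; assuming it takes a key and a value (this is not shown in the example above, so treat it as a sketch), usage would look like this:

# add a key-value pair programmatically (assumed signature: put(key, value))
mystore.put("key2", "value2")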

benchmark (loading)
Input for the benchmark was output from a small Hadoop (MapReduce) job that generated key-value pairs where both the key and value were json. The benchmark was done on an Ubuntu-based Thinkpad x200 with an SSD drive.

 $ ls -al medium.tsv
 -rw-r--r-- 1 amund amund 117362571 2012-04-25 15:36 medium.tsv
 $ wc medium.tsv
 212969   5835001 117362571 medium.tsv
 $ python
 >>> import atbr
 >>> a = atbr.Atbr()
 >>> a.load("medium.tsv")
 Inserting took - 1.178468 seconds
 Num new key-value pairs = 212969
 Speed: 180716.807959 key-value pairs per second
 Throughput: 94.803214 MB per second
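As a sanity check, the pairs-per-second figure follows directly from the wc line count and the load timing above; a small snippet with those values hard-coded reproduces it (throughput is computed the same way, from bytes loaded divided by time):

# figures taken from the wc listing and the load output above
num_pairs = 212969
seconds = 1.178468
print num_pairs / seconds  # ~180716.8, matching the reported key-value pairs per second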

Possible road ahead?
1) integrate with Tornado, to get a websocket and HTTP API (see the sketch below)
2) after 1), add support for sharding, e.g. using Apache ZooKeeper to control the shards.
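As an illustration of point 1, a minimal HTTP lookup API on top of atbr could look roughly like the sketch below. The handler, URL layout and port are assumptions for illustration, not part of atbr:

import atbr
import tornado.ioloop
import tornado.web

# load the store once at startup
mystore = atbr.Atbr()
mystore.load("keyvaluedata.tsv")

class LookupHandler(tornado.web.RequestHandler):
    def get(self, key):
        # return the value for the requested key, or 404 if it is missing
        if mystore.exists(key):
            self.write(mystore.get(key))
        else:
            self.send_error(404)

application = tornado.web.Application([
    (r"/get/(.*)", LookupHandler),
])

if __name__ == "__main__":
    application.listen(8888)
    tornado.ioloop.IOLoop.instance().start()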

Where can I find the code?

https://github.com/atbrox/atbr

Best regards,

Amund Tveit
Atbrox


5 Responses to atbr – large-scale in-memory hashtables (in Python)

  1. Pingback: atbr now has Apache Thrift support

  2. Ty says:

    If I use put to add, what do I use to remove an element, and to delete all the elements?

  3. Amund Tveit says:

    Hi, I haven’t created a delete API yet; I will add a comment to this post if/when I do.

    Best regards,
    Amund

  4. Oded R. says:

    Hello Amund,

    Is a release with pair removal (by key) planned?
    If not, can you please estimate how difficult it would be for me (as someone who isn’t familiar with the “insides” of atbr) to implement element removal myself?

    Thanks,
    – Oded

  5. Amund Tveit says:

    Hello Oded, unfortunately there is still no progress on a delete/removal API (and no current plans for one), but I believe it could be done relatively easily if you look into the code. I would recommend looking at atbr/atbr.h and atbr/atbrpy.i in order to add this.

    Best regards,
    Amund
