
Large-scale in-memory key-value stores are broadly useful, e.g. for loading and serving tsv data created by Hadoop/MapReduce jobs. In-memory key-value stores have low latency, and modern machines have plenty of memory (e.g. EC2 instances with 70GB of RAM). If you look closely, many NoSQL stores already depend on huge amounts of RAM to perform well, so moving to pure in-memory storage is a natural evolution.
Scratching the itch
Python is currently enjoying a “new spring”, with many startups using it as a key language (e.g. Dropbox, Instagram, Path, and Quora, to name a few prominent ones), but they have probably also discovered that loading a lot of data into Python dictionaries is no fun. This is also the finding of this large-scale hashtable benchmark. The winner of that benchmark with respect to memory efficiency was Google’s open-source sparsehash project (written in C++), and atbr is basically a thin SWIG wrapper around it. Atbr also supports relatively efficient loading of tsv key-value files (tab-separated files), since loading MapReduce output data quickly is one of our main use cases.
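For reference, the plain-dict approach that atbr is meant to improve on (memory-wise) looks roughly like the sketch below; the filename and the one-pair-per-line tab-separated format are illustrative. Much of the memory overhead here comes from every key and value becoming a separate Python string object, on top of the dict itself.

# Baseline for comparison: load a tab-separated key/value file into a plain
# Python dict. Filename and format are illustrative.
plain_store = {}
with open("keyvaluedata.tsv") as f:
    for line in f:
        key, value = line.rstrip("\n").split("\t", 1)
        plain_store[key] = value

print len(plain_store)          # number of key-value pairs
print plain_store.get("key1")   # value for "key1", or None if missing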
prerequisites:
a) install google sparsehash (and densehash)
wget http://sparsehash.googlecode.com/files/sparsehash-2.0.2.tar.gz
tar -zxvf sparsehash-2.0.2.tar.gz
cd sparsehash-2.0.2
./configure && make && make install
b) install swig
c) compile atbr
make # creates _atbr.so and atbr.py ready to be used from python
python-api example
import atbr
# Create storage
mystore = atbr.Atbr()
# Load data
mystore.load("keyvaluedata.tsv")
# Number of key value pairs
print mystore.size()
# Get value corresponding to key
print mystore.get("key1")
# Return true if a key exists
print mystore.exists("key1")
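The load() call expects a tab-separated file with one key-value pair per line (the exact format is defined by atbr's loader; the file below is just a minimal illustration so the example above can be run end to end):

# Create a tiny tab-separated key/value file for the example above.
# Contents are illustrative: one "key<TAB>value" pair per line.
with open("keyvaluedata.tsv", "w") as f:
    f.write("key1\tvalue1\n")
    f.write("key2\tvalue2\n")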
benchmark (loading)
Input for the benchmark was output from a small Hadoop (MapReduce) job that generated key-value pairs where both the key and the value were JSON. The benchmark was run on an Ubuntu-based Thinkpad X200 with an SSD drive.
$ ls -al medium.tsv
-rw-r--r-- 1 amund amund 117362571 2012-04-25 15:36 medium.tsv
$ wc medium.tsv
  212969  5835001 117362571 medium.tsv
$ python
>>> import atbr
>>> a = atbr.Atbr()
>>> a.load("medium.tsv")
Inserting took - 1.178468 seconds
Num new key-value pairs = 212969
Speed: 180716.807959 key-value pairs per second
Throughput: 94.803214 MB per second
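As a rough sanity check, the reported rates follow from the line count, file size and load time above (assuming the MB figure means 2**20 bytes): the pairs-per-second figure matches exactly, and the throughput figure lands within about 0.2 MB/s of the reported value.

# Rough arithmetic check of the reported benchmark numbers.
pairs = 212969            # lines in medium.tsv (from wc)
size_bytes = 117362571    # file size in bytes (from ls)
seconds = 1.178468        # load time reported by atbr

print pairs / seconds               # ~180716.8 key-value pairs per second
print size_bytes / seconds / 2**20  # ~95.0 MB per second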
Possible road ahead?
1) Integrate with Tornado to get a WebSocket and HTTP API (see the sketch below).
2) After 1), add support for sharding, e.g. using Apache ZooKeeper to control the shards.
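A minimal sketch of what item 1 could look like, using Tornado's standard web framework: the URL layout, port and input file are made up for illustration, and only the atbr calls shown in the API example above (load, get, exists) are used. This is not part of atbr.

# Sketch of a possible Tornado front-end for an atbr store (illustrative only).
# GET /get/<key> returns the value for <key>, or 404 if the key is missing.
import tornado.ioloop
import tornado.web

import atbr

store = atbr.Atbr()
store.load("keyvaluedata.tsv")  # illustrative input file

class GetHandler(tornado.web.RequestHandler):
    def get(self, key):
        if store.exists(key):
            self.write(store.get(key))
        else:
            self.set_status(404)

application = tornado.web.Application([
    (r"/get/(.*)", GetHandler),
])

if __name__ == "__main__":
    application.listen(8888)
    tornado.ioloop.IOLoop.instance().start()

Such a process could then be queried with e.g. curl http://localhost:8888/get/key1, and item 2) would amount to running several of these behind a router, with ZooKeeper controlling the shards.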
Where can I find the code?
https://github.com/atbrox/atbr
Best regards,
Amund Tveit
Atbrox