Large-scale in-memory key-value stores are broadly useful (e.g. to load and serve TSV data created by Hadoop/MapReduce jobs). They have low latency, and modern machines have lots of memory (e.g. EC2 instances with 70GB RAM). If you look closely, many NoSQL stores depend heavily on large amounts of RAM to perform well, so going to pure in-memory storage is a natural evolution.
Scratching the itch
Python is currently undergoing a “new spring”, with many startups using it as a key language (e.g. Dropbox, Instagram, Path and Quora, to name a few prominent ones). They have probably also discovered that loading a lot of data into Python dictionaries is no fun; this is also the finding of this large-scale hashtable benchmark. The winner of that benchmark with respect to memory efficiency was Google’s open-source sparsehash project, and atbr is basically a thin SWIG wrapper around sparsehash (written in C++). Atbr also supports relatively efficient loading of TSV key-value files (tab-separated files), since loading MapReduce output data quickly is one of our main use cases.
a) install google sparsehash (and densehash)
wget http://sparsehash.googlecode.com/files/sparsehash-2.0.2.tar.gz
tar -zxvf sparsehash-2.0.2.tar.gz
cd sparsehash-2.0.2
./configure && make && make install
b) install swig
c) compile atbr
make # creates _atbr.so and atbr.py ready to be used from python
import atbr

# Create storage
mystore = atbr.Atbr()

# Load data
mystore.load("keyvaluedata.tsv")

# Number of key value pairs
print(mystore.size())

# Get value corresponding to key
print(mystore.get("key1"))

# Return true if a key exists
print(mystore.exists("key1"))
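To make the expected input and the semantics of these four calls concrete without building the extension, here is a minimal pure-Python sketch. `DictStore` is a hypothetical stand-in written for illustration, not atbr's implementation. It assumes one key and one value per line, separated by the first tab, and that a repeated key overwrites the earlier value; atbr's actual handling of such cases may differ.

```python
import os
import tempfile


class DictStore(object):
    """Hypothetical dict-backed stand-in mirroring the calls used above."""

    def __init__(self):
        self._data = {}

    def load(self, path):
        """Read 'key<TAB>value' lines and return the number of new keys."""
        added = 0
        with open(path) as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue  # skip blank lines
                # Assumption: split on the first tab only, so values may contain tabs.
                key, _, value = line.partition("\t")
                if key not in self._data:
                    added += 1
                self._data[key] = value  # assumption: the last value wins
        return added

    def size(self):
        return len(self._data)

    def get(self, key):
        return self._data[key]

    def exists(self, key):
        return key in self._data


# Write a tiny two-line TSV file to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "keyvaluedata.tsv")
with open(path, "w") as handle:
    handle.write("key1\tvalue1\n")
    handle.write("key2\tvalue2\n")

store = DictStore()
store.load(path)
print(store.size())          # 2
print(store.get("key1"))     # value1
print(store.exists("key3"))  # False
```

A plain dictionary like this is exactly what becomes painful at scale, which is the motivation for backing the same interface with sparsehash instead.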
Input for the benchmark was output from a small Hadoop (MapReduce) job that generated key-value pairs where both the key and the value were JSON. The benchmark was run on an Ubuntu-based ThinkPad X200 with an SSD drive.
$ ls -al medium.tsv
-rw-r--r-- 1 amund amund 117362571 2012-04-25 15:36 medium.tsv

$ wc medium.tsv
212969 5835001 117362571 medium.tsv

$ python
>>> import atbr
>>> a = atbr.Atbr()
>>> a.load("medium.tsv")
Inserting took - 1.178468 seconds
Num new key-value pairs = 212969
Speed: 180716.807959 key-value pairs per second
Throughput: 94.803214 MB per second
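The reported figures can be cross-checked from the numbers in the session above, and the sketch below recomputes them. The pairs-per-second figure is simply the number of pairs divided by the elapsed time. The throughput interpretation is an inference, not something the source states: the reported value appears consistent with counting the file's bytes minus one newline per line, with "MB" meaning 2**20 bytes.

```python
# Figures copied from the ls/wc output and the load() log above.
file_bytes = 117362571
num_pairs = 212969      # also the number of lines in medium.tsv
seconds = 1.178468      # elapsed time as printed (rounded to microseconds)

pairs_per_second = num_pairs / float(seconds)

# Assumption (inferred, not confirmed by the source): newline characters are
# not counted, and one MB is taken to be 2**20 bytes.
payload_bytes = file_bytes - num_pairs
mb_per_second = payload_bytes / float(2 ** 20) / seconds

print("%.1f key-value pairs per second" % pairs_per_second)
print("%.2f MB per second" % mb_per_second)
```

Because the elapsed time in the log is rounded, the recomputed values should land within rounding distance of the logged ones rather than match every digit.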
Possible road ahead?
1) integrate with Tornado, to get a WebSocket and HTTP API
2) after 1), add support for sharding, e.g. using Apache ZooKeeper to control the shards.
Where can I find the code?