Main takeaways from Accel’s Big Data Conference

I attended the Accel Partners Big Data conference last week. It was a good event with many interesting people; a very crude estimate of the distribution: one third VCs/investors, one third startup tech people, and one third big-corp tech people.

My two key personal takeaways from the conference:

  1. Realtime processing: a hot topic, with many companies creating their own custom solutions, but none would object to having an exceptionally good opensource solution to gather around.
  2. Low-latency storage: an emerging topic – or as quoted from the talk by Andy Bechtolsheim (Sun/Arista/Granite/Kealia/HighBAR co-founder and early Google investor): “Hard Disk Drives are not keeping up. Flash solving this problem just in time”. The academic session also had interesting discussions about RAM-based storage.

I think Andy Bechtolsheim’s table titled “Memory Hierarchy is Not Changing” sums up the low-latency storage discussion quite well. I’ve taken the liberty of adding a column with rough prices per petabyte-month for RAM and SSD – the only tiers fit for both low latency AND big data (calculation: estimated purchase price divided by 12; note that this covers only the storage itself, not the hardware/network needed to run it). Note: I think Mr. Bechtolsheim could have listed sizes up to petabytes for both SSD and RAM.

Type of memory      | Size                | Latency                | $ per Petabyte-month* (k$)
L1 cache            | 64 KB               | ~4 cycles (2 ns)       | –
L2 cache            | 256 KB              | ~10 cycles (5 ns)      | –
L3 cache (shared)   | 8 MB                | 35-40+ cycles (20 ns)  | –
Main memory         | GBs up to terabytes | 100-400 cycles         | 411 (non-ECC) / 1,197 (ECC)
Solid state memory  | GBs up to terabytes | 5,000 cycles           | 94
Disk                | Up to petabytes     | 1,000,000 cycles       | –

*Storage price sources and calculations used

RAM (non-ECC): 16GB non-ECC (2x8GB) – price: $79, i.e. $79/16 per GB, $(79/16)K per TB, $(79/16)M per PB, $(79/16)M/12 per PB-month
RAM (ECC): 16GB ECC (1x16GB) – price: $229.98, i.e. $230/16 per GB, $(230/16)K per TB, $(230/16)M per PB, $(230/16)M/12 per PB-month.
SSD: 512GB – price: $579.99, i.e. $580/512 per GB, $(580/512)K per TB, $(580/512)M per PB, $(580/512)M/12 per PB-month.
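
For the curious, the arithmetic behind the last column of the table can be captured in a few lines of Python (the helper name is mine; prices and sizes as listed above):

# rough k$ per petabyte-month from the retail price of one storage unit
def petabyte_month_kusd(price_usd, size_gb, months=12):
    per_gb = price_usd / float(size_gb)  # $ per GB
    per_pb = per_gb * 1e6                # $ per PB (1 PB = 1e6 GB)
    return per_pb / months / 1000.0      # k$ per PB-month

print petabyte_month_kusd(79.0, 16)      # non-ECC RAM -> ~411
print petabyte_month_kusd(229.98, 16)    # ECC RAM     -> ~1,197
print petabyte_month_kusd(579.99, 512)   # SSD         -> ~94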

Conclusion

Since RAM-based storage is up to 50 times faster than SSD (latency-wise) but only roughly 4.3 to 12 times more expensive, it is likely to climb high on the agenda in settings where latency matter$ (all types of serving infrastructure, search, finance etc.). In absolute terms, the cost of petabytes of RAM is now within reach for all Fortune 1000 companies, at about $1.1M per month for the storage alone (ECC RAM). One interesting aspect of going RAM-only is that most systems using SSDs or disks already have a big RAM component on top, e.g. memcached or the caches in various nosql storages; by moving to RAM-only, things might become simpler (i.e. avoiding memory-vs-disk/ssd coherency issues and latency variations when the memory cache is missed).

Note 1: If you have other sources for interesting large-scale RAM and SSD prices I would appreciate it if you could add links to them in the comments below.

Note 2: If you’re interested in large-scale RAM-based key-value stores, check out our opensource project Atbr – github page: https://github.com/atbrox/atbr

Best regards,

Amund Tveit, co-founder of Atbrox (@atbrox)


atbr now has Apache Thrift support


atbr (a large-scale and low-latency in-memory key-value store) now supports Apache Thrift for easier integration with other Hadoop services.

Thrift Example

Check out and install atbr

$ git clone git@github.com:atbrox/atbr.git
$ cd atbr
$ sudo ./INSTALL.sh

Prerequisite: install/compile Apache Thrift – http://thrift.apache.org/

Compile an atbr thrift server and connect using the python client

$ cd atbrthrift
$ make
$ ./atbr_thrift_server &  # c++ server, run in the background (or a separate terminal)
$ python test_atbr_thrift_client.py 

Python thrift api example

from atbr_thrift_client import connect_to_atbr_thrift_service
service = connect_to_atbr_thrift_service("localhost", "9090")
service.load("keyvaluedata.tsv")
value = service.get("key1")
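
If you are curious what a helper like connect_to_atbr_thrift_service might do under the hood, here is a minimal sketch using the standard Thrift python library – note that the generated module name (atbr_thrift) and service class (Atbr) are my assumptions, not necessarily what the atbr codebase uses:

# minimal thrift client plumbing (assumed generated module/service names)
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from atbr_thrift import Atbr  # hypothetical thrift-generated module

def connect_to_atbr_thrift_service(host, port):
    # buffered socket transport + binary protocol is the common thrift setup
    transport = TTransport.TBufferedTransport(TSocket.TSocket(host, int(port)))
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = Atbr.Client(protocol)
    transport.open()
    return client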

Stay tuned for other updates on atbr.

Rough roadmap

  • Increased concurrency and threadsafety support
  • Increased reliability in sharded deployments (with Apache Zookeeper)
  • Simplified and automated sharded deployment on AWS and clusters
  • Benchmarks
  • Comparison with other storage alternatives (e.g. HBase, Redis, MongoDB, CouchDB and Cassandra)
  • End-to-end examples (from hadoop/mapreduce jobs to serving)
  • (in-memory) map(reduce) support with Lua or C++
  • Avro support
  • Large-scale graph processing example (ref: NetworkX)
  • Case studies
  • Add support for the Judy data structure
  • Thrift-support (done)
  • Sharded websocket support (done) [blog post]
  • Memory-efficient key-value store (done) [blog post]

Documentation
atbr.atbrox.com

Best regards,

Amund Tveit (@atveit)
Atbrox


atbr – supports websocket-based sharding

atbr (a large-scale and low-latency in-memory key-value store) now supports websocket-based sharding for parallel deployments.

Websocket Sharding Example

Check out and install atbr

$ git clone git@github.com:atbrox/atbr.git
$ cd atbr
$ sudo ./INSTALL.sh

Start 3 servers loaded with data

$ cd atbrserver
$ python atbr_server.py 8585 shard_data_1.tsv &
$ python atbr_server.py 8686 shard_data_2.tsv &
$ python atbr_server.py 8787 shard_data_3.tsv &

Start the shard server talking to the shards

  
$ python atbr_shard_server.py localhost:8585 \
          localhost:8686 localhost:8787 &

Connect to the shard server and look up key=key1

$ python atbr_websocket_cmdline_client.py key1
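
Under the hood, the routing problem a shard server solves boils down to mapping each key deterministically onto one of the shards. The actual atbr_shard_server.py may do this differently, but a minimal sketch of the idea (modulo hashing is my assumption) looks like this:

# minimal key-to-shard routing sketch (assumed: simple modulo hashing)
import zlib

SHARDS = ["localhost:8585", "localhost:8686", "localhost:8787"]

def shard_for_key(key):
    # stable hash so the same key always maps to the same shard
    h = zlib.crc32(key) & 0xffffffff
    return SHARDS[h % len(SHARDS)]

print shard_for_key("key1")

One nice property of deterministic routing is that clients and servers can agree on shard placement without coordination; the downside of plain modulo hashing is that adding a shard remaps most keys – which is where e.g. consistent hashing or a Zookeeper-managed shard map (see the roadmap below) comes in.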

Stay tuned for other updates on atbr; here is a rough roadmap.

  • Increased concurrency and threadsafety support
  • Increased reliability in sharded deployments (with Apache Zookeeper)
  • Simplified and automated sharded deployment on AWS and clusters
  • Benchmarks
  • Comparison with other storage alternatives (e.g. HBase, Redis, MongoDB, CouchDB and Cassandra)
  • End-to-end examples (from hadoop/mapreduce jobs to serving)
  • (in-memory) map(reduce) support with Lua or C++
  • Thrift support
  • Avro support
  • Large-scale graph processing example (ref: NetworkX)
  • Case studies

Documentation
atbr.atbrox.com

Best regards,

Amund Tveit (@atveit)
Atbrox


atbr – large-scale in-memory hashtables (in Python)


Large-scale in-memory key-value stores are universally useful (e.g. to load and serve tsv data created by hadoop/mapreduce jobs), in-memory key-value stores have low latency, and modern boxes have lots of memory (e.g. EC2 instances with 70GB RAM). If you look closely, many of the nosql stores depend heavily on huge amounts of RAM to perform nicely, so going to pure in-memory storage is only a natural evolution.

Scratching the itch
Python is currently undergoing a “new spring”, with many startups using it as a key language (e.g. Dropbox, Instagram, Path and Quora, to name a few prominent ones), but they have probably also discovered that loading a lot of data into python dictionaries is no fun; this is also the finding of this large-scale hashtable benchmark. The winner of that benchmark wrt memory efficiency was Google’s opensource sparsehash project, and atbr is basically a thin swig wrapper around Google’s (memory-efficient) opensource sparsehash (written in C++). Atbr also supports relatively efficient loading of tsv key-value files (tab-separated files), since loading mapreduce output data quickly is one of our main use cases.
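
For contrast, the naive pure-python approach that atbr is meant to replace looks roughly like the sketch below (assuming one tab-separated key-value pair per line, as in the tsv files mentioned above) – it works, but the per-entry overhead of python dicts and strings is what hurts at scale:

# naive tsv loading into a plain python dict (high per-entry memory overhead)
mystore = {}
for line in open("keyvaluedata.tsv"):
    # assumes every line has at least one tab separating key and value
    key, value = line.rstrip("\n").split("\t", 1)
    mystore[key] = value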

prerequisites:

a) install google sparsehash (and densehash)

wget http://sparsehash.googlecode.com/files/sparsehash-2.0.2.tar.gz
tar -zxvf sparsehash-2.0.2.tar.gz
cd sparsehash-2.0.2
./configure && make && make install

b) install swig

c) compile atbr

make # creates _atbr.so and atbr.py ready to be used from python

python-api example

import atbr

# Create storage
mystore = atbr.Atbr()

# Load data
mystore.load("keyvaluedata.tsv")

# Number of key value pairs
print mystore.size()

# Get value corresponding to key
print mystore.get("key1")

# Return true if a key exists
print mystore.exists("key1")

benchmark (loading)
Input for the benchmark was output from a small Hadoop (mapreduce) job that generated key-value pairs where both the key and the value were json. The benchmark was done on an Ubuntu-based Thinkpad X200 with an SSD drive.

 $ ls -al medium.tsv
 -rw-r--r-- 1 amund amund 117362571 2012-04-25 15:36 medium.tsv
 $ wc medium.tsv
 212969   5835001 117362571 medium.tsv
 $ python
 >>> import atbr
 >>> a = atbr.Atbr()
 >>> a.load("medium.tsv")
 Inserting took - 1.178468 seconds
 Num new key-value pairs = 212969
 Speed: 180716.807959 key-value pairs per second
 Throughput: 94.803214 MB per second

Possible road ahead?
1) Integrate with tornado to get websocket and http APIs (a rough sketch of this below)
2) After 1), add support for sharding, e.g. using Apache Zookeeper to control the shards.
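
As a hypothetical sketch of 1) – the route and handler names below are mine, not an existing atbr API – a minimal tornado-based http front for an atbr store could look like this:

# minimal sketch: http lookup API on top of atbr with tornado (assumed routes)
import atbr
import tornado.ioloop
import tornado.web

store = atbr.Atbr()
store.load("keyvaluedata.tsv")

class LookupHandler(tornado.web.RequestHandler):
    def get(self, key):
        # returns the value for the given key (empty response if not found)
        if store.exists(key):
            self.write(store.get(key))

application = tornado.web.Application([
    (r"/atbr/(.*)", LookupHandler),
])

if __name__ == "__main__":
    application.listen(8888)
    tornado.ioloop.IOLoop.instance().start()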

Where can I find the code?

https://github.com/atbrox/atbr

Best regards,

Amund Tveit
Atbrox


Monodroid with Sencha Touch for App Development

Selecting development tools for an app depends on several criteria, e.g.

  1. Should the app run on multiple mobile/tablet/pad platforms? (e.g. Android, iOS, Windows 8 etc.)
  2. Access to native features
  3. Is the UI “game like” (e.g. custom graphics) or “business like” (e.g. standardized forms/input fields etc.)

Mono – the opensource version of the C#/.net platform – seems like an interesting platform wrt 1. multiplatform mobile support, i.e. Monodroid for Android, Monotouch for iOS and C# for Windows 8, and regarding 2. access to native features, both Monodroid and Monotouch claim full access to all native APIs. Regarding 3. the type of UI, “game like” UIs seem most efficiently developed using e.g. the Lua-based Corona SDK, but for “business like” UIs html 5 seems to be a productive direction, e.g. with Sencha Touch combined with Phonegap.

using System;
using Android.App;
using Android.Content;
using Android.Runtime;
using Android.Views;
using Android.Graphics;
using Android.Webkit; // WebView
using Android.Widget;
using Android.OS;

namespace testulf
{
	[Activity (Label = "testulf", MainLauncher = true)]
	public class Activity1 : Activity
	{
		protected override void OnCreate (Bundle bundle)
		{
		    base.OnCreate (bundle);
		    
		    var layout = new LinearLayout(ApplicationContext);
		    var webView = new WebView(ApplicationContext);
		    layout.AddView(webView);
		    SetContentView(layout);
		    webView.Settings.JavaScriptEnabled = true;
		    webView.SetWebViewClient(new WebViewClient()); 
		    webView.SetBackgroundColor(0);
		    webView.LoadUrl("file:///android_asset/overlays/index.html");
		}
	}
}


But when you need other native mobile features than what Phonegap and Sencha Touch have to offer, and still want html 5 based UIs, combining Monodroid (or Monotouch) for the native features with Sencha Touch for everything else seems like a potential idea.

This shows roughly how to integrate the two:

  1. Download and install Monodroid and Sencha
  2. Create a new Mono for Android Application (from Monodevelop); you now get a working example that can be deployed to the Android simulator
  3. Replace the content of Activity1.cs in the generated example with the C# code above (note: replace ‘testulf’ with your project name); it opens a WebView with the Sencha content.
  4. Slightly alter the Sencha Touch overlays example (the one I tested with Sencha Touch 1.1.1) by copying sencha-touch.js and sencha-touch.css to the overlays directory and updating the references to them in the head of index.html
  5. Add the overlays example to Assets (from Monodevelop, right click on Assets)
  6. Compile and run on simulator.

Just running Sencha Touch in a browser can be useful, but even more useful is communicating from Sencha back to Monodroid; this can be done by starting a small web server from Monodroid, similar to what James Hughes suggests in “Rolling Your Own PhoneGap with MonoTouch”.

Best regards,
Amund Tveit (Atbrox)


Disclaimer: our blog has mainly covered “big data” (e.g. hadoop and search), but since small (touch) screens are increasingly important as generators and consumers of big data – and important for us – we make (at least) this blog post exception.
