atbr now has Apache Thrift support


atbr (a large-scale, low-latency in-memory key-value store) now supports Apache Thrift for easier integration with other Hadoop services.

Thrift Example

Check out and install atbr
[code]
$ git clone git@github.com:atbrox/atbr.git
$ cd atbr
$ sudo ./INSTALL.sh
[/code]

Prerequisite: install/compile Apache Thrift (http://thrift.apache.org/)

Compile the atbr Thrift server and connect using the Python client
[code]
$ cd atbrthrift
$ make
$ ./atbr_thrift_server # c++ server
$ python test_atbr_thrift_client.py
[/code]

Python Thrift API example
[code]
from atbr_thrift_client import connect_to_atbr_thrift_service
service = connect_to_atbr_thrift_service("localhost", "9090")
service.load("keyvaluedata.tsv")
value = service.get("key1")
[/code]

Stay tuned for other updates on atbr.

Rough roadmap

  • Increased concurrency and threadsafety support
  • Increased reliability in sharded deployments (with Apache Zookeeper)
  • Simplified and automated sharded deployment on AWS and clusters
  • Benchmarks
  • Comparison with other storage alternatives (e.g. HBase, Redis, MongoDB, CouchDB and Cassandra)
  • End-to-end examples (from hadoop/mapreduce jobs to serving)
  • (in-memory) map(reduce) support with Lua or C++
  • Avro support
  • Large-scale graph processing example (ref: NetworkX)
  • Case studies
  • Add support for Judy Datastructure
  • Thrift-support (done)
  • Sharded websocket support (done) [blog post]
  • Memory-efficient key-value store (done) [blog post]

Documentation
atbr.atbrox.com

Best regards,

Amund Tveit (@atveit)
Atbrox


atbr – supports websocket-based sharding

atbr (a large-scale, low-latency in-memory key-value store) now supports websocket-based sharding for parallel deployments.

Websocket Sharding Example

Check out and install atbr
[code]
$ git clone git@github.com:atbrox/atbr.git
$ cd atbr
$ sudo ./INSTALL.sh
[/code]

Start 3 servers loaded with data
[code]
$ cd atbrserver
$ python atbr_server.py 8585 shard_data_1.tsv
$ python atbr_server.py 8686 shard_data_2.tsv
$ python atbr_server.py 8787 shard_data_3.tsv
[/code]

Start shard server talking to shards
[code]
$ python atbr_shard_server.py localhost:8585 \
localhost:8686 localhost:8787
[/code]

Connect to shard server and lookup key=key1
[code]
$ python atbr_websocket_cmdline_client.py key1
[/code]
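The shard server above routes each lookup to one of the shard backends. atbr's actual routing scheme isn't shown here, but a common approach is to hash the key and take it modulo the number of shards. A minimal sketch (names and addresses are illustrative, not atbr's API):

```python
import hashlib

# The three shard servers started above (illustrative addresses).
SHARDS = ["localhost:8585", "localhost:8686", "localhost:8787"]

def shard_for_key(key, shards=SHARDS):
    # md5 gives a hash that is stable across processes and runs
    # (unlike Python's built-in hash(), which is salted per process).
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]
```

The same key always maps to the same shard, so a lookup for key1 can be forwarded to exactly one backend.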

Stay tuned for other updates on atbr; here is a rough roadmap.

  • Increased concurrency and threadsafety support
  • Increased reliability in sharded deployments (with Apache Zookeeper)
  • Simplified and automated sharded deployment on AWS and clusters
  • Benchmarks
  • Comparison with other storage alternatives (e.g. HBase, Redis, MongoDB, CouchDB and Cassandra)
  • End-to-end examples (from hadoop/mapreduce jobs to serving)
  • (in-memory) map(reduce) support with Lua or C++
  • Thrift support
  • Avro support
  • Large-scale graph processing example (ref: NetworkX)
  • Case studies

Documentation
atbr.atbrox.com

Best regards,

Amund Tveit (@atveit)
Atbrox


atbr – large-scale in-memory hashtables (in Python)


Large-scale in-memory key-value stores are universally useful (e.g. to load and serve tsv-data created by hadoop/mapreduce jobs). In-memory key-value stores have low latency, and modern boxes have lots of memory (e.g. EC2 instances with 70GB RAM). If you look closely, many of the nosql stores depend heavily on huge amounts of RAM to perform well, so going to pure in-memory storage is only a natural evolution.

Scratching the itch
Python is currently undergoing a “new spring”, with many startups using it as a key language (e.g. Dropbox, Instagram, Path and Quora, to name a few prominent ones). They have probably also discovered that loading a lot of data into Python dictionaries is no fun; this is also the finding of this large-scale hashtable benchmark. The winner of that benchmark with respect to memory efficiency was Google’s opensource sparsehash project, and atbr is basically a thin swig wrapper around Google’s (memory-efficient) opensource sparsehash (written in C++). atbr also supports relatively efficient loading of tsv key-value files (tab-separated files), since loading mapreduce output data quickly is one of our main use cases.

prerequisites:

a) install google sparsehash (and densehash)

[code]
wget http://sparsehash.googlecode.com/files/sparsehash-2.0.2.tar.gz
tar -zxvf sparsehash-2.0.2.tar.gz
cd sparsehash-2.0.2
./configure && make && make install
[/code]

b) install swig

c) compile atbr

[code]make # creates _atbr.so and atbr.py ready to be used from python[/code]

python-api example

[code]
import atbr

# Create storage
mystore = atbr.Atbr()

# Load data
mystore.load("keyvaluedata.tsv")

# Number of key value pairs
print mystore.size()

# Get value corresponding to key
print mystore.get("key1")

# Return true if a key exists
print mystore.exists("key1")
[/code]
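If you want to prototype against this API before compiling the swig module, a plain-dict stand-in with the same four methods works as a drop-in for small datasets. This class is a sketch based only on the API shown above; it is not part of atbr (and, per the benchmark the post links to, a plain dict will use far more memory than sparsehash):

```python
import csv

class DictAtbr:
    """Plain-dict stand-in mirroring the atbr API shown above
    (load/size/get/exists). Illustrative only, not part of atbr."""

    def __init__(self):
        self._store = {}

    def load(self, path):
        # Load a two-column, tab-separated key-value file.
        with open(path, newline="") as f:
            for row in csv.reader(f, delimiter="\t"):
                if len(row) >= 2:
                    self._store[row[0]] = row[1]

    def size(self):
        return len(self._store)

    def get(self, key):
        return self._store.get(key)

    def exists(self, key):
        return key in self._store
```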

benchmark (loading)
Input for the benchmark was output from a small Hadoop (mapreduce) job that generated key-value pairs where both the key and the value were json. The benchmark was done on an Ubuntu-based Thinkpad x200 with an SSD drive.

[code]
$ ls -al medium.tsv
-rw-r--r-- 1 amund amund 117362571 2012-04-25 15:36 medium.tsv
[/code]
[code]
$ wc medium.tsv
212969 5835001 117362571 medium.tsv
[/code]
[code]
$ python
>>> import atbr
>>> a = atbr.Atbr()
>>> a.load("medium.tsv")
Inserting took - 1.178468 seconds
Num new key-value pairs = 212969
Speed: 180716.807959 key-value pairs per second
Throughput: 94.803214 MB per second
[/code]
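To try a comparable measurement on your own machine, a small script like the following generates a synthetic two-column tsv and times a plain-dict load as a baseline (swap the dict loop for atbr.Atbr().load(path) once atbr is compiled). File name, row count and value format here are illustrative, not the actual benchmark data:

```python
import time

def make_tsv(path, n):
    # Write n synthetic key/value rows, vaguely like mapreduce output
    # with json values (illustrative data, not the benchmark file).
    with open(path, "w") as f:
        for i in range(n):
            f.write('key%d\t{"field": %d}\n' % (i, i))

def time_dict_load(path):
    # Baseline: load the tsv into a plain Python dict and time it.
    start = time.time()
    store = {}
    with open(path) as f:
        for line in f:
            key, value = line.rstrip("\n").split("\t", 1)
            store[key] = value
    return len(store), time.time() - start
```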

Possible road ahead?
1) integrate with Tornado, to get websocket and HTTP APIs
2) after 1), add support for sharding, e.g. using Apache Zookeeper to control the shards.

Where can I find the code?

https://github.com/atbrox/atbr

Best regards,

Amund Tveit
Atbrox


Monodroid with Sencha Touch for App Development

Selecting development tools for an app depends on several criteria, e.g.

  1. Should the app run on multiple mobile/tablet/pad platforms? (e.g. Android, iOS, Windows 8 etc.)
  2. Access to native features
  3. Is the UI “game like” (e.g. custom graphics) or “business like” (e.g. standardized forms/input fields etc.)

Mono – the opensource version of the C#/.NET platform – seems like an interesting option with respect to 1. multiplatform mobile support, i.e. Monodroid for Android, Monotouch for iOS and C# for Windows 8. Regarding 2., both Monodroid and Monotouch claim full access to all native APIs. Regarding the type of UI (3.), “game like” UIs seem most efficiently developed using e.g. the Lua-based Corona SDK, but for “business like” UIs HTML5 seems to be a productive direction, e.g. with Sencha Touch combined with Phonegap.

[csharp]
using System;
using Android.App;
using Android.Content;
using Android.Runtime;
using Android.Views;
using Android.Graphics;
using Android.Webkit; // WebView
using Android.Widget;
using Android.OS;

namespace testulf
{
    [Activity (Label = "testulf", MainLauncher = true)]
    public class Activity1 : Activity
    {
        protected override void OnCreate (Bundle bundle)
        {
            base.OnCreate (bundle);

            var layout = new LinearLayout(ApplicationContext);
            var webView = new WebView(ApplicationContext);
            layout.AddView(webView);
            SetContentView(layout);
            webView.Settings.JavaScriptEnabled = true;
            webView.SetWebViewClient(new WebViewClient());
            webView.SetBackgroundColor(0);
            webView.LoadUrl("file:///android_asset/overlays/index.html");
        }
    }
}
[/csharp]


But when you need native mobile features beyond what Phonegap and Sencha Touch have to offer, and still want HTML5-based UIs, combining Monodroid (or Monotouch) for the native features with Sencha Touch for everything else seems like a potential approach.

This shows roughly how to integrate the two:

  1. Download and install Monodroid and Sencha
  2. Create a new Mono for Android Application (from Monodevelop), you now get a working example that can be deployed to the Android simulator
  3. Replace the content of Activity1.cs in the generated example with the C# code above (note: replace ‘testulf’ with your project name); it opens a WebView with Sencha content.
  4. Slightly alter the Sencha Touch overlays example (the one I tested with Sencha Touch 1.1.1) by copying sencha-touch.js and sencha-touch.css to the overlays directory and updating the references to them in the head of index.html
  5. Add the overlays example to Assets (from Monodevelop, right click on Assets)
  6. Compile and run on simulator.

Just running Sencha Touch in a browser can be useful, but even more useful is communicating from Sencha back to Monodroid. This can be done by starting a small web server from Monodroid, similar to what James Hughes suggests in “Rolling Your Own PhoneGap with MonoTouch”.

Best regards,
Amund Tveit (Atbrox)


Disclaimer: our blog has mainly covered “big data” (e.g. hadoop and search), but since small (touch) screens are increasingly important as generators and consumers of big data – and for us – we make (at least) this blog post an exception.


Distributed Tracer Bullet Development

Tracer Bullet Development


Tracer Bullet Development means finding the major “moving parts” of a software system and starting by writing just enough code to make those parts interact in a real manner (e.g. with direct API calls, websockets or REST APIs). As the system grows (with actual functionality, not just interaction), you keep the “tracer ammunition” flowing through the system, changing the internal interaction APIs (only) if needed.

Motivation for Tracer Bullet Development

  1. integration is the hardest word (paraphrase of an old tune)
  2. prevent future integration problems (working internal APIs from the start)
  3. have a working system at all times (though limited in the beginning)
  4. create non-overlapping tasks for software engineers (good management)

(Check out the book: Ship it! A Practical Guide to Successful Software Projects for details about this method)

Examples of Distributed Tracer Bullet Development

Let us assume you have a team of 10 excellent software engineers who have never worked together before, and simultaneously the task of creating a first, working version of a distributed (backend) system within a short period of time.

How would you solve the project and efficiently utilize all the developers? (i.e. no time for meet&greet offsite)

Splitting the work properly with tracer bullet development could be a start; let’s look at how it could be done for a few examples:

1. Massively Multiplayer Online Games
Massively Multiplayer Online Games, e.g. Zynga’s Farmville, Linden Lab’s Second Life, and BioWare/LucasArt’s Star Wars Old Republic – are complex distributed systems. So what can a high-level tracer bullet architecture for such a game look like? Services that might be needed are:

  1. GameWorldService – to deal with the game world, assuming its basic function is returning a graphic tile for a position (x, y, z)
  2. GameArtifactService – to deal with state of various “things” in the world (e.g. weapons/utilities), e.g. growth of plants.
  3. GameEconomyService – to deal with overall in-game economy and trade
  4. AvatarService – to deal with player avatars and non-player characters (monsters/bots) (i.e. active entities that operate in the GameWorldService and can alter the GameArtifactService)
  5. LogService – to log what happens in the game
  6. StatService – calculates/monitors various statistics about the game
  7. AIService – e.g. used by non-player characters for reasoning
  8. UserService – to deal with users (profiles, login/passwords etc, metainfo++)
  9. GameStateService – overall game state
  10. ChatService – for interaction between players
  11. ClientService – to deal with various software clients users use, e.g. ipad client, pc client
  12. CheatMalwareDetectionService – always someone looking to exploit a game

Already more services (12) than software engineers (10), but let us create the beginning of a draft of a tracer bullet definition in a JSON-like manner.
[sourcecode language="javascript" wraplines="false" collapse="false"]
tracerbullets = {
  "GameWorldService": {
    "dependencies": ["GameStateService", "LogService"],
    "defaultresponse": {"tiledata_as_json_base_64": ".."},
    "loadbalancedserveraddress": "gameworld.gamecomp.com"},

  "GameArtifactService": {
    "dependencies": ["GameStateService", "GameWorldService"],
    "defaultresponse": {"artifactinfo": ".."},
    "loadbalancedserveraddress": "gameartifacts.gamecomp.com"},

  "AvatarService": {
  },

  "GameEconomyService": {
  }
}
[/sourcecode]

Atbrox’ (internal) Tracer Bullet Development Tool – BabelShark
The game example resembles RPC (e.g. Avro) and various deployment type definitions (e.g. Chef or Puppet), but it focuses on specifying just enough information (and no more) to get the entire system up and running (empty, with default responses) under its appropriate host names (which can all run on one machine for testing, with either minor /etc/hosts file changes or a local DNS server). When the system is running, each request appends the default responses it receives to its own default response, so one can trace the path of e.g. REST/HTTP or websocket calls through the system (e.g. if a call to the GameWorldService uses both the GameStateService and the LogService as below, this will show up in the resulting JSON from the GameWorldService). As the (mock-like) default responses are gradually replaced with real services, they can be run as before, and once properly deployed you just remove the DNS entry in /etc/hosts (or the local DNS server) to get real data. Proxying external services (e.g. Amazon Web Services) can be done in a similar manner. Overall this makes it easier to bridge the development situation with the deployment situation.

In Atbrox we have an internal tool called BabelShark that takes an input tracer bullet definition (JSON) and creates Python-based tracer bullet system code (using Bret Taylor’s Tornado websocket support). It also creates a corresponding websocket commandline client and javascript/html clients, for ease of testing all components in the system. Technically it spawns one Tornado process per service (or per instance of a service, if more than one), dynamically finds available port numbers and communicates them back, creates a new /etc/hosts file with the requested host names per service (all pointing to localhost), and a kill shell script (note: you quickly get a lot of processes this way – even if the multicores are humming you can quickly overflow them – so it is nice to be able to kill them).
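The “append the received default responses” tracing idea can be sketched in a few lines of plain Python. Function and field names below are illustrative, not BabelShark’s actual (internal) API:

```python
def tracer_response(service_name, default_response, upstream_responses):
    # A stub service returns its own default response together with
    # everything it received from its dependencies, so the call path
    # is visible in the final JSON-like result.
    return {
        "service": service_name,
        "defaultresponse": default_response,
        "upstream": list(upstream_responses),
    }

# Simulate GameWorldService depending on GameStateService and LogService.
state = tracer_response("GameStateService", {"state": ".."}, [])
log = tracer_response("LogService", {"logged": ".."}, [])
world = tracer_response("GameWorldService",
                        {"tiledata_as_json_base_64": ".."},
                        [state, log])
```

Inspecting `world` shows both dependency responses nested inside it, i.e. the trace of the call path.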

Example 2. Search Query Handling
The prerequisite for search is the query, and a key task is to (quickly) understand the user’s intention with the query (before actually doing anything with the query, such as looking up results in an index).

A few questions need to be answered about the query: 1) what is the language of the query? 2) is the query spelled correctly in the given language? 3) what is the meaning of the query? 4) does the query have an ambiguous meaning (either wrt language or interpretation)? 5) what is the most likely meaning among the ambiguous ones? So what could tracer-bullet development for this look like?

[sourcecode language="javascript" wraplines="false" collapse="false"]
tracerbullets = {
  "LanguageDeterminator": {
    "dependencies": ["KlingonClassifier", "EnglishClassifier"],
    "defaultresponse": {"sortedlanguagesprobabilities": [{1.0: "English"}]}
  },

  "SpellingIsCorrect": {
    "dependencies": ["LanguageDeterminator", "KlingonSpellChecker"],
    "defaultresponse": {"isitspelledcorrectly": "yes"}
  },

  "MeaningDetermination": {
    "dependencies": ["LanguageDeterminator", "NameEntityDeterminator"],
    "defaultresponse": {"meaning": "just a string with no entities"}
  },

  "Disambiguator": {
    "dependencies": ["MeaningDetermination", ".."],
    // specialized for the query: Turkey – is it about the country
    // or about food (i.e. right before thanksgiving)
    "defaultresponse": {
      "disambiguatedprobability": [{0.9: "country"}, {0.1: "bird"}]
    }
  }
}
[/sourcecode]
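Wired up as stubs, the tracer bullets above form a runnable pipeline from day one. The sketch below follows the service and field names in the definition; the wiring itself (plain function calls rather than e.g. websockets) is an illustrative simplification:

```python
def language_determinator(query):
    # Default response: everything is English, with probability 1.0.
    return {"sortedlanguagesprobabilities": [{1.0: "English"}]}

def spelling_is_correct(query, language):
    # Default response: all queries are spelled correctly.
    return {"isitspelledcorrectly": "yes"}

def meaning_determination(query, language):
    return {"meaning": "just a string with no entities"}

def disambiguator(query, meaning):
    # Default response specialized for the query "Turkey".
    return {"disambiguatedprobability": [{0.9: "country"}, {0.1: "bird"}]}

def handle_query(query):
    # Run the stub services in dependency order and merge their responses.
    language = language_determinator(query)
    result = {}
    result.update(spelling_is_correct(query, language))
    meaning = meaning_determination(query, language)
    result.update(disambiguator(query, meaning))
    return result
```

Each stub can then be replaced by a real classifier or spell checker without changing `handle_query`.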

Conclusion

We have given an overview of tracer bullet development for a couple of distributed system cases, and have also described how our (internal) tool supports Distributed Tracer Bullet Development.

If you are keen to learn more and work with us here at Atbrox, please check out our jobs page. Atbrox is a bootstrapped startup working on big data (e.g. hadoop and mapreduce) and search (we also work with, and own parts of, a few other tech startups).

Best regards,

Amund Tveit (@atveit)
Atbrox (@atbrox)
