Apr 02

Memkite, a mobile/wearable search startup that Atbrox has helped foster, describes the technical feasibility of building the Hitchhiker’s Guide to the Galaxy. Check out the blog post at:


Aug 26

Atbrox is one of 8 European partners in a research project on cloud computing. This is a great opportunity for us to learn about, and help out in, making cloud computing more efficient.

Below is a translation of the project description from the leading partner, the Department of Computer Science (IFI), The Faculty of Mathematics and Natural Sciences, University of Oslo:

The planet’s data storage and processing is about to move up in the clouds. Sharing and rental of computing resources across geographic boundaries creates new opportunities, especially for companies who can now access the computing power they couldn’t previously afford.

Professor Einar Broch Johnsen at IFI has received financial support from the European Commission to conduct a research project to make the transition to the cloud more attractive, especially for industry. The main advantage of cloud computing is that you use and pay for only what you need. But the ability of a business to predict and estimate resource usage in the design phase of a project is not nearly well enough developed, which can easily lead to serious miscalculations. ENVISAGE will try to change this. The ENVISAGE project has eight partners in five countries, and its main objective is to facilitate the development of virtualized services. By building parts of the legal basis of the service agreement between the customer and the provider into the system, the customer/business can more easily fine-tune its consumption and thereby save time and money. Potential users of ENVISAGE’s technology are companies that develop software. The technology will give them the opportunity to improve their utilization of cloud resources. The benefits of this are obvious, and by being at the forefront of this development the project hopes to help businesses improve profitability significantly. ENVISAGE will run until autumn 2016 and is funded through the EU 7th Framework Programme.

Best regards,
Amund Tveit

May 28

The purpose of Continuous Deployment is to increase quality and efficiency;
see e.g. The Software Revolution Behind LinkedIn’s Gushing Profits, or read on.

This posting presents an overview of Atbrox’ ongoing work on Automated Continuous Deployment. We develop in several languages depending on project or product, e.g. C/C++ (typically with SWIG combined with Python, or combined with Objective-C), C#, Java (typically Hadoop/MapReduce-related) and Objective-C (iOS). But most of our code is in Python (together with HTML/Javascript for frontends and APIs), so this posting will primarily show Python-centric continuous deployment with Jenkins (the total flow) and give some more detail on testing Tornado apps with Selenium.

Continuous Deployment of a Python-based Web Service / API

Many of the projects we develop involve creating an HTTP/REST or websocket API that, generically speaking, “does something with data” and has a corresponding UI in Javascript/HTML. The typical building blocks of such a service are shown in the figure:

The flow is roughly as follows

  1. An Atbrox developer submits code into a git repo (e.g. Bitbucket.org or Github.com repo)
  2. Jenkins picks up the change (by notification from git or by polling)
  3. Tests are run, e.g.
    py.test -v --junitxml=result.xml --cov-report html --cov-report xml --cov .
    1. Traditional Python unit tests
    2. Tornado web app asynchronous tests – http://www.tornadoweb.org/en/stable/testing.html
    3. Selenium UI Tests (e.g. with PhantomJS or xvfb/pyvirtualdisplay)
    4. Various metrics, e.g. test coverage, lines of code (sloccount), code duplication (PMD) and static analysis (e.g. pylint or pychecker)
  4. If tests and metrics are ok:
    1. provision cloud virtual machines (currently AWS EC2) if needed with fabric and boto, e.g.
      fab service launch
    2. deploy to provisioned or existing machines with fabric and chef (solo), e.g.
      fab service deploy
  5. Happy customer (and Atbrox developer). Go to 1.

Example of selenium test of Tornado Web Apps with PhantomJS

Tornado is a Python-based app server that supports Websocket and HTTP (it was originally developed by Bret Taylor while he was at FriendFeed). In addition to the Python-based Tornado apps, you typically write a mix of Javascript code and HTML templates for the frontend. The following example shows how Selenium tests for Tornado can be run:

Utility methods for starting a Tornado application and picking a port for it

import os
import multiprocessing
import tornado.ioloop
import tornado.httpserver

def create_process(port, queue, boot_function, application, name,
                   instance_number, service):
    # run boot_function in a child process; it reports its port back via queue
    p = multiprocessing.Process(target=boot_function,
                                args=(queue, port,
                                      application, name,
                                      instance_number, service))
    return p

def start_application_server(queue, port, application, name,
                             instance_number, service):
    http_server = tornado.httpserver.HTTPServer(application)
    http_server.listen(port)
    actual_port = port
    if port == 0: # special case, an available port is picked automatically
        # only pick first! (for now)
        assert len(http_server._sockets) > 0
        for s in http_server._sockets:
            actual_port = http_server._sockets[s].getsockname()[1]
            break
    pid = os.getpid()
    ppid = os.getppid()
    print("INTERNAL: actual_port = %s" % actual_port)
    info = {"name": name, "instance_number": instance_number,
            "port": actual_port, "pid": pid, "ppid": ppid,
            "service": service}
    # hand the connection details to the parent, then serve until terminated
    queue.put(info)
    tornado.ioloop.IOLoop.instance().start()

Example Tornado HTTP Application Class with an HTML form

import tornado.web

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        # serve a minimal page with one form; the test rewrites the form action
        html = """
<html>
<head><title>form title</title></head>
<body>
<form name="input" action="http://localhost" method="post" id="formid">
Query: <input type="text" name="query" id="myquery">
<input type="submit" value="Submit" id="mybutton">
</form>
</body>
</html>
"""
        self.write(html)

    def post(self):
        self.write("post returned")

Selenium unit test for the above Tornado class

import unittest
from selenium import webdriver

# written against the Selenium 2 API (PhantomJS driver, find_element_by_id)
class MainHandlerTest(unittest.TestCase):
    def setUp(self):
        self.application = tornado.web.Application([
            (r"/", MainHandler),
        ])
        self.queue = multiprocessing.Queue()
        self.server_process = create_process(0, self.queue, start_application_server,
                                             self.application, "mainapp", 123, "myservice")
        self.server_process.start()
        self.driver = webdriver.PhantomJS('/usr/local/bin/phantomjs')

    def testFormSubmit(self):
        # the server process reports its dynamically assigned port through the queue
        data = self.queue.get()
        url = "http://localhost:%s" % (data['port'])
        self.driver.get(url)
        assert "form title" in self.driver.title
        element = self.driver.find_element_by_id("formid")
        # since the port is dynamically assigned, the form action must be updated with it in order to work
        self.driver.execute_script("document.getElementById('formid').action='http://localhost:%s'" % (data['port']))
        # fill in the query, submit the form and check the result
        self.driver.find_element_by_id("myquery").send_keys("a random query")
        self.driver.find_element_by_id("mybutton").click()
        assert 'post returned' in self.driver.page_source

    def tearDown(self):
        self.driver.quit()
        self.server_process.terminate()

if __name__ == "__main__":
    unittest.main()
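
The core of the test flow (start a server on an OS-assigned port, submit the form, check the reply) can also be tried without Tornado, Selenium or PhantomJS. The following is a minimal sketch that swaps in Python 3's standard-library `http.server` and `urllib` for those tools; `FormHandler` and `post_query` are hypothetical names, and the handler only mimics the "post returned" reply of the Tornado class above.

```python
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class FormHandler(BaseHTTPRequestHandler):
    """Stand-in for MainHandler.post: reads the form body, answers with a fixed string."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # consume the submitted form fields
        body = b"post returned"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the test output quiet

def post_query(query):
    """Start the server on an OS-assigned port, submit the form once, return the reply."""
    server = HTTPServer(("127.0.0.1", 0), FormHandler)  # port 0: the OS picks a free port
    port = server.server_address[1]
    thread = threading.Thread(target=server.serve_forever)
    thread.daemon = True
    thread.start()
    try:
        data = urllib.parse.urlencode({"query": query}).encode("ascii")
        # bypass any configured HTTP proxy, since the server is local
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        response = opener.open("http://127.0.0.1:%d/" % port, data, timeout=5)
        try:
            return response.read().decode("ascii")
        finally:
            response.close()
    finally:
        server.shutdown()
        server.server_close()
```

Unlike the Selenium test, this does not execute any Javascript or render the page, so it only covers the HTTP side of the form submission.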

This posting has given an overview of Atbrox’ (in-progress) Python-centric continuous deployment setup, with some more detail on how to test Tornado web apps with Selenium. There are lots of inspirational and relatively recent articles and presentations about continuous deployment; in particular we recommend that you check out:

  1. Etsy’s slideshare about continuous deployment and delivery
  2. the Wired article about The Software Revolution Behind LinkedIn’s Gushing Profits
  3. Continuous Deployment at Quora

Please let us know if you have any comments or questions (comments to this blog post or mail to info@atbrox.com)

Best regards,
The Atbrox Team

Side note: We’re proponents of Python and bullish on it, and it is inspirational to observe the trend that several major Internet/Mobile startups/companies are using it for their backend development, e.g. Instagram, Path, Quora, Pinterest, Reddit, Disqus, Mozilla and Dropbox. The largest Python-based backends probably serve more traffic than 99.9% of the world’s web and mobile sites, and that is usually sufficient capability for most projects.

May 16

This posting is a follow-up to the large-scale low-latency (RAM-based) storage related price estimates in my previous posting Main takeaways from Accel’s Big Data Conference.

Assume you were to store and index large amounts of social network updates in-memory, e.g. tweets.

1) fetch some tweets

curl "https://stream.twitter.com/1/statuses/sample.json?delimited=length" -uAnyTwitterUser:Password > yourfilename

2) gather some stats about tweets

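import json
import zlib

uncompressed_lengths = []
compressed_lengths = []
num_tokens_per_tweet = []
num_unique_tokens_per_tweet = []
num_kept_tweets = 0
all_tokens = []  # lengths of all tokens seen

# read raw bytes so that lengths are in bytes and zlib can compress each line
for line in open('yourfilename', 'rb'):
    # skip non-json lines returned by APIs (lengths)
    if not line.startswith(b"{"):
        continue

    jline = json.loads(line.decode("utf-8"))

    text = jline.get("text", " ").lower()

    # skips - for simplicity - tweets that can't be space-tokenized
    if " " not in text:
        continue

    # tweets with metadata
    uncompressed_lengths.append(len(line))
    compressed_lengths.append(len(zlib.compress(line)))

    # token calculations
    tokens = text.split(" ")
    token_lengths = [len(token) for token in tokens]
    all_tokens.extend(token_lengths)
    num_tokens_per_tweet.append(len(tokens))
    num_unique_tokens_per_tweet.append(len(set(tokens)))

    num_kept_tweets += 1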

avg_uncompressed_length = (sum(uncompressed_lengths)+0.0)/num_kept_tweets
avg_compressed_length = (sum(compressed_lengths)+0.0)/num_kept_tweets
avg_num_tokens = (sum(num_tokens_per_tweet)+0.0)/num_kept_tweets
avg_num_unique_tokens = (sum(num_unique_tokens_per_tweet)+0.0)/num_kept_tweets
avg_token_length = (sum(all_tokens)+0.0)/len(all_tokens)

print "average uncompressed length = ", avg_uncompressed_length
print "average compressed length = ", avg_compressed_length
print "average num tokens = ", avg_num_tokens
print "average num unique tokens = ", avg_num_unique_tokens
print "average token length = ", avg_token_length
print "number of tweets = ", num_kept_tweets

Output for my small random tweet sample

average uncompressed length =  2099.60084926
average compressed length =  848.08492569
average num tokens =  8.91507430998
average num unique tokens =  8.33121019108
average token length =  5.44582043344
number of tweets =  471
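
To see what the per-tweet measurements look like in isolation, here is a minimal sketch on a single made-up tweet. The `tweet_stats` helper and the sample tweet are hypothetical; a real line from the streaming API carries far more metadata, which is why the averages above are around 2 KB uncompressed.

```python
import json
import zlib

def tweet_stats(raw_line):
    """Return sizes and token counts for one JSON-encoded tweet given as bytes."""
    tweet = json.loads(raw_line.decode("utf-8"))
    tokens = tweet.get("text", "").lower().split(" ")
    return {
        "uncompressed": len(raw_line),
        "compressed": len(zlib.compress(raw_line)),
        "num_tokens": len(tokens),
        "num_unique_tokens": len(set(tokens)),
        "avg_token_length": sum(len(t) for t in tokens) / float(len(tokens)),
    }

# made-up sample; only the "text" field is used for the token statistics
sample = json.dumps({"id": 1, "text": "to be or not to be"}).encode("utf-8")
stats = tweet_stats(sample)
print(stats)
```

Note that zlib adds a small fixed overhead, so a very short input like this sample may not shrink at all; the compression ratio only becomes meaningful on full-size tweet records.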

Calculate based on published amounts of tweets – 340M tweets per day, ref: thenextweb.

num_tweets_per_day = 340000000
one_gigabyte = 1024*1024*1024
keysize = 64/8 # 64 bit keys

hash_overhead = 2.0/8 # 2 bit overhead, assuming memory-efficient hashtable

storage_per_day_in_gigabytes = num_tweets_per_day*avg_compressed_length/one_gigabyte + num_tweets_per_day*(keysize+hash_overhead)/one_gigabyte

ram_cost_kUSD_per_petabyte_month = 1197
ram_cost_kUSD_per_terabyte_month = ram_cost_kUSD_per_petabyte_month/1000.0
ram_cost_USD_per_terabyte_day = 1000*ram_cost_kUSD_per_terabyte_month/31

storage_per_day_in_terabytes = storage_per_day_in_gigabytes/1024.0
storage_per_week_in_terabytes = 7*storage_per_day_in_terabytes
storage_per_month_in_terabytes = 31*storage_per_day_in_terabytes
storage_per_year_in_terabytes = 365*storage_per_day_in_terabytes

print "storage per day in TB = %f - RAM-cost (per day) %f USD" % (storage_per_day_in_terabytes, storage_per_day_in_terabytes*ram_cost_USD_per_terabyte_day)
print "storage per week in TB = %d - RAM-cost (per day) %f kUSD" % (storage_per_week_in_terabytes, 7*storage_per_day_in_terabytes*ram_cost_USD_per_terabyte_day/1000)
print "storage per month in TB = %d - RAM-cost (per day) %f kUSD - RAM-cost (per year) %f Million USD" % (storage_per_month_in_terabytes, storage_per_month_in_terabytes*ram_cost_USD_per_terabyte_day/1000, 365*storage_per_month_in_terabytes*ram_cost_USD_per_terabyte_day/(1000*1000))
print "storage per year in TB = %d - RAM cost (per day) %f kUSD - RAM cost (per year) %f Million USD" % (storage_per_year_in_terabytes, storage_per_year_in_terabytes*ram_cost_USD_per_terabyte_day/1000, storage_per_year_in_terabytes*ram_cost_USD_per_terabyte_day/(1000*1000)*365)

Output (based on estimates from my small random tweet sample)

storage per day in TB = 0.264803 - RAM-cost (per day) 10.224809 USD
storage per week in TB = 1 - RAM-cost (per day) 0.071574 kUSD
storage per month in TB = 8 - RAM-cost (per day) 0.316969 kUSD - RAM-cost (per year) 0.115694 Million USD
storage per year in TB = 96 - RAM cost (per day) 3.732055 kUSD - RAM cost (per year) 1.362200 Million USD
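
The unit conversions behind these costs are easy to get wrong, so here is the same arithmetic as one small function, using only numbers from this posting (1197 kUSD per petabyte-month, a 31-day month, 0.264803 TB of tweets per day). The function name is hypothetical, and like the calculation above it treats 1 PB as 1000 TB.

```python
def ram_cost_usd_per_day(terabytes, kusd_per_petabyte_month=1197.0, days_per_month=31):
    """Cost in USD of keeping the given number of terabytes in RAM for one day."""
    # kUSD per PB-month -> USD per TB-month: x1000 (kUSD to USD), /1000 (PB to TB)
    usd_per_terabyte_month = kusd_per_petabyte_month * 1000.0 / 1000.0
    return terabytes * usd_per_terabyte_month / days_per_month

per_day_tb = 0.264803  # storage per day in TB, from the output above
print("one day of tweets:  %.2f USD per day" % ram_cost_usd_per_day(per_day_tb))
print("one year of tweets: %.2f USD per day" % ram_cost_usd_per_day(365 * per_day_tb))
```

With these inputs the function reproduces the figures above: roughly 10 USD per day for one day of tweets and roughly 3732 USD per day for a full year of tweets.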

3) Index calculations (upper bound)

# (extremely naive/stupid/easy-to-estimate-with) assumptions:
#   see e.g. http://cis.poly.edu/~hyan/sigIR-position.pdf for more realistic representations
# 1) the unique terms of any single tweet do not occur in other tweets
# 2) there are no new terms from one day to another
#    (i.e. the posting list per term increases on average by 1 (64 bit tweet id) every day)
# 3) the posting lists are not compressed, i.e. storing 64 bit per list entry
# 4) tokens themselves are keys
# 5) no ranking/metadata/ngrams etc. for the index
token_key_overhead = 2.0/8
num_tokens_in_index = num_tweets_per_day*avg_num_unique_tokens

# each tweet provides an update to avg_num_unique_tokens entries in index

key_contribution = num_tokens_in_index*(avg_token_length + token_key_overhead)

index_size_per_day = key_contribution + num_tweets_per_day*avg_num_unique_tokens*64/8
index_size_per_week = key_contribution + num_tweets_per_day*avg_num_unique_tokens*7*64/8
index_size_per_month = key_contribution + num_tweets_per_day*avg_num_unique_tokens*31*64/8
index_size_per_year = key_contribution + num_tweets_per_day*avg_num_unique_tokens*365*64/8

# the sizes above are in bytes; 1 TB = 1024^4 bytes
one_terabyte = 1024.0*1024*1024*1024

index_size_per_day_in_terabytes = index_size_per_day/one_terabyte
index_size_per_week_in_terabytes = index_size_per_week/one_terabyte
index_size_per_month_in_terabytes = index_size_per_month/one_terabyte
index_size_per_year_in_terabytes = index_size_per_year/one_terabyte

# assuming slightly better encoding of posting lists, e.g. average of 1 byte per entry would give
better_encoded = index_size_per_year_in_terabytes/8

print "index size per week in terabytes = %f - RAM-cost (per day) %f kUSD" % (index_size_per_week_in_terabytes, index_size_per_week_in_terabytes*ram_cost_USD_per_terabyte_day/1000)
print "index size per month in terabytes = %f - RAM-cost (per day) %f kUSD" % (index_size_per_month_in_terabytes, index_size_per_month_in_terabytes*ram_cost_USD_per_terabyte_day/1000)
print "index size per year in terabytes = %f - RAM-cost (per day) %f kUSD" % (index_size_per_year_in_terabytes, index_size_per_year_in_terabytes*ram_cost_USD_per_terabyte_day/1000)

print "index size per year in terabytes (better encoding) = %f - RAM-cost (per day) %f kUSD" % (better_encoded, better_encoded*ram_cost_USD_per_terabyte_day/1000)

Index estimate outputs

index size per week in terabytes = 0.158944 - RAM-cost (per day) 0.006137 kUSD
index size per month in terabytes = 0.653583 - RAM-cost (per day) 0.025237 kUSD
index size per year in terabytes = 7.537310 - RAM-cost (per day) 0.291037 kUSD
index size per year in terabytes (better encoding) = 0.942164 - RAM-cost (per day) 0.036380 kUSD
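
The "better encoding" assumption (an average of 1 byte per posting-list entry instead of 8) is in the spirit of common index compression techniques. As an illustration only, and not necessarily the encoding the estimate had in mind, here is a minimal sketch of delta plus variable-byte encoding: sorted ids are stored as gaps, and each gap is written in as few 7-bit groups as it needs. All function names are hypothetical.

```python
def encode_varint(n):
    """Encode a non-negative int in 7-bit groups, least significant group first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_posting_list(sorted_ids):
    """Store the first id and then the gaps between consecutive ids as varints."""
    out = bytearray()
    previous = 0
    for doc_id in sorted_ids:
        out.extend(encode_varint(doc_id - previous))
        previous = doc_id
    return bytes(out)

def decode_posting_list(data):
    """Inverse of encode_posting_list: rebuild the ids by summing the decoded gaps."""
    ids, current, value, shift = [], 0, 0, 0
    for byte in bytearray(data):
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += value
            ids.append(current)
            value, shift = 0, 0
    return ids

ids = [1000, 1005, 1100, 1101]
encoded = encode_posting_list(ids)
print("%d bytes instead of %d" % (len(encoded), 8 * len(ids)))
```

How close one gets to 1 byte per entry depends on the gaps: frequent terms have small gaps and compress well, while rare terms with large gaps between 64 bit tweet ids need several bytes per entry.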

Keeping 1 year’s worth of tweets (including metadata) and (a crude) index of them in-memory is costly, but not too bad: about 1.36 million USD to keep 1 year’s worth of tweets (124 billion tweets) for 1 year in a (distributed) in-memory hashtable (keeping the same amount of tweets in the same hashtable for one day costs approximately 3732 USD). The index size estimates are very rough (check out this paper for more realistic representations). The energy costs (to maintain and refresh the RAM) would add a further 5-25% (see comments on previous blog post).

Q: So, is it time to reconsider using hard drives and SSDs and consider going for RAM instead?
A: Yes, at least consider it and combine it with Hadoop. Check out Stanford’s RAMCloud project and their paper The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM. There is still plenty of room for innovation in very-large-scale in-memory systems: some commercial vendors support systems with low-terabyte amounts of RAM (e.g. Teradata and Exalytics), but no (easily) available open source or commercial software supports petabyte-size amounts of RAM.

On a related note:

disclaimer: this posting has quite a few numbers, so the likelihood of errors is > 0; please let me know if you spot one.

Interested in large-scale in-memory key-value stores?
Check out atbr: http://github.com/atbrox/atbr

Source code for this posting?

Best regards,

Amund Tveit, co-founder of Atbrox

Jan 24

Tracer Bullet Development

Tracer Bullet Development means finding the major “moving parts” of a software system and starting by writing just enough code to make those parts interact in a real manner (e.g. with direct API calls, websocket or REST APIs). As the system grows (with actual functionality and not just interaction), you keep the “tracer ammunition” flowing through the system, changing the internal interaction APIs only if needed.

Motivation for Tracer Bullet Development

  1. integration is the hardest word (paraphrase of an old tune)
  2. prevent future integration problems (working internal APIs from the start)
  3. have a working system at all times (though limited in the beginning)
  4. create non-overlapping tasks for software engineers (good management)

(Check out the book: Ship it! A Practical Guide to Successful Software Projects for details about this method)

Examples of Distributed Tracer Bullet Development

Let us assume you have a team of 10 excellent software engineers who have never worked together before, and at the same time the task of creating a first working version of a distributed (backend) system within a short period of time.

How would you solve the project and efficiently utilize all the developers? (i.e. no time for meet&greet offsite)

Splitting the work properly with tracer bullet development could be a start, let’s look at how it could be done for a few examples:

1. Massively Multiplayer Online Games
Massively Multiplayer Online Games, e.g. Zynga’s Farmville, Linden Lab’s Second Life, and BioWare/LucasArts’ Star Wars: The Old Republic, are complex distributed systems. So what could a high-level tracer bullet architecture for such a game look like? Services that might be needed are:

  1. GameWorldService – to deal with the game world, assuming basic function is returning a graphic tile for a position x, y, z
  2. GameArtifactService – to deal with state of various “things” in the world (e.g. weapons/utilities), e.g. growth of plants.
  3. GameEconomyService – to deal with overall in-game economy and trade
  4. AvatarService – to deal with player avatars and non-player characters (monsters/bots) (i.e. active entities that operate in the GameWorldService and can alter the GameArtifactService)
  5. LogService – to log what happens in the game
  6. StatService – calculates/monitors various statistics about the game
  7. AIService – e.g. used by non-player characters for reasoning
  8. UserService – to deal with users (profiles, login/passwords etc, metainfo++)
  9. GameStateService – overall game state
  10. ChatService – for interaction between players
  11. ClientService – to deal with various software clients users use, e.g. ipad client, pc client
  12. CheatMalwareDetectionService – always someone looking to exploit a game

That is already more services (12) than software engineers (10), but let us create the beginning of a draft of a tracer bullet definition in a json-like manner.

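tracerbullets = {
  "GameWorldService": {
    "dependencies": ["GameStateService", "LogService"],
    "defaultresponse": {"tiledata_as_json_base_64": ".."}
  },
  "GameArtifactService": {
    "defaultresponse": {"artifactinfo": ".."}
  }
  // .. the remaining services follow the same pattern
}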



Atbrox’ (internal) Tracer Bullet Development Tool – BabelShark
The game example resembles RPC definitions (e.g. Avro) and various deployment definitions (e.g. Chef or Puppet), but it is focused on specifying enough information (and no more) to get the entire system, initially empty and returning only default responses, up and running with its appropriate host names (all of which can run on one machine for testing, with either minor /etc/hosts file changes or a local DNS server). When the system is running, each service appends the default responses it receives from its dependencies to its own default response, so one can trace the path of e.g. REST/HTTP or websocket calls through the system (e.g. if a call to the GameWorldService uses both GameStateService and LogService, this will be shown in the resulting json from GameWorldService). As the (mock-like) default responses are gradually replaced with real services, the system can be run as before, and once a service is properly deployed you just remove its entry in /etc/hosts or the local DNS server to get real data. Proxying external services (e.g. Amazon Web Services) can be done in a similar manner. Overall, this can make it easier to bridge the development and deployment situations.

In Atbrox we have an internal tool called BabelShark that takes a tracer bullet definition (json) as input and creates Python-based tracer bullet system code (using the websocket support in Bret Taylor’s Tornado), and also creates a corresponding websocket command-line client and javascript/html clients for ease of testing all components in the system. Technically it spawns one Tornado process per service (or per instance of a service if there is more than one), dynamically finds available port numbers and communicates them back, creates a new /etc/hosts file with the requested host names per service (all pointing to localhost), and writes a kill shell script (note: you quickly get a lot of processes this way, so even if the multicores are humming you can quickly overload them, and it is nice to be able to kill them).
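
The tracing idea can be illustrated without any networking. The sketch below is not BabelShark: it replaces the Tornado processes and websocket calls with plain in-process function calls, and the default responses for GameStateService and LogService are made up for the example. It also assumes that the dependency graph has no cycles.

```python
def call(service, definition):
    """Return the service's default response plus the responses of its dependencies."""
    spec = definition[service]
    response = {
        "service": service,
        "response": spec.get("defaultresponse", {}),
        "trace": [],
    }
    for dependency in spec.get("dependencies", []):
        # each dependency's (default) response is appended to the caller's response
        response["trace"].append(call(dependency, definition))
    return response

def path(response):
    """Flatten a nested trace into the order in which services were reached."""
    names = [response["service"]]
    for child in response["trace"]:
        names.extend(path(child))
    return names

tracerbullets = {
    "GameWorldService": {
        "dependencies": ["GameStateService", "LogService"],
        "defaultresponse": {"tiledata_as_json_base_64": ".."},
    },
    # hypothetical entries, only here to make the example self-contained
    "GameStateService": {"dependencies": ["LogService"], "defaultresponse": {"state": ".."}},
    "LogService": {"defaultresponse": {"logged": True}},
}

print(path(call("GameWorldService", tracerbullets)))
```

Replacing one mock with a real implementation then only means changing what that one service returns; the shape of the trace, and therefore the callers, stay the same.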

Example 2. Search Query Handling
The prerequisite for search is the query, and a key task is to (quickly) understand the user’s intention with the query (before actually doing anything with the query, such as looking up results in an index).

A few questions need to be answered about the query: 1) what is the language of the query? 2) is the query spelled correctly in the given language? 3) what is the meaning of the query? 4) does the query have an ambiguous meaning (either wrt. language or interpretation)? 5) what is the most likely meaning among the ambiguous ones? So what could tracer bullet development for this look like?

tracerbullets = {
  "LanguageDeterminator": {
    "dependencies": ["KlingonClassifier", "EnglishClassifier"]
    // ..
  },
  "MeaningDetermination": {
    "dependencies": ["LanguageDeterminator", "NameEntityDeterminator"],
    "defaultresponse": {"meaning": "just a string with no entities"}
  },
  "Disambiguator": {
    "dependencies": ["MeaningDetermination", ".."]
    // specialized for the query: Turkey - is it about the country
    // or about food (i.e. right before thanksgiving)
  }
}
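
A definition like this is also useful for planning the work. As a minimal sketch (the helper names are hypothetical, and the dictionary only repeats the dependencies named in the definition above), the following lists services that are referenced but not yet defined, and an order in which the defined services can be started so that each comes after its dependencies.

```python
def undefined_dependencies(definition):
    """Names that are listed as dependencies but have no entry of their own."""
    missing = set()
    for spec in definition.values():
        for dependency in spec.get("dependencies", []):
            if dependency != ".." and dependency not in definition:
                missing.add(dependency)
    return sorted(missing)

def start_order(definition):
    """Order the defined services so that each comes after the services it depends on."""
    order, visiting = [], set()

    def visit(name):
        if name in order or name not in definition:
            return
        if name in visiting:
            raise ValueError("dependency cycle at %s" % name)
        visiting.add(name)
        for dependency in definition[name].get("dependencies", []):
            visit(dependency)
        visiting.discard(name)
        order.append(name)

    for name in sorted(definition):
        visit(name)
    return order

query_services = {
    "LanguageDeterminator": {"dependencies": ["KlingonClassifier", "EnglishClassifier"]},
    "MeaningDetermination": {"dependencies": ["LanguageDeterminator", "NameEntityDeterminator"]},
    "Disambiguator": {"dependencies": ["MeaningDetermination", ".."]},
}

print(undefined_dependencies(query_services))
print(start_order(query_services))
```

The undefined names are natural candidates for tasks that can be handed out to different engineers, since each only needs to honour its default response to keep the rest of the system running.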



This posting has given an overview of tracer bullet development for a couple of distributed system cases, and has also described how our (internal) tool supports Distributed Tracer Bullet Development.

If you are keen to learn more and work with us here in Atbrox, please check out our jobs page. Atbrox is a bootstrapped startup working on big data (e.g. hadoop and mapreduce) and search (we also work with and own parts of a few other tech startups).

Best regards,

Amund Tveit (@atveit)
Atbrox (@atbrox)
