My presentation held at the O’Reilly Strata Conference in London, UK, on October 1st 2012:
Best regards,
Amund Tveit
Atbrox is participating in the O’Reilly Strata Conference in London on October 1st and 2nd and will hold a Hadoop/MapReduce algorithm-related presentation there. If you are at the conference and would like to meet Atbrox, send an email to info@atbrox.com.
Best regards,
Amund Tveit
This posting is a follow-up to the large-scale, low-latency (RAM-based) storage price estimates in my previous posting, Main takeaways from Accel’s Big Data Conference.
Assume you were to store and index large amounts of social network updates in-memory, e.g. tweets.
1) fetch some tweets
curl https://stream.twitter.com/1/statuses/sample.json?delimited=length -uAnyTwitterUser:Password > yourfilename
2) gather some stats about tweets
import json
import zlib

all_tokens = []
uncompressed_lengths = []
compressed_lengths = []
num_tokens_per_tweet = []
num_unique_tokens_per_tweet = []
num_kept_tweets = 0

for line in file('yourfilename'):
    # skip non-json lines returned by APIs (lengths)
    if not line.startswith("{"):
        continue
    jline = json.loads(line)
    text = jline.get("text", " ").lower()
    # skips - for simplicity - tweets that can't be space-tokenized
    if not " " in text:
        continue
    # tweets with metadata
    uncompressed_lengths.append(len(line))
    compressed_lengths.append(len(zlib.compress(line)))
    # token calculations
    tokens = text.split(" ")
    num_tokens_per_tweet.append(len(tokens))
    num_unique_tokens_per_tweet.append(len(set(tokens)))
    token_lengths = [len(token) for token in tokens]
    all_tokens.extend(token_lengths)
    num_kept_tweets += 1

avg_uncompressed_length = (sum(uncompressed_lengths)+0.0)/num_kept_tweets
avg_compressed_length = (sum(compressed_lengths)+0.0)/num_kept_tweets
avg_num_tokens = (sum(num_tokens_per_tweet)+0.0)/num_kept_tweets
avg_num_unique_tokens = (sum(num_unique_tokens_per_tweet)+0.0)/num_kept_tweets
avg_token_length = (sum(all_tokens)+0.0)/len(all_tokens)

print "average uncompressed length = ", avg_uncompressed_length
print "average compressed length = ", avg_compressed_length
print "average num tokens = ", avg_num_tokens
print "average num unique tokens = ", avg_num_unique_tokens
print "average token length = ", avg_token_length
print "number of tweets = ", num_kept_tweets

Output for my small random tweet sample
average uncompressed length = 2099.60084926
average compressed length = 848.08492569
average num tokens = 8.91507430998
average num unique tokens = 8.33121019108
average token length = 5.44582043344
number of tweets = 471

Calculate based on published amounts of tweets – 340M tweets per day, ref: thenextweb.
num_tweets_per_day = 340000000
one_gigabyte = 1024*1024*1024
keysize = 64/8 # 64 bit keys
hash_overhead = 2.0/8 # 2 bit overhead, assuming memory-efficient hashtable
storage_per_day_in_gigabytes = num_tweets_per_day*avg_compressed_length/one_gigabyte + num_tweets_per_day*(keysize+hash_overhead)/one_gigabyte

ram_cost_kUSD_per_petabyte_month = 1197
ram_cost_kUSD_per_terabyte_month = ram_cost_kUSD_per_petabyte_month/1000.0
ram_cost_USD_per_terabyte_day = 1000*ram_cost_kUSD_per_terabyte_month/31

storage_per_day_in_terabytes = storage_per_day_in_gigabytes/1024.0
storage_per_week_in_terabytes = 7*storage_per_day_in_terabytes
storage_per_month_in_terabytes = 31*storage_per_day_in_terabytes
storage_per_year_in_terabytes = 365*storage_per_day_in_terabytes

print "storage per day in TB = %f - RAM-cost (per day) %f USD" % (storage_per_day_in_terabytes, storage_per_day_in_terabytes*ram_cost_USD_per_terabyte_day)
print "storage per week in TB = %d - RAM-cost (per day) %f kUSD" % (storage_per_week_in_terabytes, 7*storage_per_day_in_terabytes*ram_cost_USD_per_terabyte_day/1000)
print "storage per month in TB = %d - RAM-cost (per day) %f kUSD - RAM-cost (per year) %f Million USD" % (storage_per_month_in_terabytes, storage_per_month_in_terabytes*ram_cost_USD_per_terabyte_day/1000, 365*storage_per_month_in_terabytes*ram_cost_USD_per_terabyte_day/(1000*1000))
print "storage per year in TB = %d - RAM cost (per day) %f kUSD - RAM cost (per year) %f Million USD" % (storage_per_year_in_terabytes, storage_per_year_in_terabytes*ram_cost_USD_per_terabyte_day/1000, storage_per_year_in_terabytes*ram_cost_USD_per_terabyte_day/(1000*1000)*365)

Output (based on estimates from my small random tweet sample)
storage per day in TB = 0.264803 - RAM-cost (per day) 10.224809 USD
storage per week in TB = 1 - RAM-cost (per day) 0.071574 kUSD
storage per month in TB = 8 - RAM-cost (per day) 0.316969 kUSD - RAM-cost (per year) 0.115694 Million USD
storage per year in TB = 96 - RAM cost (per day) 3.732055 kUSD - RAM cost (per year) 1.362200 Million USD

3. Index calculations (upper bound)
# (extremely naive/stupid/easy-to-estimate-with) assumptions:
# see e.g. http://cis.poly.edu/~hyan/sigIR-position.pdf for more realistic representations
# 1) the unique terms of a single tweet do not occur in any other tweet
# 2) there are no new terms from one day to another,
#    i.e. the posting list per term increases on average by 1 (64 bit tweet id) every day
# 3) the posting lists are not compressed, i.e. storing 64 bit per list entry
# 4) the tokens themselves are keys
# 5) no ranking/metadata/ngrams etc. for the index

token_key_overhead = 2.0/8
num_tokens_in_index = num_tweets_per_day*avg_num_unique_tokens # each tweet provides an update to avg_num_unique_tokens entries in the index
key_contribution = num_tokens_in_index*(avg_token_length + token_key_overhead)

index_size_per_day = key_contribution + num_tweets_per_day*avg_num_unique_tokens*64/8
index_size_per_week = key_contribution + num_tweets_per_day*avg_num_unique_tokens*7*64/8
index_size_per_month = key_contribution + num_tweets_per_day*avg_num_unique_tokens*31*64/8
index_size_per_year = key_contribution + num_tweets_per_day*avg_num_unique_tokens*365*64/8

index_size_per_day_in_terabytes = index_size_per_day/(1024*1024*1024)
index_size_per_week_in_terabytes = index_size_per_week/(1024*1024*1024)
index_size_per_month_in_terabytes = index_size_per_month/(1024*1024*1024)
index_size_per_year_in_terabytes = index_size_per_year/(1024*1024*1024)

# assuming slightly better encoding of posting lists, e.g. an average of 1 byte per entry, would give
better_encoded = index_size_per_year_in_terabytes/8

print "index size per week in terabytes = %f - RAM-cost (per day) %f kUSD" % (index_size_per_week_in_terabytes, index_size_per_week_in_terabytes*ram_cost_USD_per_terabyte_day/1000)
print "index size per month in terabytes = %f - RAM-cost (per day) %f kUSD" % (index_size_per_month_in_terabytes, index_size_per_month_in_terabytes*ram_cost_USD_per_terabyte_day/1000)
print "index size per year in terabytes = %f - RAM-cost (per day) %f kUSD" % (index_size_per_year_in_terabytes, index_size_per_year_in_terabytes*ram_cost_USD_per_terabyte_day/1000)
print "index size per year in terabytes (better encoding) = %f - RAM-cost (per day) %f kUSD" % (better_encoded, better_encoded*ram_cost_USD_per_terabyte_day/1000)

Index estimate outputs
index size per week in terabytes = 162.758202 - RAM-cost (per day) 6.284567 kUSD
index size per month in terabytes = 669.268602 - RAM-cost (per day) 25.842404 kUSD
index size per year in terabytes = 7718.205009 - RAM-cost (per day) 298.022303 kUSD
index size per year in terabytes (better encoding) = 964.775626 - RAM-cost (per day) 37.252788 kUSD

Conclusion
Keeping one year’s worth of tweets (including metadata) plus a crude index of them in-memory is costly, but not prohibitively so: roughly 1.36 million USD to keep one year’s worth of tweets (about 124 billion tweets) for a year in a distributed in-memory hashtable (storing the same amount of tweets in the same hashtable for a single day costs approximately 3,732 USD). The index size estimates are very rough (check out this paper for more realistic representations). The energy costs of maintaining and refreshing the RAM would add roughly 5-25% (see the comments on the previous blog post).
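A quick back-of-the-envelope check of the headline figures above, using the published tweet volume and the per-day cost printed by the storage calculation:

num_tweets_per_day = 340000000
ram_cost_per_day_usd = 3732.055  # "RAM cost (per day)" from the storage output above
print "tweets per year = %.1f billion" % (num_tweets_per_day*365/1e9)               # ~124.1 billion
print "RAM cost for one year = %.2f million USD" % (ram_cost_per_day_usd*365/1e6)   # ~1.36 million USD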
Q: So, is it time to reconsider hard drives and SSDs and go for RAM instead?
A: Yes, at least consider it and combine it with Hadoop. Check out Stanford’s RAMCloud project and their paper The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM. There is still plenty of room for innovation in very-large-scale in-memory systems: a few commercial vendors support systems with low-terabyte amounts of RAM (e.g. Teradata and Exalytics), but no (easily) available open source or commercial software supports petabyte-size amounts of RAM.

On a related note:
Disclaimer: this posting has quite a few numbers, so the likelihood of errors is > 0 – please let me know if you spot one.
Interested in large-scale in-memory key-value stores?
Check out atbr – http://github.com/atbrox/atbr
Source code for this posting?
https://github.com/atbrox/atbr/blob/master/blogposts/tweet_in_memory.py
Best regards,
Amund Tveit, co-founder of Atbrox
I attended Accel Partners’ Big Data conference last week. It was a good event with many interesting people; a very crude estimate of the distribution: 1/3 VCs/investors, 1/3 startup tech people, 1/3 big-corp tech people.
My personal 2 key takeaways from the conference:
I think Andy Bechtolsheim’s table titled “Memory Hierarchy is Not Changing” sums up the low-latency storage discussion quite well. I’ve taken the liberty of adding a column with rough prices per petabyte-month (calculation: estimated purchase price divided by 12; note that this covers only the storage itself, not the hardware/network needed to run it) for RAM and SSD, which are the only options fit for both low latency AND big data. Note: I think Mr. Bechtolsheim could have listed sizes up to petabytes for both SSD and RAM as well.
| Type of memory | Size | Latency | $ per Petabyte-month* (k$) |
|---|---|---|---|
| L1 cache | 64 KB | ~4 cycles (2 ns) | |
| L2 cache | 256 KB | ~10 cycles (5 ns) | |
| L3 cache (shared) | 8 MB | 35-40+ cycles (20 ns) | |
| Main memory | GBs up to terabytes | 100-400 cycles | 411 (non-ECC) / 1,197 (ECC) |
| Solid state memory | GBs up to terabytes | 5,000 cycles | 94 |
| Disk | Up to petabytes | 1,000,000 cycles | |
RAM (non-ECC): 16GB non-ECC (2x8GB) – price: $79, i.e. $79/16 per GB, $(79/16)K per TB, $(79/16)M per PB, $(79/16)M/12 per PB-month
RAM (ECC): 16GB ECC (1x16GB) – price: $229.98, i.e. $230/16 per GB, $(230/16)K per TB, $(230/16)M per PB, $(230/16)M/12 per PB-month.
SSD: 512GB – price $579.99, i.e. $580/512 per GB, $(580/512)K per TB, $(580/512)M per PB, $(580/512)M/12 per PB-month.
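As a small sanity-check sketch of the asterisked column above (using the same rough decimal scaling as the price breakdowns: GB to TB to PB in factors of 1000, purchase price amortized over 12 months), the figures and the RAM/SSD cost ratios can be reproduced like this:

def kusd_per_petabyte_month(price_usd, capacity_gb):
    # purchase price scaled from the component's capacity to one petabyte (10^6 GB),
    # amortized over 12 months, expressed in kUSD
    usd_per_gb = price_usd/float(capacity_gb)
    usd_per_pb = usd_per_gb*1000*1000
    return usd_per_pb/12/1000

ram_non_ecc = kusd_per_petabyte_month(79.0, 16)    # ~411 kUSD per PB-month
ram_ecc = kusd_per_petabyte_month(229.98, 16)      # ~1,198 kUSD per PB-month (close to the 1,197 in the table)
ssd = kusd_per_petabyte_month(579.99, 512)         # ~94 kUSD per PB-month
print "RAM (non-ECC) = %.0f kUSD, RAM (ECC) = %.0f kUSD, SSD = %.0f kUSD per PB-month" % (ram_non_ecc, ram_ecc, ssd)
print "RAM is %.1f (non-ECC) to %.1f (ECC) times more expensive than SSD" % (ram_non_ecc/ssd, ram_ecc/ssd)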
Since RAM-based storage is up to 50 times faster than SSD (latency-wise) but only roughly 4.3 to 12 times more expensive, it is likely to move high on the agenda in settings where latency matter$ (all types of serving infrastructure, search, finance etc.). In absolute terms, the cost of a petabyte of RAM is now within reach for all Fortune 1000 companies, i.e. about $1.2M per month for the storage alone (ECC RAM). One interesting aspect of going RAM-only is that most systems built on SSDs or disks already have a big RAM component in addition, e.g. memcached or the caches of various NoSQL storages; by moving to RAM-only, things might become simpler (i.e. no memory-vs-disk/SSD coherency to deal with, and no latency spikes when the memory cache is missed).
Note 1: If you have other sources for interesting large-scale RAM and SSD prices, I would appreciate it if you could add links to them in the comments below.
Note 2: If you’re interested in large-scale RAM-based key-value stores, check out our opensource project Atbr – github page: https://github.com/atbrox/atbr
Best regards,
Amund Tveit, co-founder of Atbrox (@atbrox)

atbr (a large-scale, low-latency in-memory key-value store) now supports Apache Thrift for easier integration with other Hadoop services.
Thrift Example
Checkout and install atbr
$ git clone git@github.com:atbrox/atbr.git
$ cd atbr
$ sudo ./INSTALL.sh
Prerequisite: install/compile Apache Thrift – http://thrift.apache.org/
Compile the atbr thrift server and connect with the Python client
$ cd atbrthrift
$ make
$ ./atbr_thrift_server # c++ server
$ python test_atbr_thrift_client.py
Python thrift api example
# connect to the atbr thrift server (started above) on localhost, port 9090
from atbr_thrift_client import connect_to_atbr_thrift_service
service = connect_to_atbr_thrift_service("localhost", "9090")
# load tab-separated key-value data into the in-memory store
service.load("keyvaluedata.tsv")
# look up the value for a single key
value = service.get("key1")
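As a small illustrative sketch, here is one way the client could be used for a quick bulk-lookup smoke test. It uses only the connect/load/get calls shown above; the file name and keys are placeholders assumed to exist in your data.

import time
from atbr_thrift_client import connect_to_atbr_thrift_service

service = connect_to_atbr_thrift_service("localhost", "9090")
service.load("keyvaluedata.tsv")    # tab-separated key/value lines (placeholder file name)
keys = ["key1", "key2", "key3"]     # placeholder keys assumed present in the file
start = time.time()
for key in keys:
    print key, "=>", service.get(key)
print "looked up %d keys in %.4f seconds" % (len(keys), time.time() - start)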
Stay tuned for other updates on atbr.
Rough roadmap
Documentation
atbr.atbrox.com
Best regards,
Amund Tveit