A large-scale in-memory storage example – social network data

This posting follows up on the price estimates for large-scale, low-latency (RAM-based) storage in my previous posting, Main takeaways from Accel’s Big Data Conference.

Assume you want to store and index large amounts of social network updates, e.g. tweets, in memory.
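To make the setup concrete, here is a minimal sketch of what "store and index in memory" can look like: a hashtable from tweet id to compressed tweet, plus a token-to-ids index. The names `store`, `index`, `add_tweet` and `lookup`, and the two sample tweets, are assumptions made up for illustration; a real system would be distributed and far more memory-efficient than plain Python dicts. It is written in Python 3 syntax.

```python
import json
import zlib
from collections import defaultdict

# Hypothetical in-memory layout (an assumption for illustration):
#   store: tweet id (int) -> zlib-compressed JSON bytes
#   index: token (str)    -> list of tweet ids (the posting list)
store = {}
index = defaultdict(list)


def add_tweet(tweet):
    """Store one tweet dict and index its space-separated tokens."""
    tweet_id = tweet["id"]
    raw = json.dumps(tweet).encode("utf-8")
    store[tweet_id] = zlib.compress(raw)
    # each unique token of the tweet adds one entry to that token's posting list
    for token in set(tweet["text"].lower().split(" ")):
        index[token].append(tweet_id)


def lookup(token):
    """Return the stored tweets whose text contains the token."""
    return [json.loads(zlib.decompress(store[i]).decode("utf-8"))
            for i in index.get(token.lower(), [])]


add_tweet({"id": 1, "text": "RAM is the new disk"})
add_tweet({"id": 2, "text": "Disk is the new tape"})
print([t["id"] for t in lookup("disk")])  # prints [1, 2]
```

The estimates in this posting size exactly these two structures: compressed tweets behind 64 bit keys, and posting lists of 64 bit tweet ids behind token keys.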

1) fetch some tweets

curl "https://stream.twitter.com/1/statuses/sample.json?delimited=length" -u AnyTwitterUser:Password > yourfilename

2) gather some stats about tweets

import json
import zlib

uncompressed_lengths = []
compressed_lengths = []
num_tokens_per_tweet = []
num_unique_tokens_per_tweet = []
num_kept_tweets = 0
all_tokens = []  # length of every token seen, across all kept tweets

for line in open('yourfilename'):
    # skip non-json lines returned by APIs (lengths)
    if not line.startswith("{"):
        continue

    jline = json.loads(line)

    text = jline.get("text", " ").lower()

    # skips - for simplicity - tweets that can't be space-tokenized
    if " " not in text:
        continue

    # tweets with metadata
    uncompressed_lengths.append(len(line))
    compressed_lengths.append(len(zlib.compress(line)))

    # token calculations
    tokens = text.split(" ")
    num_tokens_per_tweet.append(len(tokens))
    num_unique_tokens_per_tweet.append(len(set(tokens)))
    token_lengths = [len(token) for token in tokens]
    all_tokens.extend(token_lengths)

    num_kept_tweets += 1

avg_uncompressed_length = (sum(uncompressed_lengths)+0.0)/num_kept_tweets
avg_compressed_length = (sum(compressed_lengths)+0.0)/num_kept_tweets
avg_num_tokens = (sum(num_tokens_per_tweet)+0.0)/num_kept_tweets
avg_num_unique_tokens = (sum(num_unique_tokens_per_tweet)+0.0)/num_kept_tweets
avg_token_length = (sum(all_tokens)+0.0)/len(all_tokens)

print "average uncompressed length = ", avg_uncompressed_length
print "average compressed length = ", avg_compressed_length
print "average num tokens = ", avg_num_tokens
print "average num unique tokens = ", avg_num_unique_tokens
print "average token length = ", avg_token_length
print "number of tweets = ", num_kept_tweets

Output for my small random tweet sample

average uncompressed length =  2099.60084926
average compressed length =  848.08492569
average num tokens =  8.91507430998
average num unique tokens =  8.33121019108
average token length =  5.44582043344
number of tweets =  471
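
The per-tweet token numbers come from a plain split on single spaces. A minimal sketch of that step as a standalone function follows; `token_stats` is a hypothetical helper name, not part of the script above, and the sample sentence is made up. It is written in Python 3 syntax.

```python
def token_stats(text):
    """Return (num_tokens, num_unique_tokens, token_lengths) for one tweet text."""
    tokens = text.lower().split(" ")
    return len(tokens), len(set(tokens)), [len(t) for t in tokens]


num_tokens, num_unique, lengths = token_stats(
    "RAM is the new disk and disk is the new tape")
print("%d tokens, %d unique, lengths %s" % (num_tokens, num_unique, lengths))
# prints: 11 tokens, 7 unique, lengths [3, 2, 3, 3, 4, 3, 4, 2, 3, 3, 4]
```

One caveat of `split(" ")`: consecutive spaces produce empty tokens (`"a  b"` becomes `['a', '', 'b']`), which pulls the average token length down slightly. Calling `split()` without an argument would avoid that, at the cost of changing the numbers.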

Calculate storage size and RAM cost based on the published number of tweets per day: 340M, ref: thenextweb.

num_tweets_per_day = 340000000
one_gigabyte = 1024*1024*1024
keysize = 64/8 # 64 bit keys

hash_overhead = 2.0/8 # 2 bit overhead, assuming memory-efficient hashtable

storage_per_day_in_gigabytes = num_tweets_per_day*avg_compressed_length/one_gigabyte + num_tweets_per_day*(keysize+hash_overhead)/one_gigabyte

ram_cost_kUSD_per_petabyte_month = 1197
ram_cost_kUSD_per_terabyte_month = ram_cost_kUSD_per_petabyte_month/1000.0
ram_cost_USD_per_terabyte_day = 1000*ram_cost_kUSD_per_terabyte_month/31

storage_per_day_in_terabytes = storage_per_day_in_gigabytes/1024.0
storage_per_week_in_terabytes = 7*storage_per_day_in_terabytes
storage_per_month_in_terabytes = 31*storage_per_day_in_terabytes
storage_per_year_in_terabytes = 365*storage_per_day_in_terabytes

print "storage per day in TB = %f - RAM-cost (per day) %f USD" % (storage_per_day_in_terabytes, storage_per_day_in_terabytes*ram_cost_USD_per_terabyte_day)
print "storage per week in TB = %d - RAM-cost (per day) %f kUSD" % (storage_per_week_in_terabytes, 7*storage_per_day_in_terabytes*ram_cost_USD_per_terabyte_day/1000)
print "storage per month in TB = %d - RAM-cost (per day) %f kUSD - RAM-cost (per year) %f Million USD" % (storage_per_month_in_terabytes, storage_per_month_in_terabytes*ram_cost_USD_per_terabyte_day/1000, 365*storage_per_month_in_terabytes*ram_cost_USD_per_terabyte_day/(1000*1000))
print "storage per year in TB = %d - RAM cost (per day) %f kUSD - RAM cost (per year) %f Million USD" % (storage_per_year_in_terabytes, storage_per_year_in_terabytes*ram_cost_USD_per_terabyte_day/1000, storage_per_year_in_terabytes*ram_cost_USD_per_terabyte_day/(1000*1000)*365)

Output (based on estimates from my small random tweet sample)

storage per day in TB = 0.264803 - RAM-cost (per day) 10.224809 USD
storage per week in TB = 1 - RAM-cost (per day) 0.071574 kUSD
storage per month in TB = 8 - RAM-cost (per day) 0.316969 kUSD - RAM-cost (per year) 0.115694 Million USD
storage per year in TB = 96 - RAM cost (per day) 3.732055 kUSD - RAM cost (per year) 1.362200 Million USD
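
As a cross-check, the same arithmetic can be folded into one function with the units spelled out. This is a sketch that mirrors the conventions of the listing above (1024-based sizes, 1000 terabytes per petabyte for the price, 31-day months). `storage_estimate` and its parameter names are made up for illustration, and it is written in Python 3 syntax.

```python
GIB = 1024.0 ** 3  # bytes per gigabyte, as in the listing


def storage_estimate(num_tweets_per_day, avg_compressed_length,
                     key_bytes=8.0, hash_overhead_bytes=0.25,
                     ram_cost_kusd_per_pb_month=1197.0, days=1):
    """Return (terabytes, USD per day of RAM) for `days` worth of tweets."""
    # compressed tweet + 64 bit key + 2 bit hashtable overhead, all in bytes
    bytes_per_tweet = avg_compressed_length + key_bytes + hash_overhead_bytes
    terabytes = days * num_tweets_per_day * bytes_per_tweet / GIB / 1024.0
    # kUSD per PB-month -> kUSD per TB-month -> USD per TB-month -> USD per TB-day
    usd_per_tb_day = ram_cost_kusd_per_pb_month / 1000.0 * 1000.0 / 31
    return terabytes, terabytes * usd_per_tb_day


for label, days in [("day", 1), ("week", 7), ("month", 31), ("year", 365)]:
    tb, usd_per_day = storage_estimate(340000000, 848.08492569, days=days)
    print("%-5s %8.3f TB  %10.2f USD per day of RAM" % (label, tb, usd_per_day))
```

With the sample averages above, this should reproduce roughly 0.265 TB and 10.22 USD per day for one day of tweets, and about 96.7 TB and 3732 USD per day for a year of tweets. The `%d` format in the listing truncates, which is why a week shows up as 1 TB there rather than about 1.85.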

3) index calculations (upper bound)

# (extremely naive/stupid/easy-to-estimate-with) assumptions:
#   see e.g. http://cis.poly.edu/~hyan/sigIR-position.pdf for more realistic representations
# 1) the unique terms of a single tweet do not occur in other tweets from the same day
# 2) there are no new terms from one day to another,
#    i.e. the posting list per term grows on average by 1 entry (a 64 bit tweet id) every day
# 3) the posting lists are not compressed, i.e. storing 64 bit per list entry
# 4) the tokens themselves are keys
# 5) no ranking/metadata/ngrams etc. for the index
token_key_overhead = 2.0/8
num_tokens_in_index = num_tweets_per_day*avg_num_unique_tokens

# each tweet provides an update to avg_num_unique_tokens entries in index

key_contribution = num_tokens_in_index*(avg_token_length + token_key_overhead)

index_size_per_day = key_contribution + num_tweets_per_day*avg_num_unique_tokens*64/8
index_size_per_week = key_contribution + num_tweets_per_day*avg_num_unique_tokens*7*64/8
index_size_per_month = key_contribution + num_tweets_per_day*avg_num_unique_tokens*31*64/8
index_size_per_year = key_contribution + num_tweets_per_day*avg_num_unique_tokens*365*64/8

# the index sizes above are in bytes, so divide by 1024^4 (not 1024^3) to get terabytes
one_terabyte = 1024.0*1024*1024*1024

index_size_per_day_in_terabytes = index_size_per_day/one_terabyte
index_size_per_week_in_terabytes = index_size_per_week/one_terabyte
index_size_per_month_in_terabytes = index_size_per_month/one_terabyte
index_size_per_year_in_terabytes = index_size_per_year/one_terabyte

# assuming slightly better encoding of posting lists, e.g. average of 1 byte per entry would give
better_encoded = index_size_per_year_in_terabytes/8

print "index size per week in terabytes = %f - RAM-cost (per day) %f kUSD" % (index_size_per_week_in_terabytes, index_size_per_week_in_terabytes*ram_cost_USD_per_terabyte_day/1000)
print "index size per month in terabytes = %f - RAM-cost (per day) %f kUSD" % (index_size_per_month_in_terabytes, index_size_per_month_in_terabytes*ram_cost_USD_per_terabyte_day/1000)
print "index size per year in terabytes = %f - RAM-cost (per day) %f kUSD" % (index_size_per_year_in_terabytes, index_size_per_year_in_terabytes*ram_cost_USD_per_terabyte_day/1000)

print "index size per year in terabytes (better encoding) = %f - RAM-cost (per day) %f kUSD" % (better_encoded, better_encoded*ram_cost_USD_per_terabyte_day/1000)

Index estimate outputs

index size per week in terabytes = 0.158944 - RAM-cost (per day) 0.006137 kUSD
index size per month in terabytes = 0.653583 - RAM-cost (per day) 0.025237 kUSD
index size per year in terabytes = 7.537310 - RAM-cost (per day) 0.291037 kUSD
index size per year in terabytes (better encoding) = 0.942164 - RAM-cost (per day) 0.036380 kUSD
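
The "better encoding" line assumes an average of about 1 byte per posting list entry without saying how to get there. One common way to shrink sorted id lists is to store the gaps between consecutive ids and variable-byte encode each gap. The sketch below illustrates that general technique; it is not taken from the script or from the paper referenced in the assumptions. `encode_postings` and `decode_postings` are hypothetical names, the ids are made up, and it is written in Python 3 syntax.

```python
def encode_postings(sorted_ids):
    """Delta-encode ascending non-negative ids, then variable-byte encode each gap."""
    out = bytearray()
    previous = 0
    for doc_id in sorted_ids:
        gap = doc_id - previous
        previous = doc_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            gap >>= 7
        out.append(gap)  # last byte of this gap, continuation bit clear
    return bytes(out)


def decode_postings(data):
    """Inverse of encode_postings: rebuild the ascending id list."""
    ids, current, gap, shift = [], 0, 0, 0
    for byte in bytearray(data):
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # more bytes follow for this gap
        else:
            current += gap
            ids.append(current)
            gap, shift = 0, 0
    return ids


ids = [1000, 1003, 1010, 1200, 5000]
encoded = encode_postings(ids)
assert decode_postings(encoded) == ids
print("%d bytes instead of %d" % (len(encoded), 8 * len(ids)))  # prints: 8 bytes instead of 40
```

How close this gets to 1 byte per entry depends on how dense the ids in each posting list are: small gaps fit in a single byte, while the large gaps of rare terms need several.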

Conclusion
Keeping one year's worth of tweets (including metadata) and a crude index of them in memory is costly, but not too bad: about 1.36 million USD to keep one year's worth of tweets (124 billion tweets) in a distributed in-memory hashtable for one year, or approximately 3732 USD to keep the same tweets in that hashtable for one day. The index size estimates are very rough (see http://cis.poly.edu/~hyan/sigIR-position.pdf for more realistic representations). Energy costs (to maintain and refresh the RAM) would add a further 5 to 25% (see comments on previous blog post).
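
The headline figures can be sanity-checked with two lines of arithmetic. The sketch below only reuses numbers already quoted in this posting; the per-million-tweets figure is derived here for illustration, and the variable names are made up. It is written in Python 3 syntax.

```python
num_tweets_per_day = 340000000
cost_per_year_usd = 1.3622e6  # RAM cost for a year of tweets, from the storage output

tweets_per_year = 365 * num_tweets_per_day
usd_per_million_tweets = cost_per_year_usd / (tweets_per_year / 1e6)

print("tweets per year: %.1f billion" % (tweets_per_year / 1e9))
print("USD per million tweets kept in RAM for a year: %.2f" % usd_per_million_tweets)
# prints 124.1 billion and 10.98
```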

Q: So, is it time to reconsider using hard drives and SSDs and consider going for RAM instead?
A: Yes, at least consider it and combine it with Hadoop. Check out Stanford’s RAMCloud project and their paper, The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM. There is still plenty of room for innovation in very-large-scale in-memory systems: some commercial vendors support systems with low-terabyte amounts of RAM (e.g. Teradata and Exalytics), but no (easily) available open source or commercial software supports petabyte-size amounts of RAM.

On a related note:

Disclaimer: this posting has quite a few numbers, so the likelihood of errors is > 0. Please let me know if you spot one.

Interested in large-scale in-memory key-value stores?
Check out atbr: http://github.com/atbrox/atbr

Source code for this posting?
https://github.com/atbrox/atbr/blob/master/blogposts/tweet_in_memory.py

Best regards,

Amund Tveit, co-founder of Atbrox
