SimpleDB is a service primarily for storing and querying structured data (can e.g. be used for a product catalog with descriptive features per products, or an academic event service with extracted features such as event dates, locations, organizers and topics). (If one wants “heavier data” in SimpleDB, e.g. video or images, a good approach be to add paths to Hadoop DFS or S3 objects in the attributes instead of storing them directly)
Unstructured Search for SimpleDB
- Support unstructured search with very low maintenance
- Combine structured and unstructured search
- Figure out the feasibility of unstructured search on top of SimpleDB
The Structure of SimpleDB
SimpleDB is roughly a persistent hashtable of hashtables, where each row (a named item in the outer hashtable) has another hashtable with up to 256 key-value pairs (called attributes). The attributes can be 1024 bytes each, so 256 kilobyte totally in the values per row (note: twice that amount if you store data also as part of the keys + 1024 bytes in the item name). Check out Wikipedia for detailed SimpleDB storage characteristics.
Inverted files is a common way of representing indices for unstructured search. In their basic form they (logically) contain a word with a list of pages or files the word occurs on. When a query comes one looks up in the inverted file and finds pages or files where the words in the query occur. (note: if you are curious about inverted file representation check out the survey - Inverted files for text search engines)
One way of representing inverted files on SimpleDB is to map the inverted file on top of the attributes, i.e. have one SimpleDB domain with one word (term), and let the attributes store the list of URLs containing that word. Since each URL contains many words, it can be useful to have a separate SimpleDB domain containing a mapping from hash of URL to URL and use the hash URL in the inverted file (keeps the inverted file smaller). In the draft code we created 250 key-value attributes where each key was a string from “0″ to “249″ and each corresponding value contained hash of URLs (and positions of term) joined with two different string separators. If too little space per item – e.g. for stop words – one could “wrap” the inverted file entry with adding the same term combined with an incremental postfix (note: if that also gave too little space one could also wrap on simpledb domains).
Preliminary query latency results
Warning: Data sets used were NLTK‘s inaugural collection, so far from the biggest.
Conclusion: the results from 1000 fetches of inverted file entries are relatively stable clustered around 0.020s (20 milliseconds), which are promising enough to pursue further (but still early to decide given only tests on small data sets so far). Balancing with using e.g. memcached could be also be explored, in order to get average fetch time even lower.
Preliminary Python code including timing results (this was run on an Fedora large EC2 node somewhere in a US east coast data center).