Oct 27

SimpleDB is a service primarily for storing and querying structured data. It can be used for, e.g., a product catalog with descriptive features per product, or an academic event service with extracted features such as event dates, locations, organizers and topics. (If one wants “heavier data” in SimpleDB, e.g. video or images, a good approach would be to add paths to Hadoop DFS or S3 objects in the attributes instead of storing the data directly.)

Unstructured Search for SimpleDB

This posting presents an approach for adding (flexible) unstructured search support to SimpleDB (with some preliminary query latency numbers below, and very preliminary Python code). The motivation is to:
  1. Support unstructured search with very low maintenance
  2. Combine structured and unstructured search
  3. Figure out the feasibility of unstructured search on top of SimpleDB

The Structure of SimpleDB

SimpleDB is roughly a persistent hashtable of hashtables, where each row (a named item in the outer hashtable) has another hashtable with up to 256 key-value pairs (called attributes). Each attribute value can be up to 1024 bytes, so a row can hold 256 kilobytes of values in total (note: twice that amount if you also store data as part of the keys, plus 1024 bytes in the item name). Check out Wikipedia for detailed SimpleDB storage characteristics.
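
As a rough illustration of that structure, here is a minimal in-memory sketch using plain Python dictionaries. It does not talk to SimpleDB at all; the helper name put_attribute and the sample item are made up for this example, and the model is simplified to one value per key, following the description above.

```python
# Limits quoted in the text above; the helper names are hypothetical.
MAX_ATTRIBUTES_PER_ITEM = 256
MAX_VALUE_BYTES = 1024


def put_attribute(domain, item_name, key, value):
    """Store one key-value attribute on a named item, enforcing the limits."""
    if len(value.encode("utf-8")) > MAX_VALUE_BYTES:
        raise ValueError("attribute value exceeds 1024 bytes")
    item = domain.setdefault(item_name, {})
    if key not in item and len(item) >= MAX_ATTRIBUTES_PER_ITEM:
        raise ValueError("item already has 256 attributes")
    item[key] = value


# A domain is the outer hashtable; each item is an inner hashtable.
domain = {}
put_attribute(domain, "event-42", "location", "Trondheim")
put_attribute(domain, "event-42", "topic", "information retrieval")

# Upper bound on attribute value bytes per item: 256 * 1024 bytes (256 KB).
max_value_bytes_per_item = MAX_ATTRIBUTES_PER_ITEM * MAX_VALUE_BYTES
```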

Inverted files

Inverted files are a common way of representing indices for unstructured search. In their basic form they (logically) contain each word together with a list of the pages or files the word occurs in. When a query arrives, one looks up its words in the inverted file and finds the pages or files where those words occur. (Note: if you are curious about inverted file representations, check out the survey “Inverted files for text search engines”.)
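
To make the basic form concrete, here is a small self-contained sketch of an in-memory inverted file. The function names and sample pages are invented for illustration, tokenization is a naive whitespace split, and requiring all query words to match (AND semantics) is an assumption of this sketch rather than something prescribed above.

```python
from collections import defaultdict


def build_inverted_file(documents):
    """Map each word to the sorted list of document names it occurs in."""
    postings = defaultdict(set)
    for name, text in documents.items():
        for word in text.lower().split():
            postings[word].add(name)
    return {word: sorted(names) for word, names in postings.items()}


def search(inverted_file, query):
    """Return documents containing every word in the query (AND semantics)."""
    words = query.lower().split()
    if not words:
        return []
    result = set(inverted_file.get(words[0], []))
    for word in words[1:]:
        result &= set(inverted_file.get(word, []))
    return sorted(result)


documents = {
    "page1": "simple storage for structured data",
    "page2": "unstructured search over structured data",
}
index = build_inverted_file(documents)
```

With this index, search(index, "structured data") matches both pages, while search(index, "unstructured search") matches only page2.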

One way of representing inverted files on SimpleDB is to map the inverted file on top of the attributes, i.e. have one SimpleDB domain with one item per word (term), and let the attributes store the list of URLs containing that word. Since each URL contains many words, it can be useful to have a separate SimpleDB domain containing a mapping from hash of URL to URL, and to use the hash of the URL in the inverted file (this keeps the inverted file smaller). In the draft code we created 250 key-value attributes where each key was a string from “0” to “249” and each corresponding value contained hashes of URLs (and positions of the term) joined with two different string separators. If one item offers too little space, e.g. for stop words, one could “wrap” the inverted file entry by adding further items named by the same term combined with an incremental postfix (note: if that also gave too little space, one could wrap across SimpleDB domains as well).
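
The packing scheme described above could look roughly like the following sketch. It is not the draft code itself: the separators, the truncated MD5 hash, the “#” postfix format and all function names are assumptions chosen for illustration, and the result is a plain dictionary rather than actual SimpleDB calls.

```python
import hashlib

NUM_ATTRIBUTES = 250      # keys "0" .. "249", as in the draft described above
MAX_VALUE_BYTES = 1024    # attribute value limit mentioned earlier
POSTING_SEP = ";"         # hypothetical separator between postings
POSITION_SEP = ","        # hypothetical separator inside one posting


def url_hash(url):
    """Short hash of a URL; the full URL would live in a separate domain."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()[:8]


def encode_posting(url, positions):
    """Encode one posting as '<url hash>,<pos>,<pos>,...' (ASCII only)."""
    return POSITION_SEP.join([url_hash(url)] + [str(p) for p in positions])


def pack_term(term, postings):
    """Pack postings for one term into items of at most 250 attributes each.

    Returns {item_name: {attribute_key: value}}. Overflow items get an
    incremental postfix: term, term#1, term#2, ...
    A single posting longer than the value limit is not split in this sketch.
    """
    values, current = [], ""
    for url, positions in postings:
        encoded = encode_posting(url, positions)
        candidate = encoded if not current else current + POSTING_SEP + encoded
        if len(candidate) > MAX_VALUE_BYTES and current:
            values.append(current)   # current value is full, start a new one
            current = encoded
        else:
            current = candidate
    if current:
        values.append(current)

    items = {}
    for start in range(0, len(values), NUM_ATTRIBUTES):
        chunk = values[start:start + NUM_ATTRIBUTES]
        suffix = start // NUM_ATTRIBUTES
        name = term if suffix == 0 else "%s#%d" % (term, suffix)
        items[name] = {str(i): v for i, v in enumerate(chunk)}
    return items
```

A rare term fits in a single attribute of a single item, while a stop word with many postings spills over into additional items with postfixed names.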

Preliminary query latency results

Warning: the data set used was NLTK’s inaugural collection, which is far from the biggest.

Inverted File Entry Fetch latency Distribution (in seconds)

Conclusion: the results from 1000 fetches of inverted file entries are relatively stable, clustered around 0.020s (20 milliseconds), which is promising enough to pursue further (but it is still too early to decide, given that tests have only been run on small data sets so far). Combining this with e.g. memcached could also be explored, in order to get the average fetch time even lower.
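
The caching idea could be explored with something like the sketch below. Everything in it is a stand-in: an in-process dictionary plays the role of memcached, slow_fetch merely sleeps for about 20 milliseconds instead of calling SimpleDB, and the class and function names are hypothetical.

```python
import time


class CachedFetcher:
    """Read-through cache in front of a slower inverted-file fetch function."""

    def __init__(self, fetch):
        self._fetch = fetch      # hypothetical callable: term -> entry
        self._cache = {}         # stand-in for memcached
        self.hits = 0
        self.misses = 0

    def get(self, term):
        if term in self._cache:
            self.hits += 1
            return self._cache[term]
        self.misses += 1
        entry = self._fetch(term)
        self._cache[term] = entry
        return entry


def slow_fetch(term):
    """Stand-in for a SimpleDB fetch; sleeps for about 20 ms."""
    time.sleep(0.020)
    return {"0": "postings-for-" + term}


def average_latency(fetcher, terms):
    """Average wall-clock seconds per fetch over the given terms."""
    start = time.perf_counter()
    for term in terms:
        fetcher.get(term)
    return (time.perf_counter() - start) / len(terms)
```

How much such a cache helps would depend on how skewed the term distribution of real queries is, which the small test above does not measure.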

Preliminary Python code including timing results (this was run on a large EC2 node running Fedora, somewhere in a US east coast data center).

Oct 07

Sometimes it can be useful to compile Python code for Amazon’s Elastic Mapreduce into C++ and then into a binary. The motivation could be to integrate with (existing) C or C++ code, or to increase performance for CPU-intensive mapper or reducer methods. Here follows a description of how to do that:

  1. Start a small EC2 node with AMI similar to the one Elastic Mapreduce is using (Debian Lenny Linux)
  2. Skim quickly through the Shedskin tutorial
  3. Log into the EC2 node and install the Shedskin Python compiler
  4. Write your Python mapper or reducer program and compile it into C++ with Shedskin
    • E.g. the command python ss.py mapper.py would generate the C++ files mapper.hpp and mapper.cpp, a Makefile and an annotated Python file mapper.ss.py.
  5. Optionally update the C++ code generated by Shedskin to use other C or C++ libraries
    • note: with Fortran-to-C you can probably integrate your Python code with existing Fortran code (e.g. numerical/high performance computing libraries). Similarly for Cobol (e.g. in the financial industry) with OpenCobol (compiling Cobol into C). Please let us know if you try this or need help with it.
  6. Add -static as the first CCFLAGS parameter in the generated Makefile to make it a static executable
  7. Compile the C++ code into a binary with make and check that you don’t get a dynamic executable with ldd (you want a static executable)
  8. Run strip on the binary to make it smaller
  9. Upload your (ready) binary to a chosen location in Amazon S3
  10. Read the Elastic Mapreduce documentation on how to use the binary to run Elastic Mapreduce jobs.
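
For step 4, a mapper could be as simple as the following word-count sketch. It is only an illustration: the function names are made up, the code sticks to plain lists and strings in the hope of staying within the Python subset Shedskin accepts, and it has not been verified against any particular Shedskin version.

```python
import sys


def map_line(line):
    """Turn one input line into 'word<TAB>1' records (word count mapper)."""
    records = []
    for word in line.strip().lower().split():
        records.append(word + "\t1")
    return records


def run(lines, write):
    """Feed every input line through the mapper and write out the records."""
    for line in lines:
        for record in map_line(line):
            write(record + "\n")


# As a streaming mapper, the script would end with a call such as:
#     run(sys.stdin, sys.stdout.write)
```

Keeping the per-line logic in map_line makes it easy to test the mapper in plain Python before compiling it.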

Note: if you skip the Shedskin-related steps, this approach also works if you want to use C or C++ mappers or reducers with Elastic Mapreduce.

Note: this approach should probably work also with Cloudera’s distribution for Hadoop.


Do you need help with Hadoop/Mapreduce?
A good start could be to read this book, or contact Atbrox (info@atbrox.com) if you need help with development or parallelization of algorithms for Hadoop/Mapreduce. See our posting for an example of parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce.
Sep 21

If you are new to virtualenv, Fabric or pip, Alex Clemesha’s excellent “Tools of the Modern Python Hacker” is a must-read.

In short: virtualenv lets you switch seamlessly between isolated Python environments, Fabric automates remote deployment, while pip takes care of installing required packages and dependencies. If you have ever had to wrestle with more than one development project at the same time, then virtualenv is one of those tools that, once mastered, you can’t see yourself living without. Fabric and pip are somewhat immature, but still highly useful in their present shapes. It is likely that you will end up learning them anyway. Best of all, these three tools play very nicely together.

Except on Cygwin.

Here at Atbrox, we spend quite a lot of our time on Windows platforms. While Cygwin adds a fair amount of unix functionality to Windows, configuring certain applications can be difficult. This article describes the steps we go through to get an operational virtualenv, Fabric and pip setup on Windows Vista. It also gives you a brief taster of how virtualenv and Fabric work.

Step 1 – Install Cygwin: If you haven’t already, Cygwin can be installed from this page. Click the “View” button once to get a full list of available packages. Make sure to include at least the following packages (the numbers in the parentheses indicate the versions used at the time of writing):

  • python (2.5.2-1)
  • python-paramiko (1.7.4-1)
  • python-crypto (2.0.1-1)
  • gcc (3.4.4-999)
  • wget (1.11.4-3)
  • openssh (5.1p1-10)

Now would also be a good time to install other common packages such as vim, git, etc.—but you can always go back and install them at a later time.

Note that we are using Cygwin Python rather than the standard Windows Python. I had nothing but trouble trying to get Windows Python to play nicely along with virtualenv and Fabric, so this is a compromise. The downside is that you are stuck with a rather dated and somewhat buggy version of Python. If someone manages to get this setup working with Windows Python, then let me know!

Step 2 – Get paramiko working: The python-paramiko and python-crypto packages are required to get Fabric deployment over SSH working properly. If you are lucky, paramiko should work out of the box. If you don’t get the following error message when importing paramiko then skip the rest of this step:

$ python
Python 2.5.2 (r252:60911, Dec  2 2008, 09:26:14)
[GCC 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import paramiko
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "__init__.py", line 69, in <module>
 File "transport.py", line 32, in <module>
 File "util.py", line 31, in <module>
 File "common.py", line 101, in <module>
 File "rng.py", line 69, in __init__
 File "randpool.py", line 87, in __init__
 File "randpool.py", line 120, in _randomize
IOError: [Errno 0] Error

According to the discussion here, this appears to be a lingering Cygwin bug. The workaround is to change line 120 in /usr/lib/python2.5/site-packages/Crypto/Util/randpool.py from


if num!=2 : raise IOError, (num, msg)

to

if num!=2 and num!=0 : raise IOError, (num, msg)

Paramiko should now import without any complaints.

Step 3 – Install setuptools: Setuptools is required for installing the rest of the required Python packages. Instructions for Cygwin are found on the setuptools pages, but just enter the following and you’ll be all set:

$ wget http://pypi.python.org/packages/2.5/s/setuptools/setuptools-0.6c9-py2.5.egg
$ sh setuptools-0.6c9-py2.5.egg

Step 4 – Install pip, virtualenv and virtualenvwrapper: We haven’t said anything about virtualenvwrapper so far. This extension to virtualenv streamlines working with multiple environments and is highly recommended:

$ easy_install pip
$ easy_install virtualenv
$ easy_install virtualenvwrapper
$ mkdir ~/.virtualenvs

That last line creates a working directory for your virtual Python environments. When e.g. working with an environment named myenv, all packages will be installed in ~/.virtualenvs/myenv.

I find it useful to create and activate a default environment called sandbox. This helps prevent packages from being installed into the default Python site-packages. It’s a good strategy in general to avoid polluting the main package directory, so that almost all package installations are per project and virtual environment. Run the following commands to create the sandbox environment:

$ export WORKON_HOME=$HOME/.virtualenvs
$ export PIP_VIRTUALENV_BASE=$WORKON_HOME
$ source /usr/bin/virtualenvwrapper_bashrc
$ mkvirtualenv sandbox

mkvirtualenv is a virtualenvwrapper command that creates the given environment. If you get an IOError: [Errno 2] No such file or directory: '/usr/local/bin/python2.5' you will have to add a symbolic link to the Python executable:

$ ln -s /usr/bin/python2.5.exe /usr/bin/python2.5

Note that whenever you execute a shell command, the bash prompt will remind you of the active environment:

$ echo "foo"
foo
(sandbox)

To make the sandbox activation permanent, append the following lines to your ~/.bashrc:

export WORKON_HOME=$HOME/.virtualenvs
export PIP_VIRTUALENV_BASE=$WORKON_HOME
source /usr/bin/virtualenvwrapper_bashrc
workon sandbox

workon is another virtualenvwrapper command; it switches you to the given environment. To get a full list of available environments, type workon without an argument. Other useful commands are deactivate to step out of the currently active environment, and rmvirtualenv to delete an environment. Refer to the virtualenvwrapper documentation for the whole story.

As a sanity check, try exiting and restarting the Cygwin shell. If you have paid attention so far, you should now automatically end up in the sandbox environment.

Step 5 – Install Fabric: From this point on, all installed packages, including Fabric, will end up in a virtual environment. Fabric is undergoing a major rewrite right now, so given that its interface is quite unstable it is preferable to have a per-project installation anyway.

First we create a test environment named myproject:

$ mkvirtualenv myproject

We have to make some modifications to the Fabric source code, so we can’t use pip for installing it. Make sure to use version 0.9 or higher, as version 0.1 is already quite outdated:

$ mkdir ~/tmp
$ cd ~/tmp
$ wget http://git.fabfile.org/cgit.cgi/fabric/snapshot/fabric-0.9b1.tar.gz
$ tar xzf fabric-0.9b1.tar.gz
$ cd fabric-0.9b1

Fabric is run using the fab command, but if we try to install it as is, the following error might show up:

$ fab
Traceback (most recent call last):
 File "/home/brox/.virtualenvs/myproject/bin/fab", line 8, in <module>
   load_entry_point('Fabric==0.1.1', 'console_scripts', 'fab')()
 File "/home/brox/.virtualenvs/myproject/lib/python2.5/site-packages/setuptools
-0.6c9-py2.5.egg/pkg_resources.py", line 277, in load_entry_point
 File "/home/brox/.virtualenvs/myproject/lib/python2.5/site-packages/setuptools
-0.6c9-py2.5.egg/pkg_resources.py", line 2180, in load_entry_point
 File "/home/brox/.virtualenvs/myproject/lib/python2.5/site-packages/setuptools
-0.6c9-py2.5.egg/pkg_resources.py", line 1913, in load
 File "/home/brox/.virtualenvs/myproject/lib/python2.5/site-packages/fabric.py"
, line 53, in <module>
   import win32api
ImportError: No module named win32api

At the time of writing there is a small bug in Fabric that is likely to be fixed in the near future. For now you have to manually modify the file fabric/state.py before you install. Change the line that says

win32 = sys.platform in ['win32', 'cygwin']

to

win32 = sys.platform in ['win32']

This is just to tell Fabric that Cygwin isn’t really Windows and that the win32api module therefore isn’t available. Having made the necessary change, do a regular installation from source:

$ python setup.py install

The following error message about paramiko not being found might pop up; just ignore it:

local packages or download links found for paramiko==1.7.4
error: Could not find suitable distribution for Requirement.parse('paramiko==1.7.4')
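
As an aside on the fabric/state.py change above: on Cygwin Python, sys.platform reports 'cygwin', while native Windows Python reports 'win32'. The sketch below illustrates the idea behind the one-line fix in isolation; it is not Fabric's code, and both function names are hypothetical.

```python
import sys


def is_native_windows(platform=None):
    """True only for native Windows Python, not for Cygwin Python."""
    if platform is None:
        platform = sys.platform
    return platform in ['win32']


def optional_win32api(platform=None):
    """Import win32api only where it can exist; return None elsewhere."""
    if not is_native_windows(platform):
        return None
    try:
        import win32api
    except ImportError:
        # Native Windows without the pywin32 package installed.
        return None
    return win32api
```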

And that’s it! You should now have a fully functional virtualenv/Fabric/pip setup. To verify that Fabric works, create a file called fabfile.py:

from fabric.api import local, run

def local_test():
    local('echo "foo"')

def remote_test():
    run('uname -s')

This file, of course, only scratches the surface of what you can do with Fabric—refer to the latest documentation for more information.

To test the fabfile, type the following:

$ fab local_test
[localhost] run: echo "foo"

Done.

The biggest issue is that of getting Fabric to play along with your SSH installation so that you can deploy on remote servers. (You did install the openssh package, right?) Try the following command, substituting test@atbrox.com with one of your own accounts:

$ fab remote_test
No hosts found. Please specify (single) host string for connection: test@atbrox.com
[test@atbrox.com] run: uname -s
Password:
[test@atbrox.com] out: Linux

Done.
Disconnecting from test@atbrox.com... done.

The next step would be to set up password-less logins, but that is a different story.

Afterthoughts: While Cygwin is a lifesaver, it has some quirks and annoyances that may or may not be an issue depending on your system configuration. For instance, on my setup the following error tends to show up randomly when using Fabric for remote deployment:

sem_init: Resource temporarily unavailable
Traceback (most recent call last):
 File "build/bdist.cygwin-1.5.25-i686/egg/fabric/main.py", line 454, in main
 File "/cygdrive/c/Users/brox/workspace/quote_finder/fabfile.py", line 187, in
deploy
   _prepare_host_global()
 File "/cygdrive/c/Users/brox/workspace/quote_finder/fabfile.py", line 137, in
_prepare_host_global
   if not exists(u'/usr/bin/virtualenvwrapper_bashrc'):
 File "build/bdist.cygwin-1.5.25-i686/egg/fabric/contrib/files.py", line 32, in
exists
 File "/usr/lib/python2.5/contextlib.py", line 33, in __exit__
   self.gen.throw(type, value, traceback)
 File "/usr/lib/python2.5/contextlib.py", line 118, in nested
   yield vars
 File "build/bdist.cygwin-1.5.25-i686/egg/fabric/contrib/files.py", line 32, in
exists
 File "build/bdist.cygwin-1.5.25-i686/egg/fabric/network.py", line 371, in host
_prompting_wrapper
 File "build/bdist.cygwin-1.5.25-i686/egg/fabric/operations.py", line 422, in r
un
 File "channel.py", line 297, in recv_exit_status
 File "/usr/lib/python2.5/threading.py", line 368, in wait
   self.__cond.wait(timeout)
 File "/usr/lib/python2.5/threading.py", line 210, in wait
   waiter = _allocate_lock()
thread.error: can't allocate lock

This is a known problem that is not likely to go away anytime soon, due to an inherent race condition in Cygwin’s implementation of sem_init. Still, having a functional virtualenv/Fabric/pip environment on Windows is all in all pretty convenient.

There is a slew of useful articles out there if you need more information on the tools described in this article. These are my current favorites:

