Mapreduce & Hadoop Algorithms in Academic Papers (5th update – Nov 2011)

The prior update of this posting was in May, and a lot has happened related to Mapreduce and Hadoop since then, e.g.:
1) big software companies have started offering Hadoop-based software (Microsoft and Oracle), 2) Hadoop startups have raised record amounts of funding, and 3) the NoSQL landscape is becoming increasingly datawarehouse’ish and sql’ish, with the focus on high-level data processing platforms and query languages.

Personally I have rediscovered Hadoop Pig, combining it with UDFs and streaming as my primary way to implement mapreduce algorithms here at Atbrox (see the sketch below).
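
As a small illustration of that combination, here is a minimal sketch of a Pig UDF written in Python – assuming Pig 0.8+ with Jython UDF support. The file name wordcount_udfs.py and the function normalize_term are hypothetical examples, not code from an actual Atbrox pipeline; in a Pig script the file would be registered with REGISTER 'wordcount_udfs.py' USING jython AS udfs; and the function called as udfs.normalize_term(term) before grouping and counting.

# wordcount_udfs.py - illustrative Pig (Jython) UDF (hypothetical example)
# Pig makes the outputSchema decorator available when this file is
# registered with: REGISTER 'wordcount_udfs.py' USING jython AS udfs;
try:
    outputSchema  # provided by Pig's Jython runtime
except NameError:
    # stand-in so the file can also be run/tested outside Pig
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator

@outputSchema("term:chararray")
def normalize_term(raw):
    # lowercase and trim a token before it is grouped and counted
    if raw is None:
        return None
    return raw.strip().lower()

if __name__ == '__main__':
    # quick local sanity check (Jython/Python 2 syntax)
    print normalize_term('  Hadoop ')  # prints: hadoop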

Best regards,
Amund Tveit (twitter.com/atveit)

The change from the prior postings is that this posting only includes _new_ papers (2011):

Artificial Intelligence/Machine Learning/Data Mining

Bioinformatics/Medical Informatics

Image and Video Processing

Statistics and Numerical Mathematics

Search and Information Retrieval

Sets & Graphs

Simulation

Social Networks

Spatial Data Processing

Text Processing


Mapreduce & Hadoop Algorithms in Academic Papers (4th update – May 2011)


It’s been a year since I last updated the mapreduce algorithms posting, and it has truly been an excellent year for mapreduce and hadoop – the number of commercial vendors supporting it has multiplied, e.g. with 5 announcements at EMC World last week alone (Greenplum, Mellanox, Datastax, NetApp, and Snaplogic) and today’s Datameer funding announcement, which benefits the mapreduce and hadoop ecosystem as a whole (even for small fish like us here at Atbrox). The work-horse in mapreduce is the algorithm; this update adds 35 new papers compared to the prior posting, with new ones marked with *. I’ve also added 2 new categories since the last update – astronomy and social networking.

Motivation
Learn from the academic literature how the mapreduce parallel model and the hadoop implementation are used to solve algorithmic problems.

Which areas do the papers cover?

Author organizations and companies?
Companies: China Mobile, eBay, Google, Hewlett Packard, Intel, Microsoft, Wikipedia, Yahoo, and Yandex.
Government Institutions and Universities: US National Security Agency (NSA), Carnegie Mellon University, TU Dresden, University of Pennsylvania, University of Central Florida, National University of Ireland, University of Missouri, University of Arizona, University of Glasgow, Berkeley University and National Tsing Hua University, University of California, Poznan University, Florida International University, Zhejiang University, Texas A&M University, University of California at Irvine, University of Illinois, Chinese Academy of Sciences, Vrije Universiteit, Engenharia University, State University of New York, Palacky University, University of Texas at Dallas


Btw: I would like to recommend:

  1. Mapreduce bibliography maintained by (Cloudera co-founder) Jeff Hammerbacher
  2. (the excellent) book – Data-Intensive Text Processing with Mapreduce by (UMD’s/Twitter’s) Jimmy Lin and Christopher Dyer.

Let me know if you have input/corrections/feedback to this posting – amund (at) atbrox.com – or @atveit or @atbrox on twitter.

Best regards,
Amund Tveit (Atbrox co-founder)


Mapreduce in Search

I wrote about mapreduce in search in a presentation for next week.

(more up-to-date pdf version of the presentation)

Best regards,
Amund
Atbrox


Atbrox spin-off launches new media search engine

Our spin-off startup company – Gravemaskinen AS – just launched a new media search engine in cooperation with Journalisten.no (a magazine for Norwegian media professionals).

Some of the features include:

  • faceted’ish suggest search
  • realtime aggregation of search results for statistics
  • multiple views to support easy navigation and search
  • sentence-granularity search

(it is deployed on Amazon EC2)


An example of using F# and C# (.net/mono) with Amazon’s Elastic Mapreduce (Hadoop)

This posting gives an example of how F# and C# can potentially scale to thousands of machines with Mapreduce in order to efficiently process terabyte (TB) and petabyte (PB) data amounts. It shows a C# (c sharp) mapper function and an F# (f sharp) reducer function, together with a description of how to deploy the job on Amazon’s Elastic Mapreduce using a bootstrap action (it was tested with an Elastic Mapreduce cluster of 10 machines). With Hadoop streaming, the mapper reads input lines from stdin and writes tab-separated key/value pairs to stdout; the framework then sorts the pairs by key and pipes them to the reducer’s stdin.

The .net environment used is mono 2.8 and FSharp 2.0. The code described in this posting can be found at https://github.com/atbrox/atbroxexamples/

Mapreduce Code

C# mapper

using System;
using System.Linq;

public class WordCountMapper {
  static public void Main () {
    String line;
    // Hadoop streaming feeds each input line to the mapper on stdin
    while( (line = Console.ReadLine()) != null) {
      // tokenize on spaces, drop empty tokens, and emit "term<TAB>1" per token
      line.Trim().Split(' ')
       .Where(s => !String.IsNullOrEmpty(s.Trim()))
       .ToList()
       .ForEach(term => Console.Write(term.ToLower() + "\t1\n"));
    }
  }
}

Compiling the C# code

mcs wordcountmapper.cs
mkbundle -o wordcountmapper wordcountmapper.exe --deps --static

F# reducer

open System
open System.IO

// note: F# values are immutable by default (~ Erlang),
// use the explicit mutable keyword for the counter
let sumReducer (key:string, values:seq<string[]>) : int =
    let mutable count = 0
    for value in values do
        count <- count + Convert.ToInt32(value.[1])
    count

// note: |> below is similar to unix pipes
// read lines from stdin until EOF, split each line into [|term; count|]
// and group the resulting arrays by term
let reduceInput (tr:TextReader) =
    Seq.initInfinite (fun _ -> tr.ReadLine())
    |> Seq.takeWhile (fun line -> line <> null)
    |> Seq.map (fun line -> line.Split('\t','\n',' '))
    |> Seq.groupBy (fun line -> line.[0].Trim())

// apply the reduce function to each group and write "term<TAB>sum" to stdout
let reduceRunner (tr:TextReader) (reducef:string*seq<string[]>->int) =
    for line in reduceInput tr do
        let key,values = line
        let wc = reducef(key,values)
        System.Console.WriteLine(key+"\t"+Convert.ToString(wc))

// run the sumReducer method on input data from stdin
reduceRunner Console.In sumReducer
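
A design note on the reducer: Seq.groupBy materializes the whole input sequence in memory before any groups are emitted, which is fine for a wordcount demo. Since Hadoop streaming already delivers the reducer input sorted by key, a more memory-friendly variant could instead detect key boundaries line by line and emit each sum as soon as its key changes.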

Compiling the F# code

cd FSharp-2.0.0.0
mono bin/fsc.exe wordcountreducer.fsx --standalone 
mkbundle -o wordcountreducer wordcountreducer.exe --deps --static
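
Since both binaries simply read from stdin and write to stdout, you can sanity-check the whole pipeline locally before submitting anything to Hadoop, e.g. with cat input.txt | ./wordcountmapper | sort | ./wordcountreducer (the sort step mimics Hadoop’s shuffle and sort phase).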

Deployment on Amazon’s Elastic Mapreduce

In order to run mono/.net code on Elastic Mapreduce (Debian) Linux nodes you need to install mono on each node; this can be done with a bootstrap action shell script. In order to deploy the mapreduce job itself there are several options; the one shown in this posting uses the Boto API for Python. Note: this requires that you fetch the mono-2.8.2-parallel-environment-amd64.deb package and upload it to your s3 BUCKET in advance.

Bootstrap action shell script for installing mono

#!/bin/bash
hadoop fs -copyToLocal s3://BUCKET/mono-2.8.2-parallel-environment-amd64.deb .
sudo dpkg -i mono-2.8.2-parallel-environment-amd64.deb 
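
A bootstrap action like this runs on every node in the cluster before the job’s steps start, so mono is already in place when Hadoop streaming invokes the mapper and reducer binaries.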

Python script to deploy mapreduce and check status until it is done

import boto
import boto.emr
from boto.emr.step import StreamingStep
from boto.emr.bootstrap_action import BootstrapAction
import time

# set your aws keys and S3 bucket; the values below are placeholders
AWSKEY = 'YOUR_AWS_ACCESS_KEY'
SECRETKEY = 'YOUR_AWS_SECRET_KEY'
S3_BUCKET = 'YOURBUCKET'
NUM_INSTANCES = 1
SLAVE_INSTANCE_TYPE = 'm1.small'
MASTER_INSTANCE_TYPE = 'm1.small'

conn = boto.connect_emr(AWSKEY,SECRETKEY)

bootstrap_step = BootstrapAction("installmono", "s3://" + S3_BUCKET + "/monobootstrap.sh",None)

step = StreamingStep(name='Wordcount',
                     # the mkbundle'd mapper and reducer binaries, uploaded to S3
                     mapper='s3n://' + S3_BUCKET + '/wordcountmapper',
                     reducer='s3n://' + S3_BUCKET + '/wordcountreducer',
                     input='s3n://elasticmapreduce/samples/wordcount/input',
                     output='s3n://' + S3_BUCKET + '/output')

jobid = conn.run_jobflow(
    name="emr with fsharp and csharp", 
    log_uri="s3://" + S3_BUCKET + "/logs", 
    steps = [step],
    bootstrap_actions=[bootstrap_step],
    num_instances=NUM_INSTANCES,
    slave_instance_type=SLAVE_INSTANCE_TYPE,
    master_instance_type=MASTER_INSTANCE_TYPE,
    enable_debugging=True
)

print "finished spawning job (note: starting still takes time)"

state = conn.describe_jobflow(jobid).state
print "job state = ", state
print "job id = ", jobid
while state not in (u'COMPLETED', u'FAILED', u'TERMINATED'):
    print time.localtime()
    time.sleep(30)
    state = conn.describe_jobflow(jobid).state
    print "job state = ", state
    print "job id = ", jobid

print "final output can be found in s3://" + S3_BUCKET + "/output" 
print "try: $ s3cmd sync s3://" + S3_BUCKET + "/output" + TIMESTAMP + " ."

In order to run the python deploy job you need to install boto, update the constants at the top of the file with your AWS keys and S3 bucket, and finally start it with python deploy.py
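
Boto itself can be installed with e.g. easy_install boto (or pip install boto).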

Conclusion
This posting has given a simple example – wordcount in C# and F# – together with how to deploy the job on Amazon’s Elastic Mapreduce. It has been tested on up to 10 m1.large EC2 nodes. The code can be checked out with git clone git@github.com:atbrox/atbroxexamples.git.


Best regards,
Amund Tveit (amund atbrox.com)
Atbrox
