atbrox | Accelerates Innovation with Code

Workshop: Mapreduce’12 – 3rd International Workshop on Mapreduce and its applications

Posted on November 30, 2011 by Amund Tveit

If you are interested in Mapreduce or Hadoop I recommend submitting to or attending the following workshop.

The Third International Workshop on MapReduce and its Applications (MAPREDUCE’12)
June 18-19, 2012 HPDC’2012, Delft, the Netherlands.
http://graal.ens-lyon.fr/mapreduce/

SCOPE
=====

Since its introduction in 2004 by Google, MapReduce has become the programming
model of choice for processing large data sets. MapReduce borrows from functional
programming, where a programmer can define both a Map task that maps a data set
into another data set, and a Reduce task that combines intermediate outputs into
a final result. Although MapReduce was originally developed for use by web
enterprises in large data-centers, this technique has gained a lot of attention
from the scientific community for its applicability in large parallel data
analysis (including geographic, high energy physics, genomics, etc.).
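
To make the Map/Reduce split described above concrete, here is a minimal in-memory sketch in Python. It is not taken from the workshop text or from any particular framework: the names `map_words`, `reduce_counts` and `run_mapreduce` are hypothetical, and the "shuffle" step is just a dictionary where a real system would move data between machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map task: turn one input record into (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce task: combine all intermediate values for one key."""
    return key, sum(values)


def run_mapreduce(records: Iterable[str], mapper, reducer) -> Dict[str, int]:
    # Shuffle step (assumption: single process, everything fits in memory).
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Apply the reducer once per distinct key.
    return dict(reducer(key, values) for key, values in groups.items())


lines = ["map reduce map", "reduce the data"]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
print(word_counts)
```

The mapper and reducer never see each other's data directly; only the grouping step connects them, which is what lets a framework run many mappers and reducers in parallel.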

The purpose of the workshop is to provide a forum for discussing recent advances,
identifying open issues, introducing developments and tools, and presenting
applications and enhancements for MapReduce (or very similar) systems. We
therefore cordially invite contributions that investigate these issues, introduce
new execution environments, apply performance evaluations and show the
applicability to science and enterprise applications.

TOPICS OF INTEREST
==================

– MapReduce implementation issues and improvements
– Implementation optimization for GPU and multi-core systems
– Extensions to the programming model
– Large-scale MapReduce (Grid and Desktop Grid)
– Use of CDN and P2P techniques
– Heterogeneity and fault-tolerance
– Scientific data-sets analysis
– Data and compute-intensive applications
– Tools and environments for MapReduce
– Algorithms using the MapReduce paradigm

PAPER SUBMISSIONS
=================

Authors are invited to submit full papers of at most 8 pages, including all
figures and references. Papers should be formatted in the ACM proceedings style
(e.g., http://www.acm.org/sigs/publications/proceedings-templates). Submitted
papers must be original work that has not appeared in and is not under
consideration for another conference or a journal. Accepted papers will be
published by ACM in the conference workshops proceedings.

Papers should be submitted here: TO BE ANNOUNCED ON WEBSITE

IMPORTANT DATES
===============

– Manuscript submission deadline : February 25, 2012
– Acceptance notification : March 26, 2012
– Camera-ready paper deadline : April 16, 2012
– Workshop dates : June 18-19, 2012

ORGANIZATION COMMITTEE
======================

General Chairs
==============

– Gilles Fedak, INRIA/LIP (contact: gilles.fedak@inria.fr)
– Geoffrey Fox, Indiana University (contact: gcf@cs.indiana.edu)

Program Chair
=============

Simon Delamare, INRIA/LIP (contact: simon.delamare@inria.fr)

Publicity chair
===============

Haiwu He, INRIA/LIP (contact: haiwu.he@inria.fr)

PROGRAM COMMITTEE
=================

– Alexandre de Assis Bento Lima, Federal University of Rio de Janeiro
– Amund Tveit, Atbrox
– Carlo Mastroianni, ICAR-CNR
– Christian Engelmann, Oak Ridge National Laboratory
– Francisco V. Brasileiro, Federal University of Campina Grande
– Frédéric Suter, IN2P3/CNRS
– Gabriel Antoniu, INRIA
– Heithem Abbes, Faculty of Sciences of Tunis
– Heshan Lin, Virginia Polytechnic Institute and State University
– Hidemoto Nakada, AIST
– Jerry Zhao, Google (tech lead for Google’s Mapreduce team, sorting petabytes)
– José A.B. Fortes, University of Florida
– Judy Qiu, Indiana University
– Michael C. Schatz, Cold Spring Harbor Laboratory
– Oleg Lodygensky, CNRS
– Shantenu Jha, Louisiana State University
– Xuanhuan Shi, Huazhong University of Science and Technology
– Yang Yang, Netflix

Best regards,
Amund Tveit, Atbrox

Posted in hadoop, mapreduce | Tagged hadoop | Leave a comment

Workshop: Searching for fun

Posted on November 23, 2011 by Amund Tveit

If you are interested in search, I recommend submitting a paper to or attending the Searching 4 fun workshop* (I just joined as a program committee member), which will be held in Barcelona in April 2012.

Call for Papers
The topics of the workshop will be evaluation focused and include but are not limited to:

Understanding information needs and search behaviour in casual-leisure situations.
How existing systems are used in casual-leisure searching scenarios.
Systems / Interfaces / Algorithmic approaches to supporting Search in Casual-leisure situations.
Use of Recommender Systems for Entertaining Content (books, movies, videos, music, websites).
Modelling of users’ interests and generation of accurate and appropriate user profiles.
Interfaces for exploratory search for casual-leisure situations.
Evaluation (methods, metrics) of Casual-leisure searching situations.

We are seeking short 2-4 page position papers in this area, or short papers reporting early or formative results in the area of searching for fun.

Reviewing and Publishing

Papers will be reviewed by an international program committee, and we intend to publish the accepted papers using the CEUR Workshop Proceedings service.

Submissions

Participants should submit anonymised 2-4 page ACM format PDFs to the easychair page. More details are in our submissions page.

Organizers

Program Committee

Pertti Vakkari, Tampere, Finland
Elaine Toms, Sheffield, UK
Ryen White, Microsoft Research, USA
Leif Azzopardi, Glasgow, UK
Bernd Ludwig, Goethe, Germany
Ian Ruthven, Strathclyde, UK
Daniel Tunkelang, LinkedIn, USA
Pablo Castells, Madrid, Spain
Richard Schaller, Erlangen, Germany
Stefan Mandl, Augsburg, Germany
Amund Tveit, Atbrox, Norway

Twitter
We’ll be using the #search4fun hashtag where possible
(ECIR Conference Twitter account is @ecir2012)

Link to web pages for the workshop: fitlab.eu/searching4fun/cfp.php

(* at the 34th European Conference on Information Retrieval – ECIR 2012)

Best regards,
Amund Tveit

Posted in cfp, entertainment, information retrieval, query intent, search, workshop | Leave a comment

Mapreduce & Hadoop Algorithms in Academic Papers (5th update – Nov 2011)

Posted on November 9, 2011 by Amund Tveit

The prior update of this posting was in May, and a lot has happened related to Mapreduce and Hadoop since then, e.g.
1) big software companies have started offering hadoop-based software (Microsoft and Oracle), 2) Hadoop startups have raised record amounts, and 3) the nosql landscape is becoming increasingly datawarehouse’ish and sql’ish, with a focus on high-level data processing platforms and query languages.

Personally, I have rediscovered Hadoop Pig and combine it with UDFs and streaming as my primary way to implement mapreduce algorithms here at Atbrox.
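
For readers who have not used streaming: a streaming job exchanges plain text lines, with key and value separated by a tab, and the reducer receives its input sorted by key. The sketch below is an illustration under those assumptions only, not Atbrox code. The function names and the sample data are hypothetical, and the two steps are written as functions over lists of lines so they can be tried locally; in an actual job each would be a script reading `sys.stdin` and printing to standard output.

```python
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one tab-separated "word<TAB>1" line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the values per key; assumes the input is already sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


# Local simulation of map -> sort -> reduce on hypothetical sample input.
sample = ["Hadoop Pig streaming", "pig UDFs"]
result = list(reducer(sorted(mapper(sample))))
print(result)
```

Because `groupby` only merges adjacent equal keys, the `sorted` call in the simulation stands in for the sort that the framework performs between the map and reduce phases.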

Best regards,
Amund Tveit (twitter.com/atveit)

The change from the prior postings is that this posting only includes _new_ papers (2011):

Posted in hadoop, machine learning, mapreduce | 7 Comments

Mapreduce & Hadoop Algorithms in Academic Papers (4th update – May 2011)

Posted on May 16, 2011 by Amund Tveit


It’s been a year since I last updated the mapreduce algorithms posting, and it has been a truly excellent year for mapreduce and hadoop. The number of commercial vendors supporting them has multiplied, e.g. with 5 announcements at EMC World last week alone (Greenplum, Mellanox, Datastax, NetApp, and Snaplogic) and today’s Datameer funding announcement, which benefits the mapreduce and hadoop ecosystem as a whole (even small fish like us here at Atbrox). The work-horse in mapreduce is the algorithm. This update adds 35 new papers compared to the prior posting; new ones are marked with *. I’ve also added 2 new categories since the last update: astronomy and social networking.

Motivation
Learn from academic literature how the mapreduce parallel model and the hadoop implementation are used to solve algorithmic problems.

Which areas do the papers cover?

Ads & E-commerce

Improving ad relevance in sponsored search

Predicting the Click-Through Rate for Rare/New Ads

Learning Influence Probabilities in Social Networks

Mining advertiser-specific user behavior using adfactors

Extracting user profiles from large scale data

Large-Scale Behavioral Targeting

Search Advertising using Web Relevance Feedback

Predicting Ads’ ClickThrough Rate with Decision Rules

A stochastic learning-to-rank algorithm and its application to contextual advertising

Parallelizing large-scale data processing applications with data skew: a case study in product-offer matching

Learning website hierarchies for keyword enrichment in contextual advertising

Astronomy
*Algorithms for Large-Scale Astronomical Problems (2011)

Social Networks
*Social Content Matching in MapReduce (2011)
*Parallel Knowledge Community Detection Algorithm Research Based on MapReduce (2011)
*Large-Scale Community Detection on YouTube for Topic Discovery and Exploration (2011)

Bioinformatics/Medical Informatics
A novel approach to multiple sequence alignment using hadoop data grids
MapReduce-Based Pattern Finding Algorithm Applied in Motif Detection for Prescription Compatibility Network (2009)
MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees
*HBase, MapReduce, and Integrated Data Visualization for Processing Clinical Signal Data (2011)
*Accelerating statistical image reconstruction algorithms for fan-beam x-ray CT using cloud computing (2011)

Machine Translation
Training Phrase-Based Machine Translation Models on the Cloud: Open Source Machine Translation Toolkit Chaski
Grammar based statistical MT on Hadoop (2009)
Large Language Models in Machine Translation (2008)
*Fast, Easy and Cheap: Construction of Statistical Machine Translation Models with Mapreduce

Spatial Data Processing
Experiences on Processing Spatial Data with MapReduce
*Scalable spatio-temporal knowledge harvesting (2011)

Information Extraction and Text Processing
Statistical Sentence Chunking Using Map Reduce
Data-intensive text processing with MapReduce
Web-Scale Distributional Similarity and Entity Set Expansion (2009)
The infinite HMM for unsupervised PoS tagging (2009)
*Batch Text Similarity Search with MapReduce (2011)
*An Empirical Study of Massively Parallel Bayesian Networks Learning for Sentiment Extraction from Unstructured Text (2011)
*EntityTagger: automatically tagging entities with descriptive phrases (2011)

Artificial Intelligence/Machine Learning/Data Mining
LogMaster: Mining Event Correlations in Logs of Large Scale Cluster Systems
Stateful Bulk Processing for Incremental Analytics
Mining dependency in distributed systems through unstructured logs analysis
Beyond online aggregation: parallel and incremental data mining with online mapreduce
Learning based opportunistic admission control algorithm for mapreduce as a service
OWL reasoning with WebPIE: calculating the closure of 100 billion triples
Scaling ECGA model building via data-intensive computing
SPARQL basic graph pattern processing with iterative mapreduce
Residual Splash for Optimally Parallelizing Belief Propagation
Stochastic gradient boosted distributed decision trees
Distributed Algorithms for Topic Models
When Huge is Routine: Scaling Genetic Algorithms and Estimation of Distribution Algorithms via Data-Intensive Computing
Cloud Computing Boosts Business Intelligence of Telecommunication Industry
Parallel K-Means Clustering Based on MapReduce
Large-scale multimedia semantic concept modeling using robust subspace bagging and MapReduce
Parallel algorithms for mining large-scale rich-media data
Scaling Simple and Compact Genetic Algorithms using MapReduce
Scalable Distributed Reasoning using Mapreduce
Scaling Up Classifiers to Cloud Computers (2008)
*Preliminary Results on Using Matching Algorithms in Map-Reduce Applications (2011)
*Improving the Effectiveness of Statistical Feature Selection Algorithms Using Bag of Synsets and its Parallelization (2011)
*Tri-training and MapReduce-based massive data learning (2011)
*Parallel evolutionary approach of compaction problem using mapreduce (2011)
*COMET: A Recipe for Learning and Using Large Ensembles on Massive Data (2011)
*Parallelized K-Means clustering algorithm for self aware mobile ad-hoc networks (2011)


Search Query Analysis
Parallelizing Random Walk with Restart for large-scale query recommendation
BBM: Bayesian Browsing Model from Petabyte-scale Data (2009)
AIDE: Ad-hoc Intents Detection Engine over Query Logs (2009)

Information Retrieval (Search)
Automatically Incorporating New Sources in Keyword Search-Based Data Integration
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information
Learning URL patterns for webpage de-duplication
Information Seeking with Social Signals: Anatomy of a Social Tag-based EXploratory Search Browser
MIREX: Mapreduce Information Retrieval Experiments
Efficient Clustering of Web Derived Data Sets
The PageRank algorithm and application on searching of academic papers
A Parallel Algorithm for Finding Related Pages in the Web by Using Segmented Link Structures
On Single-Pass Indexing with MapReduce (2009)
A Data Parallel Algorithm for XML DOM Parsing (2009)
Semantic Sitemaps: Efficient and Flexible Access to Datasets on the Semantic Web (2008)
*Scalable knowledge harvesting with high precision and high recall (2011)
*MapReduce indexing strategies: Studying scalability and efficiency (2011)
*Ranking on large-scale graphs with rich metadata (2011)
*Distributed Index for Near Duplicate Detection (2011)
*SPRINT: ranking search results by paths (2011)
*Bagging Gradient-Boosted Trees for High Precision, Low Variance Ranking Models (2011)
*Sparse hidden-dynamics conditional random fields for user intent understanding (2011)

Mapreduce in Search

Spam & Malware Detection
Characterizing Botnets from Email Spam Records (2008)
– Clustering of emails into spam campaigns
– Finding the probability that 2 spam messages are sent from the same machine
– Estimating the likelihood of botnets based on common senders in spam campaigns
The Ghost In The Browser: Analysis of Web-based Malware (2007)

Image and Video Processing
Font rendering on a GPU-based raster image processor
MapReduce Optimization Using Regulated Dynamic Prioritization (2009)
– Video Stream Re-Rendering
Map-Reduce Meets Wider Varieties of Applications (2008)
– Location detection in images
*Counting triangles and the curse of the last reducer (2011)
*Adapting Skyline Computation to the MapReduce Framework: Algorithms and Experiments (2011)

Networking
Reducible Complexity in DNS

Simulation
Map-Reduce Meets Wider Varieties of Applications (2008)
– Simulation of earthquakes (geology)

Statistics
User-based collaborative filtering recommendation algorithms on hadoop
Brute Force and Indexed Approaches to Pairwise Document Similarity Comparisons with MapReduce (2009)
Fast Parallel Outlier Detection for Categorical Datasets using Mapreduce (2009)
MapReduce Optimization Using Regulated Dynamic Prioritization (2009)
– Digg.com story recommendations
Calculating the Jaccard Similarity Coefficient with Map Reduce for Entity Pairs in Wikipedia (2008)
– Measuring Wikipedia Editor similarity
Map-Reduce Meets Wider Varieties of Applications (2008)
– Netflix video recommendation
Large-scale Parallel Collaborative Filtering for the Netflix Prize (2008)

Numerical Mathematics
Distributed non-negative matrix factorization for dyadic data analysis on mapreduce
A mapreduce algorithm for SC
Multi-GPU Volume Rendering using MapReduce
Mapreduce for Integer Factorization
*Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent (2011)

Sets & Graphs
Towards scalable RDF graph analytics on MapReduce
Efficient Parallel Set-Similarity Joins using Mapreduce
Max-cover algorithm in map-reduce
Distributed Algorithm for Computing Formal Concepts Using Map-Reduce Framework
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
Graph Twiddling in a MapReduce World
DOULION: Counting Triangles in Massive Graphs with a Coin (2009)
Fast counting of triangles in real-world networks: proofs, algorithms and observations (2008)
*Filtering: A Method for Solving Graph Problems in MapReduce (2011)
*Colorful Triangle Counting and a MapReduce Implementation (2011)
*Mining Large Graphs: Algorithms, Inference, and Discoveries (2011)
*On labeled paths (2011)
*HADI: Mining radii of large graphs (2011)
*Towards Efficient Subgraph Search in Cloud Computing Environment (2011)

Author organizations and companies?
Companies: China Mobile, eBay, Google, Hewlett Packard and Intel, Microsoft, Wikipedia, Yahoo and Yandex.
Government Institutions and Universities: US National Security Agency (NSA), Carnegie Mellon University, TU Dresden, University of Pennsylvania, University of Central Florida, National University of Ireland, University of Missouri, University of Arizona, University of Glasgow, Berkeley University and National Tsing Hua University, University of California, Poznan University, Florida International University, Zhejiang University, Texas A&M University, University of California at Irvine, University of Illinois, Chinese Academy of Sciences, Vrije Universiteit, Engenharia University, State University of New York, Palacky University, University of Texas at Dallas

Btw: I would like to recommend:

Mapreduce bibliography maintained by (Cloudera co-founder) Jeff Hammerbacher
(the excellent) book – Data-Intensive Text Processing with Mapreduce by (UMD’s/Twitter’s) Jimmy Lin and Christopher Dyer.

Let me know if you have input/corrections/feedback to this posting – amund @\h@ atbrox.com – or @atveit or @atbrox on twitter.

Best regards,
Amund Tveit (Atbrox co-founder)

Posted in Atbrox, cloud computing, Hadoop and Mapreduce | 16 Comments

Mapreduce in Search

Posted on April 9, 2011 by Amund Tveit

I wrote about mapreduce in search in a presentation for next week.

Mapreduce in Search

(more up-to-date pdf version of the presentation)

Best regards,
Amund
Atbrox

Posted in Atbrox, Hadoop and Mapreduce, infrastructure, search | Tagged information retrieval, mapreduce, search | 2 Comments
