Citation Theußl, Stefan. 2009. Simple Parallel Computing with Hadoop. Rmetrics Workshop, Meielisalp, Switzerland, 28.06.-02.07.




The availability of very large and steadily growing data sets poses new challenges for efficient data handling, mainly due to the architectural performance limits of single-processor environments and to memory restrictions. On the other hand, we observe an increasing availability of multicore architectures, even in commodity computers, and of high-performance computing environments, i.e., distributed and highly integrated computing clusters. In this context, we propose to make use of a technique called MapReduce, which is widely used in high-performance computing because of its functional programming nature. Apache Hadoop is an open-source Java software framework that implements MapReduce and thus supports massive data processing across a cluster of workstations. We present the parallel programming model MapReduce, its implementation Hadoop, and how this framework can be used in conjunction with R. Based on an example in text mining, we illustrate how one may break up existing building blocks into suitable entities for parallel processing and how Hadoop can be used to encapsulate this parallelism within R. Furthermore, we show how distributed storage can be utilized to facilitate parallel processing of large and very large data sets. This approach offers a simple, flexible, and scalable parallel computing solution.
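
To make the MapReduce model named in the abstract concrete, the following is a minimal single-process sketch of a term-frequency count, a common text mining building block. It is not the author's code and does not use Hadoop or R: the function names (`map_phase`, `shuffle`, `reduce_phase`, `term_frequencies`) and the sample corpus are hypothetical and chosen only for illustration. In a real deployment the framework would distribute the map and reduce calls across cluster nodes and perform the grouping step itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (term, 1) pair for every whitespace-separated token."""
    return [(token.lower(), 1) for token in document.split()]


def shuffle(pairs):
    """Group emitted values by key (done by the framework between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: sum all counts emitted for one term."""
    return key, sum(values)


def term_frequencies(documents):
    """Run map, shuffle, and reduce sequentially over a list of documents."""
    # Each document is mapped independently, so this step could run in parallel.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical two-document corpus, for illustration only.
    corpus = ["Hadoop implements MapReduce", "MapReduce with R and Hadoop"]
    print(term_frequencies(corpus))
```

Because each call to `map_phase` depends only on its own document and each call to `reduce_phase` only on its own key, both steps can in principle be spread over many workers without changing the result, which is the property the abstract refers to as the functional programming nature of MapReduce.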



Publication's profile

Status of publication Published
Affiliation WU
Type of publication Paper presented at an academic conference or symposium
Language English
Title Simple Parallel Computing with Hadoop
Event Rmetrics Workshop
Year 2009
Date 28.06.-02.07.
Country Switzerland
Location Meielisalp
URL http://www.rmetrics.org/


Theußl, Stefan
Institute for Statistics and Mathematics IN
Research Institute for Computational Methods FI
Research areas (ÖSTAT Classification 'Statistik Austria')
1100 Mathematics, information technology
1128 Supercomputing