Citation Theußl, Stefan, Feinerer, Ingo, Hornik, Kurt. 2009. Distributed Text Mining with tm. useR!, Rennes, France, 08.07.-10.07.




Text mining is a widely used technique that applies statistical and machine learning methods to extract patterns or knowledge from large unstructured text data sets. Recently, R has gained explicit text mining support via the tm package. This infrastructure provides sophisticated methods for document handling, transformations, filters, and data export (e.g., term-document matrices). However, the availability of very large and ever-growing text corpora poses new challenges for the efficient handling of these data sets, mainly due to the architectural performance limits of single-processor environments and memory restrictions. On the other hand, we observe an increasing availability of multicore architectures, even in commodity computers, and of high performance computing environments, i.e., distributed and highly integrated computing clusters. In this context, we propose to make use of a technique called MapReduce, which is widely used in high performance computing because of its functional programming nature. Existing building blocks in tm allow for adding new layers to support this kind of parallelism and distributed allocation. In particular, we identify compute-intensive parts of tm, break these parts up into suitable entities for parallel processing, and finally encapsulate the emerging parallelism in a functional programming style. A key factor in large-scale text mining is the efficient management of data. Therefore, we show how distributed storage can be utilized to facilitate parallel processing of large and very large data sets. This approach offers us a reliable, flexible, and scalable high performance computing solution for distributed text mining.
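
The MapReduce idea described in the abstract can be illustrated with a minimal sketch. It is written in Python rather than R, it does not use or reproduce the tm API, and all names in it (tokenize, map_document, reduce_counts, term_document_matrix, the toy corpus) are hypothetical and chosen only for illustration. The map step counts terms per document independently, which is what makes it suitable for parallel processing; the reduce step merges the per-document counts into a term-document table.

```python
from collections import Counter
from functools import reduce
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def map_document(item):
    """Map step: turn one (doc_id, text) pair into (doc_id, term counts)."""
    doc_id, text = item
    return doc_id, Counter(tokenize(text))


def reduce_counts(acc, mapped):
    """Reduce step: merge one document's counts into the term-document table."""
    doc_id, counts = mapped
    for term, n in counts.items():
        acc.setdefault(term, {})[doc_id] = n
    return acc


def term_document_matrix(corpus, mapper=map):
    """Build {term: {doc_id: count}} from a {doc_id: text} corpus.

    `mapper` defaults to the built-in serial map; any function with the
    same calling convention could be passed instead (an assumption of
    this sketch, not something taken from the source).
    """
    return reduce(reduce_counts, mapper(map_document, corpus.items()), {})


# Toy corpus, invented for illustration only.
corpus = {
    "d1": "Text mining with R",
    "d2": "Distributed text mining",
}
tdm = term_document_matrix(corpus)
```

Because map_document depends only on its own input, the serial map could in principle be swapped for a parallel or distributed one without changing the reduce step; how tm itself partitions and distributes the work is not specified here.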



Publication's profile

Status of publication Published
Affiliation WU
Type of publication Paper presented at an academic conference or symposium
Language English
Title Distributed Text Mining with tm
Event useR!
Year 2009
Date 08.07.-10.07.
Country France
Location Rennes
URL http://www.r-project.org/useR-2009


Theußl, Stefan (Former researcher)
Hornik, Kurt
Feinerer, Ingo (TU Wien, Austria)
Institute for Statistics and Mathematics IN
Research Institute for Computational Methods FI
Research areas (ÖSTAT Classification 'Statistik Austria')
1105 Computer software
1128 Supercomputing