Quotation Theußl, Stefan, Feinerer, Ingo, Hornik, Kurt. 2009. Parallel Distributed Text Mining in R. International Federation of Classification Societies (IFCS) 2009 Conference, Dresden, Germany, 13.03.-18.03.




During the last decade, text mining has become a widely used discipline utilizing statistical and machine learning methods, both in academia and for business intelligence. Recently, R has gained explicit text mining support via the tm package. This infrastructure provides sophisticated methods for document handling, transformations, filters, and data export (e.g., term-document matrices). However, the steady growth and availability of large data sets pose new challenges for such a text mining framework: the corpora cannot be processed efficiently on a single computer, mainly due to memory restrictions. On the other hand, there is now an increasing number of multicore processors and even high performance computing environments, i.e., distributed and highly integrated computing clusters. We propose techniques to take advantage of high performance computing by adding layers to the tm package which provide parallelism and distributed allocation. In detail, we identify the parts of tm which are sensitive to speed and performance, break these parts up into suitable building blocks for parallel processing, and finally encapsulate the emerging parallelism in a functional programming style. A key factor in large scale text mining is the efficient management of data. Therefore, we show how distributed storage can be utilized to facilitate parallel processing of large data sets. This approach offers us a reliable, flexible, and scalable high performance solution for distributed text mining.
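The idea of encapsulating parallelism in a functional programming style can be illustrated with a minimal sketch. It is written in Python rather than R and is not the tm implementation. The function names `term_counts` and `build_term_document_matrix`, the whitespace tokenizer, and the `mapper` parameter are assumptions made for illustration only. The per-document step is a pure function, so the serial `map` can be exchanged for a pool's `map` without touching the rest of the code.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def term_counts(document):
    """Count terms in one document (hypothetical lowercase whitespace tokenizer)."""
    return Counter(document.lower().split())


def build_term_document_matrix(documents, mapper=map):
    """Build a term-document matrix; `mapper` decides how the work is distributed.

    `mapper` must behave like the built-in map and preserve input order.
    """
    # Building block: each document is processed independently of the others.
    per_document = list(mapper(term_counts, documents))
    # Combine step: collect the vocabulary, then fill one row per term.
    vocabulary = sorted(set().union(*per_document))
    matrix = [[counts[term] for counts in per_document] for term in vocabulary]
    return vocabulary, matrix


if __name__ == "__main__":
    docs = ["text mining in R", "parallel text mining"]
    # Threads keep the sketch portable; a process pool or a cluster backend
    # would be substituted here for CPU-bound work on large corpora.
    with ThreadPoolExecutor(max_workers=2) as pool:
        vocabulary, matrix = build_term_document_matrix(docs, mapper=pool.map)
    for term, row in zip(vocabulary, matrix):
        print(term, row)
```

Because the parallel backend enters only through `mapper`, the same function runs unchanged on a single core, which is one way to read the layering the abstract describes. The distributed storage aspect is not covered by this sketch.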



Publication's profile

Status of publication Published
Affiliation WU
Type of publication Paper presented at an academic conference or symposium
Language English
Title Parallel Distributed Text Mining in R
Event International Federation of Classification Societies (IFCS) 2009 Conference
Year 2009
Date 13.03.-18.03.
Country Germany
Location Dresden
URL http://www.ifcs2009.de/


Theußl, Stefan
Hornik, Kurt
Feinerer, Ingo (TU Wien, Austria)
Institute for Statistics and Mathematics IN
Research Institute for Computational Methods FI
Research areas (ÖSTAT Classification 'Statistik Austria')
1100 Mathematics, information technology
1105 Computer software
1128 Supercomputing