Quotation Theußl, Stefan, Feinerer, Ingo, Hornik, Kurt. 2011. Distributed Text Mining in R. Research Report Series, Institute for Statistics and Mathematics, Report 107.




R has recently gained explicit text mining support with the "tm" package, enabling statisticians to answer many interesting research questions via statistical analysis or modeling of (text) corpora. However, we typically face two challenges when analyzing large corpora: (1) the amount of data that can be processed on a single machine is usually limited by the available main memory (i.e., RAM), and (2) an increase in the amount of data to be analyzed increases the computational workload. Fortunately, adequate parallel programming models like MapReduce, and the corresponding open source implementation Hadoop, allow for processing data sets beyond what would fit into memory. In this paper we present the package "tm.plugin.dc", which offers a seamless integration between "tm" and Hadoop. We show on the basis of an application in culturomics that we can efficiently handle data sets of significant size.



Publication's profile

Status of publication Published
Affiliation WU
Type of publication Working/discussion paper, preprint
Language English
Title Distributed Text Mining in R
Title of whole publication Research Report Series, Institute for Statistics and Mathematics, Report 107
Year 2011
URL http://epub.wu.ac.at/3034/


Theußl, Stefan (Former researcher)
Hornik, Kurt
Feinerer, Ingo
Institute for Statistics and Mathematics IN
Research Institute for Computational Methods FI