Quotation Theußl, Stefan, Feinerer, Ingo, Hornik, Kurt. 2012. A tm plug-in for distributed text mining in R. Journal of Statistical Software 51 (5): 1-31.




R has gained explicit text mining support with the tm package, enabling statisticians to answer many interesting research questions via statistical analysis or modeling of (text) corpora. However, we typically face two challenges when analyzing large corpora: (1) the amount of data that can be processed on a single machine is usually limited by the available main memory (i.e., RAM), and (2) the more data to be analyzed, the higher the need for efficient procedures for calculating valuable results. Fortunately, adequate programming models like MapReduce facilitate parallelization of text mining tasks and allow for processing data sets beyond what would fit into memory by using a distributed file system possibly spanning several machines, e.g., in a cluster of workstations. In this paper we present a plug-in package to tm called tm.plugin.dc implementing a distributed corpus class which can take advantage of the Hadoop MapReduce library for large scale text mining tasks. We show on the basis of an application in culturomics that we can efficiently handle data sets of significant size.



Publication's profile

Status of publication Published
Affiliation WU
Type of publication Journal article
Journal Journal of Statistical Software
Citation Index SCI
WU-Journal-Rating new FIN-A
Language English
Title A tm plug-in for distributed text mining in R
Volume 51
Number 5
Year 2012
Page from 1
Page to 31
URL http://epub.wu.ac.at/3974/1/plugin.pdf


Theußl, Stefan
Hornik, Kurt
Feinerer, Ingo
Institute for Statistics and Mathematics IN
Research Institute for Computational Methods FI