R is a programming language almost exclusively associated with the processing of numbers. Unlike Python, Ruby, Java or C, R is rarely considered when a task involves text data. This is a shame. R can process character strings, has fairly robust support for regular expressions and, when combined with its inherent statistical capabilities, makes for a very powerful tool for text readability analysis, semantic analysis and many other operations not usually associated with R.
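As a rough illustration of combining string handling with statistics, here is a minimal sketch in base R. The sample sentence and the variable names are made up for this example, and the cleaning step is deliberately simplistic.

```r
# Hypothetical sample text; any character vector would do.
text <- "R is not only for numbers. It handles text, too."

# Lower-case the text and strip everything except letters and spaces.
clean <- gsub("[^a-z ]", "", tolower(text))

# Split on runs of spaces to get the individual words.
words <- strsplit(clean, " +")[[1]]

# Apply ordinary statistics to a property of the text.
word_lengths <- nchar(words)
mean_length <- mean(word_lengths)  # average word length
freq <- table(words)               # word frequency table
```

Average word length is one of the ingredients of several readability measures, so a real analysis would build on the same pattern with a more careful tokenizer.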
The following is a list of R learning resources for text and string processing:
- eBook Download: Handling and Processing Strings in R
Gaston Sanchez is a self-described applied statistician who has written a Creative Commons licensed e-book, Handling and Processing Strings in R. This book is an excellent overview of R’s string handling capabilities, from the basics to more advanced topics. If for no other reason, the two chapters on regular expressions make this book a must-read.
This is a link to the post on Gaston’s blog where he describes his motivation for writing the book and gives an overview of its content: Handling and Processing Strings in R | Gaston Sanchez.
Wikibooks is a project hosted by the Wikimedia Foundation, the same organization that hosts Wikipedia. The mission of Wikibooks is to provide a forum for collaboratively writing open-content textbooks. The subjects of the books written at Wikibooks range from cooking to clocks.
As it happens, there are a fair number of technical Wikibooks. This one in particular is focused on the R programming language and its use for text processing.
It’s not long after the topic of text processing is raised that the closely related topic of regular expressions comes up. It’s not possible, nor is it practical, to discuss one without the other; they are that intimately linked. This Wikibook, while not specific to R, is relevant because it is a comprehensive look at regular expressions.
Hadley Wickham is an Assistant Professor of Statistics at Rice University. He is also the author of the stringr package for R. In this R Journal article, Hadley gives an in-depth look at his package.
From the stringr package documentation:
stringr is a set of simple wrappers that make R’s string functions more consistent, simpler and easier to use. It does this by ensuring that: function and argument names (and positions) are consistent, all functions deal with NA’s and zero length character appropriately, and the output data structures from each function matches the input data structures of other functions.
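To make the points about consistent arguments and NA handling concrete, here is a small sketch. It assumes the stringr package is installed, and the `fruit` vector is a made-up input for this example.

```r
library(stringr)  # assumes the stringr package is installed

# Hypothetical input with a missing value in the middle.
fruit <- c("apple", NA, "kiwi")

# In each call the string comes first, and a missing input
# produces a missing output rather than a misleading value.
lens <- str_length(fruit)       # lengths, NA for the missing element
has_i <- str_detect(fruit, "i") # pattern match, NA for the missing element
head3 <- str_sub(fruit, 1, 3)   # first three characters, NA preserved

# Base R for comparison: a bare logical NA is treated as the text "NA".
base_len <- nchar(NA)
```

The base R comparison at the end is the kind of surprise the wrappers are meant to smooth over: `nchar(NA)` reports a length of 2 instead of a missing value.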
- Introduction to String Matching and Modification in R Using Regular Expressions
Svetlana Eden is a biostatistician at Vanderbilt University. Svetlana authored the above paper, “Introduction to String Matching and Modification in R Using Regular Expressions”, in which she takes a deep dive into the use of regular expressions in R.
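For a flavor of what string matching and modification with regular expressions looks like in base R, here is a minimal sketch. It is not taken from the paper; the record format and variable names are invented for illustration.

```r
# Hypothetical records; the format is made up for this example.
records <- c("id=101; name=Ann", "id=27; name=Bob", "no id here")

# Matching: which elements contain "id=" followed by digits?
has_id <- grepl("id=[0-9]+", records)

# Extraction: pull the first run of digits out of each element that has one.
ids <- regmatches(records, regexpr("[0-9]+", records))

# Modification: use a capture group to keep only the name.
names_only <- sub(".*name=([A-Za-z]+).*", "\\1", records[has_id])
```

Note that `regmatches()` silently drops elements without a match, so `ids` has two entries while `records` has three.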
The next list of resources, while not specific to string or text processing, is a very good starting point for learning and using the R programming language:
- Cookbook for R
- twotorials by anthony damico
This last list of resources is, again, general to R. They come from a Computerworld series of articles that introduces the R language and provides a fairly comprehensive beginner’s guide to it. The last item is an enumeration of 60+ additional learning resources including books, articles, tips and tricks:
- Beginner’s guide to R: Introduction – Computerworld
- Beginner’s guide to R: Get your data into R – Computerworld
- Beginner’s guide to R: Easy ways to do basic data analysis – Computerworld
- Beginner’s guide to R: Painless data visualization – Computerworld
- Beginner’s guide to R: Syntax quirks you’ll want to know – Computerworld
- 60+ R resources to improve your data skills – Computerworld