A “Cliff Notes” for Big Data
The following is a list of meta-resources identified by Dr. Kirk Borne in a blog post he wrote at Data Science Central. Each item in the list is a link to the resource it names. I have also included a link to Dr. Borne’s blog post:
Dr. Kirk Borne writes at Data Science Central:
The flood of articles, webinars, and conferences related to Big Data is generating its own “infoglut”. Consequently, it is really helpful when you find resources that summarize many of the latest developments in one place – a sort of “Cliff Notes” for Big Data. Here are six meta-resources that I have found useful, plus one additional collection that I authored:
Dr. Borne’s original blog post: Big Data – Seven Meta-Resources for Best Practices, Lessons Learned, Data Stories, Opportunities, and Insights – Data Science Central
Daniel Gutierrez, Managing Editor of insideBigData.com, gives us the following bit of Big Data humor:
Pie charts are typically deemed the least useful type of plot to data scientists during the Exploratory Data Analysis phase of a machine learning project, but in this case I think it works!
This joke sparked a memory of a related bit of humor. This one is all my fault; there’s no going back to blame Dan for any of it:
A son comes home for spring break after having gone away from the family farm the first time for college. He’s in the kitchen at home with his father who is just starting a conversation, “Hello, son! It’s good to see you and have you back home again, even if it is for a short time.” “Thanks, Paw! I’m glad to be home and more glad to have a break from school— it’s a lot of hard work!”, responds the son. “Is it now?” says the father, “What’s your most difficult class?” “Trigonometry,” answers the son, “but, it’s also the most interesting.” “Is that so,” replies the father, “Tell me something interesting that you’ve learned in this class.”
The son taps his temple with his forefinger and searches his memory before replying, “Well, we learned about circles and about Pi ‘r’ squared.” The father shakes his head slowly, smirks and quietly scoffs before saying, “Son, I hate to say it, but I think you may be wasting your time at that fancy big college-school. Everyone with sense knows that ‘Pie are squared’ is absolutely wrong! CORNBREAD are square, PIE are round!”
R is a programming language almost exclusively associated with number crunching. Unlike Python, Ruby, Java or C, R rarely comes to mind when the task at hand involves text data. This is a shame. R can process character strings, has fairly robust support for regular expressions and, when these are combined with its inherent statistics capabilities, makes for a very powerful tool for text readability analysis, semantic analysis and many other operations not usually associated with R.
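To make that claim concrete, here is a minimal sketch in base R that mixes string handling, a regular expression and a couple of simple statistics. The sample sentence and the variable names are my own, made up for illustration, and the two measures are only rough readability-style indicators, not an established readability formula.

```r
# Hypothetical sample text; any character vector would do.
text <- "R is not only for numbers. It handles text, too! Does that surprise you?"

# Split into sentences on whitespace that follows ., ! or ?
# (the lookbehind requires perl = TRUE).
sentences <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))

# Lowercase the text, strip punctuation, then split into words on whitespace.
clean <- gsub("[[:punct:]]", "", tolower(text))
words <- unlist(strsplit(clean, "\\s+"))

# Two simple readability-style statistics.
words_per_sentence <- length(words) / length(sentences)
mean_word_length   <- mean(nchar(words))

# Word frequencies, most common first.
freq <- sort(table(words), decreasing = TRUE)

print(words_per_sentence)
print(mean_word_length)
print(head(freq, 3))
```

Everything here ships with a standard R installation. Once the text has been broken into a vector of words, the usual statistical tools (`mean`, `table`, `sort`) apply to it directly, which is where R’s statistical side starts to pay off.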
The following is a list of R learning resources for text and string processing:
- eBook Download: Handling and Processing Strings in R
Gaston Sanchez is a self-described applied statistician who has written a Creative Commons-licensed e-book, Handling and Processing Strings in R. This book is an excellent overview of R’s string handling capabilities, from the basics to more advanced topics. If for no other reason, the two chapters on regular expressions make this book a must-read.
This is a link to the post on Gaston’s blog where he describes his motivation for writing the book and gives an overview of its content: Handling and Processing Strings in R | Gaston Sanchez.
- R Programming/Text Processing – Wikibooks
Wikibooks is a project hosted by the Wikimedia Foundation, the same organization that hosts Wikipedia. The mission of Wikibooks is to provide a forum for collaboratively writing open-content textbooks. The subjects of the books written at Wikibooks range from cooking to clocks.
As it happens, there are a fair number of technical Wikibooks. This one in particular is focused on the R programming language and its use for text processing.
- Regular Expressions – Wikibooks
It’s not long after the topic of text processing is raised that the closely related topic of regular expressions is brought up. It’s neither possible nor practical to discuss one without the other; they are that intimately linked. This Wikibook, while not specific to R, is relevant because it takes a comprehensive look at regular expressions.
- R Journal Article: stringr: modern, consistent string processing
Hadley Wickham is an Assistant Professor of Statistics at Rice University. He is also the author of the stringr package for R. In this R Journal article, Hadley gives an in-depth look at his package.
From the stringr package documentation:
stringr is a set of simple wrappers that make R’s string functions more consistent, simpler and easier to use. It does this by ensuring that: function and argument names (and positions) are consistent, all functions deal with NA’s and zero length character appropriately, and the output data structures from each function matches the input data structures of other functions.
- Introduction to String Matching and Modification in R Using Regular Expressions
Svetlana Eden is a biostatistician at Vanderbilt University. Svetlana authored the above paper, “Introduction to String Matching and Modification in R Using Regular Expressions”, in which she takes a deep dive into the use of regular expressions in R.
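For a taste of what the resources above cover, here is a small sketch of regular-expression matching and modification using only base R functions. The sample strings and the date pattern are hypothetical, invented for this example; they are not taken from any of the resources listed.

```r
# Hypothetical input lines, made up for illustration.
lines <- c("Order 66 shipped on 2014-03-07",
           "No date on this line",
           "Returned 2014-04-15, refunded 2014-04-18")

# A simple pattern for dates written as year-month-day.
date_pattern <- "[0-9]{4}-[0-9]{2}-[0-9]{2}"

# Matching: which elements contain a date?
has_date <- grepl(date_pattern, lines)

# Extraction: pull out every date found in each element (returns a list).
dates <- regmatches(lines, gregexpr(date_pattern, lines))

# Modification: reorder year-month-day into day/month/year
# using capture groups and backreferences.
reordered <- gsub("([0-9]{4})-([0-9]{2})-([0-9]{2})", "\\3/\\2/\\1", lines)

print(has_date)
print(dates)
print(reordered)
```

`grepl`, `gregexpr`, `regmatches` and `gsub` are all part of base R, so nothing needs to be installed. The stringr package mentioned above wraps the same kinds of operations behind more consistently named functions.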
The resources in this next list, while not specific to string or text processing, are very good for getting started with the R programming language:
The resources in this last list are, again, general to R. They come from a Computerworld series of articles that introduces the R language and provides a fairly comprehensive beginner’s guide to it. The last item is an enumeration of 60+ additional learning resources, including books, articles, tips and tricks: