R: A Refresher
You can call R, as Wikipedia does, “a programming language and software environment for statistical computing and graphics.” But, if you’re familiar with R, you probably know it’s a lot more than that.
Created by Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand in the 1990s as a statistical platform for their students, open-source R has been extended over the decades by thousands of user-created packages. So what is R today? It’s many things:
- R is data analysis software: Data scientists, statisticians, and analysts—anyone who needs to make sense of data, really—can use R for statistical analysis, data visualization, and predictive modeling.
- R is a programming language: Created by statisticians, R combines functional and object-oriented features, providing objects, operators, and functions that allow users to explore, model, and visualize data.
- R is an environment for statistical analysis: Standard statistical methods are easy to implement in R, and since much of the cutting-edge research in statistics and predictive modeling is done in R, newly developed techniques are often available in R first.
- R is an open-source software project: R is free and, thanks to years of scrutiny and tinkering by users and developers, has a high standard of quality and numerical accuracy. R’s open interfaces allow it to integrate with other applications and systems.
- R is a community: The R project leadership has grown to include more than 20 leading statisticians and computer scientists from around the world, and thousands of contributors have created add-on packages. With an estimated two million users, R boasts a vibrant online community.
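To make the "explore, model, and visualize" idea concrete, here is a minimal sketch that uses only base R and the built-in mtcars data set. The choice of data set and of a simple linear regression is ours, purely for illustration.

```r
# Built-in example data: 32 cars with fuel economy (mpg) and weight (wt).
data(mtcars)

# Explore: summary statistics for two columns.
print(summary(mtcars[, c("mpg", "wt")]))

# Model: simple linear regression of fuel economy on weight.
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))

# Visualize: scatter plot with the fitted regression line overlaid.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit)
```

Run as a script, the plot is written to a file rather than shown on screen; in an interactive session it opens in a graphics window.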
R or Python?
Members of the R community are currently embroiled in the latest skirmish in the data science wars: R versus Python.
Detractors say R is not as approachable as Python, that Python is more widely known and more broadly applicable. Advocates counter with lists of R’s field-specific advantages, arguing that “there’s nothing wrong with using a special purpose programming language to implement special purpose problems.”
In the end, neither R nor Python soundly trumps the other. Which one you should use depends on the particulars of your project (and you!). Thankfully, DataCamp recently unveiled an infographic that provides a side-by-side comparison of the two programming languages from a data science and statistics perspective. Check it out.
Data Science Projects That Use R
Once confined almost exclusively to academia, R now has users and advocates across the public and private sectors. The programming language/software environment has made inroads into social networking services, financial institutions, media outlets, and even the city of Chicago. Here’s how some big names are making R work for them.
- Bank of America: While banking analysts have traditionally pored over Excel files, R is increasingly being used for financial modeling, Bank of America Vice President Niall O’Connor told Fast Company, particularly as a visualization tool.
- Chicago: Thanks to an automated system that uses real-time textual analysis implemented with R, the Windy City has a powerful tool for identifying the sources of food poisoning—from tweets!
- Facebook: Data scientists at Facebook use open-source R packages from Hadley Wickham (e.g., ggplot2, dplyr, plyr, and reshape) to explore new data through custom visualizations.
- New York Times: The Gray Lady uses R for interactive features like Election Forecast, for data journalism (e.g., mapping wealth distribution), and for visualizing data, whether from Mariano Rivera’s baseball career or the Facebook IPO.
- Twitter: Twitter created the BreakoutDetection package for R to monitor user experience on its network.
The Best Add-On Packages for R
Devotees will tell you that the power of R resides in its community of contributors and the diversity of packages they make available for others to use. Here are 10 packages one data scientist wished he had known about sooner:
- sqldf allows you to perform SQL queries on R data frames. If you’ve got basic SQL skills, sqldf will have you data munging in no time.
- forecast makes it easy to fit time series models.
- plyr provides a handful of functions that split a data structure into groups, apply a function to each group, and return the combined results in a data structure.
- stringr offers a suite of string operators.
- Database drivers (e.g., RMongo, RSQLite, RMySQL) let R connect directly to whatever database you’re using, saving you time and copy-pasting effort.
- lubridate facilitates working with dates and times. It’s “one of those magical libraries that just seems to do exactly what you expect it to.”
- ggplot2 makes it easy to produce spiffy plots.
- qcc is a package for statistical quality control.
- reshape2, as the name suggests, does data restructuring: It converts data from wide to long format and vice versa.
- randomForest is a machine learning package that can do supervised or unsupervised learning.
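To give a feel for two of the packages above, here is a small sketch of reshape2’s wide-to-long conversion and plyr’s split-apply-combine pattern. It assumes both packages are installed, and the `scores` data frame and its column names are made up for illustration.

```r
# Assumes the plyr and reshape2 packages are installed, e.g. via
# install.packages(c("plyr", "reshape2"))
library(plyr)
library(reshape2)

# Hypothetical toy data in wide format: one row per student,
# one column per subject.
scores <- data.frame(
  student = c("a", "b", "c"),
  class   = c("x", "x", "y"),
  math    = c(90, 80, 70),
  reading = c(85, 75, 90)
)

# reshape2: wide -> long, giving one row per student/subject pair.
long <- melt(scores, id.vars = c("student", "class"),
             variable.name = "subject", value.name = "score")

# plyr: split by class, apply mean() to each group,
# and combine the results into a data frame.
class_means <- ddply(long, .(class), summarise, mean_score = mean(score))

# reshape2: long -> wide again, one column per subject.
wide <- dcast(long, student + class ~ subject, value.var = "score")

print(long)
print(class_means)
print(wide)
```

The same grouping could be written in other ways; `ddply` is used here only because it maps directly onto the split, apply, and combine steps described in the list.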
Want more? Browse a comprehensive list of R packages.
Online Courses in R Programming
No matter what your level of R expertise, resources abound. If you’re a seasoned user who just needs to keep pace with new developments, for example, the R Project’s What’s New? page may suffice. If, however, you’re just getting started, here are some online courses that will have you using R for data science in, if not hours, at least weeks or months:
- Try R from Code School: This introduction to R has users clamoring for a follow-up.
- The Analytics Edge from MITx: Devote 10-15 hours a week to this 12-week course and you’ll learn to implement linear regression, logistic regression, CART, clustering, and data visualization in R.
- R Programming, etc. from DataCamp: DataCamp offers an assortment of R tutorials and data science courses, complete with video lessons and coding challenges.
- Data Analysis with R from Udacity: This two-month, six-hours-per-week course uses R for exploratory data analysis.
- R Programming from Johns Hopkins University via Coursera: This is the second course in a Data Science specialization.
- Courses offered by Data Society: Data Society aims to be a fast way to learn how to use machine learning methods to get insights out of your data. Its array of data science courses comes with engaging videos, downloadable R code templates, real-world data sets, lots of exercises, and step-by-step companion guides.