Python for Data Science?
Conceived in the late 1980s, Python didn’t make inroads into data science until recently. For a long time, as Tal Yarkoni of UT Austin says, “you couldn’t really do statistics in Python unless you wanted to spend most of your time pulling your hair out.”
Now, however, tools for almost every aspect of scientific computing are readily available in Python. (Thanks in part, no doubt, to the $3 million the Defense Advanced Research Projects Agency (DARPA) put toward the development of data analytics and data processing libraries for Python in late 2012.)
Bank of America uses Python to crunch financial data. Facebook turns to the Python library Pandas for its data analysis because it sees the benefit of using one programming language across multiple applications.
“One of the reasons we like to use Pandas is because we like to stay in the Python ecosystem,” Burc Arpat, a quantitative engineering manager at Facebook, told Fast Company in May 2014.
Python or R?
Python’s increased use in data science applications has situated it in opposition to R, a programming language and software environment specifically designed to execute the sorts of data analysis tasks Python can now handle. As speculation mounts about whether one of the languages will eventually replace the other in the data science sphere, individuals have to decide which language to learn or which to use for a specific project.
This is where an infographic that DataCamp unveiled in 2015 can help. It provides a side-by-side comparison of the two programming languages from a data science and statistics perspective. So the next time you’re debating whether to use R or Python for machine learning, statistics, or the Internet of Things, “Data Science Wars: R vs. Python” comes to the rescue!
Five Python Libraries for Data Science
While there are many libraries available to perform data analysis in Python, here are a few to get you started:
- NumPy is fundamental for scientific computing with Python. It supports large, multi-dimensional arrays and matrices and includes an assortment of high-level mathematical functions to operate on these arrays.
- SciPy works with NumPy arrays and provides efficient routines for numerical integration and optimization.
- Pandas, also built on top of NumPy, offers data structures and operations for manipulating numerical tables and time series.
- Matplotlib is a 2D plotting library that can generate such data visualizations as histograms, power spectra, bar charts, and scatterplots with just a few lines of code.
- Built on NumPy, SciPy, and Matplotlib, Scikit-learn is a machine learning library that implements classification, regression, and clustering algorithms including support vector machines, logistic regression, naive Bayes, random forests, and gradient boosting.
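To give a feel for the array-oriented style these libraries share, here is a minimal sketch that uses only NumPy. The temperature figures and variable names are made up for illustration; the point is simply that arithmetic, aggregation, and filtering apply to whole arrays at once rather than element by element.

```python
import numpy as np

# Hypothetical data: temperatures in Celsius (rows = 3 cities, columns = 4 days).
temps_c = np.array([
    [20.0, 22.0, 19.0, 23.0],
    [15.0, 14.0, 16.0, 15.0],
    [30.0, 31.0, 29.0, 30.0],
])

# Vectorized arithmetic is applied to every element without an explicit loop.
temps_f = temps_c * 9 / 5 + 32

# Reductions can run along an axis: axis=1 averages across each row (per city).
city_means = temps_c.mean(axis=1)

# A boolean mask selects only the elements that satisfy a condition.
warm_days = temps_c[temps_c > 22.0]

print(temps_f.shape)
print(city_means)
print(warm_days)
```

Pandas and Scikit-learn accept NumPy arrays like these as input, which is one reason NumPy is usually the first library to learn.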
Want more? Here’s a longer list of Python libraries useful for data science applications.
Online Courses in Python for Data Science
Since Python is a general-purpose language, many of the educational resources available are not specific to data science. Here, though, are a handful that are:
- Learning Python for Data Analysis and Visualization from Udemy: Assuming no prior knowledge of Python, this course promises “full understanding of how to program with Python and how to use it in conjunction with scientific computing modules and libraries to analyze data.”
- Data Analysis in Python with Pandas from Udemy: Designed for those with intermediate programming ability, this course guides students to mastery of fundamental data analysis methods in Python.
- CS109 Data Science from Harvard University: This Ivy League introduction to data science uses Python for all programming assignments and projects. Slides and video lectures are available online free of charge, and the IPython notebooks for the course are on GitHub.
- Intro to Python for Data Science from DataCamp: This course focuses specifically on Python for data science, teaching powerful ways to manipulate data along with the data science tools needed to start doing analysis right away.
- Data Science: Linear Regression in Python from Udemy: This course introduces students to linear regression, a popular data science technique. Students will learn how to create their own linear regression model in Python and use it to solve real-world problems.
- Data Science: Logistic Regression in Python from Udemy: This course begins with the theory of logistic regression, followed by lessons on how to use this popular technique yourself. It’s a great first step for anyone interested in deep learning and neural networks.
- Data Science: Deep Learning in Python from Udemy: Students will learn how to build artificial neural networks, use the popular training method called “backpropagation,” and much more. Great for both students and working professionals!
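For a taste of what building a linear regression model in Python can involve, here is a minimal sketch in plain Python. It is not taken from any of the courses above; the function name `fit_line` and the sample data are hypothetical, and it fits a single-feature line using the standard least-squares formulas.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations of x and y, and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: hours studied vs. test score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]

slope, intercept = fit_line(hours, scores)
predicted = slope * 6.0 + intercept  # prediction for 6 hours of study

print(slope, intercept, predicted)
```

In practice, a library such as Scikit-learn would typically handle the fitting, but writing it out once makes clear what the library is computing.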