Python for Data Science?
Python is a general-purpose, high-level programming language that bills itself as powerful, fast, friendly, open, and easy to learn. Python “plays well with others” and “runs everywhere” (according to the language’s About page).
Conceived in the late 1980s (and named for comedy group Monty Python), Python didn’t make inroads into data science until recently. For a long time, as Tal Yarkoni of UT Austin says, “you couldn’t really do statistics in Python unless you wanted to spend most of your time pulling your hair out.”
Now, however, tools for almost every aspect of scientific computing are readily available in Python, thanks in part, no doubt, to the $3 million that the Defense Advanced Research Projects Agency (DARPA) put toward the development of data analytics and data processing libraries for Python in late 2012.
Bank of America uses Python to crunch financial data. The Theoretical Physics Division of Los Alamos National Laboratory chose Python not only to control simulations but also to analyze and visualize data. Facebook turns to the Python library Pandas for its data analysis because it sees the benefit of using one programming language across multiple applications.
“One of the reasons we like to use Pandas is because we like to stay in the Python ecosystem,” Burc Arpat, a quantitative engineering manager at Facebook, told Fast Company in May 2014.
Python or R?
Python’s increased use in data science applications has situated it in opposition to R, a programming language and software environment specifically designed to execute the sorts of data analysis tasks Python can now handle. As speculation mounts about whether one of the languages will eventually replace the other in the data science sphere, individuals have to decide which language to learn or which to use for a specific project.
This is where a recently unveiled infographic from DataCamp can help. It provides a side-by-side comparison of the two programming languages from a data science and statistics perspective. So the next time you’re debating whether to use R or Python for machine learning, statistics, or the Internet of Things… “Data Science Wars: R vs. Python” to the rescue!
Five Python Libraries for Data Science
While there are many libraries available to perform data analysis in Python, here are a few to get you started:
- NumPy is fundamental for scientific computing with Python. It supports large, multi-dimensional arrays and matrices and includes an assortment of high-level mathematical functions to operate on these arrays.
- SciPy works with NumPy arrays and provides efficient routines for numerical integration and optimization.
- Pandas, also built on top of NumPy, offers data structures and operations for manipulating numerical tables and time series.
- Matplotlib is a 2D plotting library that can generate such data visualizations as histograms, power spectra, bar charts, and scatterplots with just a few lines of code.
- Built on NumPy, SciPy, and Matplotlib, Scikit-learn is a machine learning library that implements classification, regression, and clustering algorithms including support vector machines, logistic regression, naive Bayes, random forests, and gradient boosting.
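To make the descriptions above a little more concrete, here is a minimal sketch that touches three of these libraries: a NumPy array operation, SciPy's numerical integration and optimization routines, and a small Pandas time series. The data values and variable names (`matrix`, `sales`, `units`, and so on) are invented for illustration and don't come from any real project.

```python
import numpy as np
import pandas as pd
from scipy import integrate, optimize

# NumPy: build a 2 x 3 array and apply a vectorized operation to it.
matrix = np.arange(6, dtype=float).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
column_means = matrix.mean(axis=0)                # mean of each column

# SciPy: numerically integrate x**2 over [0, 3] (the exact answer is 9).
area, abs_error = integrate.quad(lambda x: x ** 2, 0.0, 3.0)

# SciPy: find the x that minimizes (x - 2)**2 (the minimum is at x = 2).
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2)

# Pandas: a tiny daily time series (made-up numbers) with a 2-day rolling mean.
dates = pd.date_range("2024-01-01", periods=4, freq="D")
sales = pd.DataFrame({"units": [10, 12, 9, 15]}, index=dates)
rolling_mean = sales["units"].rolling(window=2).mean()

print(column_means)
print(area, result.x)
print(rolling_mean)
```

Matplotlib and Scikit-learn are left out to keep the sketch short, but both accept NumPy arrays and Pandas columns like the ones created here.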
Want more? Here’s a longer list of Python libraries useful for data science applications.
Online Courses in Python for Data Science
Since Python is a general-purpose language, many of the educational resources available are not specific to data science. Here, though, are a handful that are:
- Learning Python for Data Analysis and Visualization from Udemy: Assuming no prior knowledge of Python, this course promises “full understanding of how to program with Python and how to use it in conjunction with scientific computing modules and libraries to analyze data.”
- Data Analysis in Python with Pandas from Udemy: Designed for those with intermediate programming ability, this course guides students to mastery of fundamental data analysis methods in Python.
- Applied Data Science with Python from Udemy: This course provides “extensive, end-to-end coverage of all activities performed in a Data Science project” and aims to equip students to become developers capable of executing full-fledged data science projects. (Some experience with both Python and data analysis is preferred.)
- CS109 Data Science from Harvard University: This Ivy League introduction to data science uses Python for all programming assignments and projects. Slides and video lectures are available online free of charge, and the IPython notebooks for the course are on GitHub.
- Intro to Python for Data Science from DataCamp: This course focuses specifically on Python for data science, teaching ways to manipulate data and introducing data science tools so that students can start doing analysis right away.
- Data Science: Linear Regression in Python from Udemy: This course introduces students to linear regression, a popular data science technique. Students will learn how to create their own linear regression model in Python and use it to solve real-world problems.
- Data Science: Logistic Regression in Python from Udemy: This course begins with the theory of logistic regression, followed by lessons on how to use this popular technique yourself. It’s a great first step for anyone interested in deep learning and neural networks.
- Data Science: Deep Learning in Python from Udemy: Students will learn how to build artificial neural networks, use the popular training method called “backpropagation,” and much more. Great for both students and working professionals!