Data Scientist Foundations: The Hard and Human Skills You Need

On the whole, top companies increasingly expect data scientists to have advanced degrees (M.S. or Ph.D.) in a quantitative field.

It hasn’t always been so –

  • A 2011 EMC survey proudly pointed out that 40 percent of data scientists had master’s degrees or better. More are needed now.
  • Many data gurus – election forecaster Nate Silver, Moneyball’s Paul DePodesta and Cloudera’s Jeff Hammerbacher among them – have only bachelor’s degrees.

But a one- or two-year M.S. with some solid work experience is quickly becoming the norm.

It doesn’t have to be in data science. The field is currently full of graduates of econometrics, physics, statistics, computer science, applied mathematics and engineering.

The real trick – the one that will land you the dream job or set you on a path to start-up stardom – is assembling a strong set of data skills.

I’ve split these suggested skill sets into two categories:

  1. Hard (i.e., technical)
  2. Human (i.e., interpersonal)

Companies are currently looking for both.

HARD SKILLS

Mathematics (Other Than Statistics)

Basic undergraduate courses typically cover calculus and linear algebra. They’re important to know, but they don’t go far enough for big data applications.

Once you have those “basics” under your belt, it’s time to dig into matrix computation, diffusion geometry, and similar topics in applied mathematics.

As Charles Roe points out:

“Most data mining applications use matrix computations as their fundamental algorithms, so a strong understanding of them is essential.”
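
To give a feel for what a "matrix computation" looks like in practice, here is a minimal sketch of power iteration, a classic way to estimate a matrix's dominant eigenvalue using nothing but repeated matrix-vector products. The 2x2 matrix, the function names and the iteration count are illustrative assumptions, not something taken from Roe.

```python
# Power iteration: repeated matrix-vector products converge toward the
# dominant eigenvector. The matrix and iteration count are illustrative.

def mat_vec(matrix, vector):
    """Multiply a square matrix (a list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def power_iteration(matrix, iterations=100):
    """Estimate the dominant eigenvalue and a unit-length eigenvector."""
    vector = [1.0] * len(matrix)
    for _ in range(iterations):
        product = mat_vec(matrix, vector)
        norm = sum(x * x for x in product) ** 0.5
        vector = [x / norm for x in product]
    # Rayleigh quotient: v . (A v) for a unit-length v
    eigenvalue = sum(v * p for v, p in zip(vector, mat_vec(matrix, vector)))
    return eigenvalue, vector


example = [[2.0, 1.0],
           [1.0, 3.0]]
eigenvalue, eigenvector = power_iteration(example)
print(round(eigenvalue, 4))
```

In real work you would reach for a numerical library rather than hand-rolled loops, but the underlying idea is the same.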

Naturally, many of these courses will cross over into …

Statistics

Wherever you go in this field, you need to have a solid grounding in statistics. A verifiable background in statistical analysis (through work or education) is a prerequisite for most data science jobs.

As Michael Sanders explains:

“Understanding correlation, multivariate regression and all aspects of massaging data together to look at it from different angles for use in predictive and prescriptive modeling is the backbone knowledge that’s really step one of revealing intelligence.”

Which means interviewers are going to be looking for core competencies in statistical tools such as:

  • R
  • SAS
  • SPSS
  • SciPy
  • Stata

Initially written by a couple of Kiwi statisticians, R has become particularly popular in the past few years, especially in start-up environments. It’s also free, which may have something to do with its success.

Check out the R-Project for more.
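
The tools above are where this work usually happens, but the ideas Sanders mentions are easy to try out in any language. As a rough sketch, here are correlation and a single-predictor least-squares fit (a simpler cousin of the multivariate regression he refers to) in plain Python. The `ad_spend` and `sales` figures are invented purely for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def least_squares(xs, ys):
    """Slope and intercept of the ordinary least-squares line y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


# Hypothetical data, made up for this example
ad_spend = [1, 2, 3, 4, 5]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson(ad_spend, sales)
slope, intercept = least_squares(ad_spend, sales)
print(round(r, 3), round(slope, 2), round(intercept, 2))
```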

Programming/Scripting Languages

This one’s becoming a tie-breaker. Being able to write code can and will give you a potentially critical lead in the job hunt. You’ll have the skills to do your own testing of your hypotheses, find workarounds for problems and simply work smarter.

You probably need to learn some of these programming languages:

  • Python
  • C/C++
  • Java
  • Ruby
  • Perl
  • MATLAB
  • Pig

Python is a good one to have in your tool belt. In his coverage of a 2013 Predictive Analytics World conference, Derrick Harris quoted Sameer Chopra, Orbitz VP of Advanced Analytics:

“If you were to leave today and ask: ‘What specific skills should I learn?’ Python.”

Relational Databases

A solid understanding of SQL-based systems is a must. Learn the fundamentals of database design and management. Get a handle on primary and foreign keys, indexing, querying, normalization, constraints and other basic features.

And keep an eye on NewSQL – “highly-scalable, horizontally-distributed systems” like Cloudera Impala, Clustrix, VoltDB, etc. This is an ongoing effort to combine the power of NoSQL for big, messy data with the rigorous and reliable structure of traditional relational databases.
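
The relational fundamentals mentioned above (primary and foreign keys, constraints, indexing, querying) can be tried out without installing a database server. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `customers` and `orders` tables and their contents are hypothetical, invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount      REAL NOT NULL CHECK (amount > 0)
    );
    CREATE INDEX idx_orders_customer ON orders(customer_id);
""")

conn.executemany("INSERT INTO customers (id, name) VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace")])
conn.executemany("INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
                 [(1, 20.0), (1, 35.5), (2, 12.0)])

# Join the two tables on the key and total the orders per customer
totals = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

# The foreign key constraint rejects an order for a customer that does not exist
try:
    conn.execute("INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
                 (99, 5.0))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(totals, rejected)
```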

Distributed Computing Systems

Get used to the Apache product family. Bone up on NoSQL platforms. Be prepared to argue the merits of MongoDB vs. Cassandra. Try out some of the free resources at Big Data University.

Seek hands-on experience with:

  • Hadoop
  • HBase
  • Cassandra
  • MapReduce
  • Hive
  • and all the new packages and approaches that continue to proliferate

Learn to speak the language of caching, sharding and scalability. It’s not that hard, and anyway, data scientists are “sexy” now, remember?
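
If MapReduce sounds abstract, the core idea fits in a few lines. This is a single-machine sketch of the map, shuffle and reduce steps applied to a word count. It only illustrates the programming model; it does not use Hadoop's actual API, and the function names and sample documents are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


# Hypothetical input; on a real cluster each document could sit on a different node
documents = ["big data big insight", "data beats opinion"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```

On a real cluster the map and reduce calls run in parallel across many machines, which is where the scalability comes from.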

Data Mining

Data miners dig deep into big data sets to discover new and interesting patterns. Cluster analysis, anomaly detection, dependencies – the diamonds in the rough.

Interested? Find a class or ferret out a work mentor. Most data mining syllabi will touch on key concepts, practical applications and common algorithms.

Northwestern’s data course, for example, covers both supervised algorithms (Naïve Bayes, Decision Tree, Neural Network, etc.) and unsupervised ones (Association Rules, Clustering, etc.).

Above all, stay open to possibilities. Data mining is interdisciplinary by its very nature, drawing from AI and machine learning, statistics and database systems.
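
To make clustering a little more concrete, here is a bare-bones sketch of k-means on one-dimensional data. It is a toy under stated assumptions: the starting centroids are simply the first k values (real implementations choose them more carefully), and the sample numbers are invented.

```python
def k_means_1d(values, k, iterations=10):
    """Cluster numbers into k groups by alternating assignment and averaging."""
    centroids = list(values[:k])  # simplistic, deterministic starting point
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid
        clusters = [[] for _ in range(k)]
        for value in values:
            nearest = min(range(k), key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical measurements with two obvious groups
readings = [1.0, 9.0, 1.5, 10.0, 2.0, 11.0]
centroids, clusters = k_means_1d(readings, k=2)
print(centroids, clusters)
```

No labels were supplied: the algorithm finds the two groups on its own, which is what sets it apart from the supervised methods.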

Data Modeling

Love it or not, it’s going to come in handy. Even if you’re not creating the models yourself, you’d best know how to parse them, present them to higher-ups and judge whether you can work with them.

As Sanders says:

“Knowing the difference between a fact table that is put together well and one that is faulty with semi-structured unconstrained keys makes all the difference in how easily you can trust and massage the data you’re trying to capture.”

For those getting into the game, Roe recommends starting with data modeling tools, techniques and methodologies like:

  • ERWin
  • Agile Data Modeling
  • ORM Diagrams
  • UML class diagrams
  • CRC cards
  • Conceptual/logical/physical schema
  • DDL
  • Bachman diagrams
  • Zachman Framework

Predictive Modeling

Every business wishes for a crystal ball so they’ll know what the future holds. Failing that, they’ll settle for a data scientist with outstanding predictive analytics skills.

Predictive chops are a must for employment. Harris goes so far as to class predictive modeling as one of a data scientist’s four core competencies (along with SQL, statistics and programming).

“If you don’t have at least a grounding in these skills, you’re probably not getting through the door, in part because they form a common language that lets people from different backgrounds talk to each other.”

Want to make back some of your education costs?

Test your skills against the best on Kaggle, a crowdsourced platform for data predictions. Companies and organizations regularly award prizes for the best solutions to their predictive-modeling needs.

Machine Learning

If you’re hankering for a paycheck of Silicon Valley proportions, what better place to find it than the world of machine learning? Some may find it a little scary, but really – they’re just machines!

Machine learning is variously defined as the:

  • Ability of a machine to improve its own performance through artificial intelligence
  • Use of computers to develop and improve algorithms
  • Science of getting computers to act without being explicitly programmed
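
The third definition is easier to appreciate with a tiny example. In the sketch below, a perceptron (one of the oldest learning algorithms) works out the logical OR function from four labeled examples, without the rule ever being written into the code. The learning rate, epoch count and function names are arbitrary choices made for illustration.

```python
def predict(weights, bias, x1, x2):
    """Return 1 if the weighted sum is positive, otherwise 0."""
    return 1 if weights[0] * x1 + weights[1] * x2 + bias > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=0.1):
    """Nudge the weights after every wrong prediction."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            error = target - predict(weights, bias, x1, x2)
            weights[0] += learning_rate * error * x1
            weights[1] += learning_rate * error * x2
            bias += learning_rate * error
    return weights, bias


# Training data: inputs paired with the output we want (logical OR)
or_samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

weights, bias = train_perceptron(or_samples)
learned = [predict(weights, bias, x1, x2) for (x1, x2), _ in or_samples]
print(learned)
```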

Machine learning, formerly the province of science fiction, is now making a regular appearance in lists of data science job requirements.

You don’t necessarily have to pay for it. Andrew Ng’s free Machine Learning course on Coursera has produced a number of distinguished alumni, including Kaggle winners like Xavier Conort.

Visualization

You can search, scrub and mine data to your heart’s content, but in the end, it all comes down to showcasing your findings in a way that business users will understand.

This can be achieved with visualization tools such as:

  • Flare
  • HighCharts
  • AmCharts
  • D3.js
  • Processing
  • Google Visualization API
  • Raphael.js
  • Tableau

It’s the all-important end step. Always keep in mind: the clearer your findings, the easier the decision, the quicker the outcome, and the higher the praise for all your hours of hard work.

HUMAN SKILLS

Domain Expertise

Perhaps the first commandment of data science is “Know Thy Data.”

That usually means having a deep and abiding interest in your field of expertise (e.g., medicine, government, retail, manufacturing, etc.) and a total understanding of your organization’s data.

How can you cultivate those two desirables?

  • Ask idiotic questions.
  • Become familiar with the systems.
  • Explore the products.
  • Learn how the data is collected and how it’s being used.
  • Get to know the people who are involved in each step of that collection and use.

It does take a lot of time, but that time will help you win or keep the job! As Bill Franks, Teradata Chief Analytics Officer, points out:

“I’ve never heard anyone discuss a data science profile without talking about understanding the business. Again, it’s critical to have the person running the analysis fully understand – and be interested in – why this question is being asked, what the business person would do given the results, and why they would make that decision.”

Creativity and Curiosity

Creative data scientists are optimists. Every day, they look at a hodgepodge of incomplete and messy data, inadequate analytics, faulty methods and models, and seemingly insoluble business problems, and they say:

“I got this!”

Creative data scientists are curious. They aren’t afraid of playing around in unstructured environments, of proceeding by trial and error, of following the white rabbit down the hole.

Creative data scientists experiment. They blend CRM transaction records with traffic reports; they entangle multiple systems and data sets; they hack across a dizzying array of incompatible data sources.

Above all, creative data scientists get hired.

Need proof? As part of its hiring process, Netflix asks each of its data science applicants to come up with a framework that solves a problem of the interviewer’s choice.

Storytelling

Once you’ve struck oil in that data, you’ll be responsible for selling your discovery to everyone in the organization.

It’s not always easy:

  • You may be presenting your findings to a room full of people who have absolutely no clue what it is that you do.
  • You may be fighting to persuade them that what you’re suggesting will actually have long-term benefits.
  • You may be struggling to speak to them in a language they’ll understand at all.

Presentations alone often fail to get the message across to an audience. So why not tell them a story instead?

Weave a narrative around your data. Show your audience (using those outstanding visualization skills) where they’ve come from, where they are now and where they could be in the future.

If you’re an introvert by nature, don’t be afraid to take a communication course. It looks great on your résumé and – better yet – makes cocktail hour a lot easier to live through.

Project Management

In a commercial environment, your job is to produce actionable, profitable insights from data.

If you’re in a leadership position, that can mean pressure. You’ll have to grasp the wider context, delegate tasks, be conscious of budget and time constraints and play to the strengths of the folks working for you.

It’s a skill that’s honed by practice. If you’re considering master’s programs, look for ones that offer team-based internships and practicums, so you can get that practice.

Ethics

As a data scientist, you are guaranteed to be on the front line of future ethics debates:

  • Examining a web user’s data can invade their privacy
  • Monitoring communications has legal implications
  • Placing sensors in household objects raises red flags
  • Hoarding data indiscriminately makes the FTC very unhappy
  • Predictive models can have disastrous results for unlucky individuals

Indeed, as FTC Commissioner Edith Ramirez put it at the Technology Policy Institute Aspen Forum in 2013, in this brave new world of data science:

“Individuals may be judged not because of what they’ve done, or what they will do in the future, but because inferences or correlations drawn by algorithms suggest they may behave in ways that make them poor credit or insurance risks, unsuitable candidates for employment or admission to schools or other institutions, or unlikely to carry out certain functions.”

These are risks your company can’t afford to ignore. Whatever the project, you’ll need to consider what effect your work will have on the lives of customers and clients.

The Elevator Speech

In the end, it comes down to a few simple must-haves.

I’ll step aside and let Harris explain what Chris Pouliot, Director of Algorithms and Analytics at Netflix, is really looking for in candidates:

“An advanced degree in a quantitative field; hands-on experience hacking data (ideally using Hive, Pig, SQL or Python); good exploratory analysis skills; the ability to work with engineering teams; and the ability to generate and create algorithms and models rather than relying on out-of-the-box ones.”