How to Deal with Missing Data

Missing data can skew any analysis a data scientist performs, from economic analysis to clinical trials. After all, any analysis is only as good as the data. A data scientist doesn’t want to produce biased estimates that lead to invalid results. The concept of missing data is implied in the name: it is a value that was not captured for a variable in the observation in question. Missing data reduces the statistical power of the analysis, which can distort the validity of the results, according to an article in the Korean Journal of Anesthesiology.


Fortunately, there are proven techniques to deal with missing data.

Imputation vs. Removing Data

When dealing with missing data, data scientists can use two primary methods to address the problem: imputation or the removal of data.

The imputation method develops reasonable guesses for missing data. It’s most useful when the percentage of missing data is low. If the portion of missing data is too high, the results lack natural variation that could result in an effective model.

The other option is to remove data. When data is missing completely at random, observations with missing values can be deleted without biasing the results. Removing data may not be the best option if too few observations would remain for a reliable analysis. In some situations, observation of specific events or factors may be required.

Before deciding which approach to employ, data scientists must understand why the data is missing.

Missing at Random (MAR)

Missing at Random means the probability that a value is missing depends on the observed data, not on the missing values themselves. The data is not missing across all observations but only within sub-samples of the data. Because the missingness is explained by observed variables, the missing data can be predicted based on the complete observed data.

Missing Completely at Random (MCAR) 

In the MCAR situation, the data is missing across all observations regardless of the expected value or other variables. Data scientists can compare two sets of data, one with missing observations and one without. If a t-test finds no significant difference between the two data sets, the data is consistent with MCAR, although the test cannot prove it.
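The comparison described above can be sketched as follows. This is a minimal illustration, not a complete MCAR test: the records, the variable names and the choice of Welch's form of the t statistic are all assumptions made for the example. It splits the rows by whether income is missing and compares the ages of the two groups.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error

# Hypothetical records of (age, income); None marks a missing income.
records = [(25, 40), (31, None), (38, 52), (44, 61), (29, None),
           (52, 70), (35, 48), (47, None), (41, 58), (33, 45)]

# Split an observed variable (age) by whether income is missing.
age_when_income_missing = [age for age, income in records if income is None]
age_when_income_present = [age for age, income in records if income is not None]

t_stat = welch_t(age_when_income_missing, age_when_income_present)
print(f"t statistic: {t_stat:.2f}")
```

A t statistic close to zero suggests the two groups look alike on age. Turning the statistic into a p-value requires the t distribution, which a statistics library such as SciPy provides through `scipy.stats.ttest_ind`.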

Data may be missing due to test design, failure in the observations or failure in recording observations. This type of data is seen as MCAR because the reasons for its absence are external and not related to the value of the observation.

 

It is typically safe to remove MCAR data because the results will be unbiased. The test may not be as powerful, but the results will be reliable.

Missing Not at Random (MNAR)

The MNAR category applies when the missing data has a structure to it. In other words, there appear to be reasons the data is missing, and those reasons are related to the missing values themselves. In a survey, perhaps a specific group of people, say women ages 45 to 55, did not answer a question. Unlike MAR, the missing values cannot be determined from the observed data, because the missing information is unknown. Data scientists must model the missing data to develop an unbiased estimate. Simply removing observations with missing data could result in a model with bias.

Deletion

There are three primary methods for deleting data when dealing with missing data: listwise deletion, pairwise deletion and dropping variables.

Listwise

In this method, all data for an observation that has one or more missing values are deleted. The analysis is run only on observations that have a complete set of data. If the data set is small, it may be the most efficient method to eliminate those cases from the analysis. However, in most cases, the data are not missing completely at random (MCAR). Deleting the instances with missing observations can result in biased parameters and estimates and reduce the statistical power of the analysis. 
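As a minimal sketch of listwise deletion, assume each observation is a dictionary and that `None` marks a missing value. The rows and field names below are hypothetical.

```python
# Hypothetical survey rows; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "score": 7},
    {"age": None, "income": 48000, "score": 6},
    {"age": 45, "income": None, "score": 8},
    {"age": 29, "income": 39000, "score": 5},
]

def listwise_delete(rows):
    """Keep only the rows in which every field has a value."""
    return [row for row in rows if all(value is not None for value in row.values())]

complete_rows = listwise_delete(rows)
print(len(rows), "rows before,", len(complete_rows), "rows after")
```

Half of this small data set is lost to just two missing values, which illustrates the loss of statistical power noted above.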

Pairwise

Pairwise deletion assumes data are missing completely at random (MCAR), but every case is used in each calculation for which it has the needed values, so cases with some missing data still contribute to the analysis. Pairwise deletion allows data scientists to use more of the data. However, the resulting statistics may vary because they are based on different subsets of the data. The results may be impossible to duplicate with a complete set of data.
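One way to picture pairwise deletion is a correlation computed separately for each pair of variables, using whichever cases have both values. The sketch below assumes three hypothetical variables stored as parallel lists, with `None` for a missing value.

```python
from math import sqrt

def pairwise_corr(xs, ys):
    """Pearson correlation over the cases where both values are present.

    Returns the correlation and the number of cases it was based on.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y), n

# Hypothetical variables; None marks a missing value.
age = [34, None, 45, 29, 51, 38]
income = [52, 48, None, 39, 75, 58]
score = [7, 6, 8, 5, 9, 7]

r_age_income, n_age_income = pairwise_corr(age, income)
r_age_score, n_age_score = pairwise_corr(age, score)
r_income_score, n_income_score = pairwise_corr(income, score)
print(n_age_income, n_age_score, n_income_score)
```

Each correlation here rests on a different subset of cases, which is why statistics from pairwise deletion can be hard to reconcile with one another.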

Dropping Variables

If a variable is missing for more than 60% of the observations, it may be wise to discard the variable, provided it is insignificant to the analysis.
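A rough sketch of this rule follows. The function name, the `max_missing` parameter and the sample rows are hypothetical, and the code checks only the share of missing values. Judging whether a variable is insignificant still takes domain knowledge.

```python
def drop_sparse_columns(rows, max_missing=0.6):
    """Drop every column whose share of missing values exceeds max_missing."""
    columns = list(rows[0].keys())
    keep = [
        col for col in columns
        if sum(row[col] is None for row in rows) / len(rows) <= max_missing
    ]
    return [{col: row[col] for col in keep} for row in rows]

# Hypothetical rows; the "fax" field is missing in 3 of 4 observations (75%).
rows = [
    {"age": 34, "income": 52000, "fax": None},
    {"age": None, "income": 48000, "fax": None},
    {"age": 45, "income": 61000, "fax": "555-0100"},
    {"age": 29, "income": 39000, "fax": None},
]

trimmed = drop_sparse_columns(rows)
print(list(trimmed[0].keys()))
```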

Imputation

When data is missing, it may make sense to delete data, as mentioned above. However, that may not be the most effective option. For example, if too much information is discarded, it may not be possible to complete a reliable analysis. Or there may be insufficient data to generate a reliable prediction for observations that have missing data.

Instead of deletion, data scientists have multiple solutions to impute the value of missing data. Depending on why the data are missing, imputation methods can deliver reasonably reliable results. These are examples of single imputation methods for replacing missing data.

Mean, Median and Mode

This is one of the most common methods of imputing values when dealing with missing data. In cases where there are a small number of missing observations, data scientists can calculate the mean or median of the existing observations, or the mode for categorical variables, and use it in place of each missing value. However, when there are many missing values, mean or median imputation can result in a loss of variation in the data. This method does not use time-series characteristics or depend on the relationship between the variables.
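A minimal sketch of this approach, assuming `None` marks a missing value. The `impute` helper and the sample data are hypothetical.

```python
from statistics import mean, median, mode

def impute(values, strategy=mean):
    """Replace each missing value with a summary of the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]

# Hypothetical data; None marks a missing value.
ages = [23, 25, None, 31, 25, None, 40]
colors = ["red", None, "blue", "red", None]

ages_mean = impute(ages, mean)
ages_median = impute(ages, median)
colors_mode = impute(colors, mode)
print(ages_mean, ages_median, colors_mode, sep="\n")
```

Every gap in a column receives the same fill value, which is the loss of variation described above.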

Time-Series Specific Methods 

Another option is to use time-series specific methods when appropriate to impute data. There are four types of time-series data:

  • No trend or seasonality.
  • Trend, but no seasonality.
  • Seasonality, but no trend.
  • Both trend and seasonality.

The time series methods of imputation assume the adjacent observations will be like the missing data. These methods work well when that assumption is valid. However, these methods won’t always produce reasonable results, particularly in the case of strong seasonality.  

Last Observation Carried Forward (LOCF) & Next Observation Carried Backward (NOCB)

These options are used to analyze longitudinal repeated measures data, in which follow-up observations may be missing. Longitudinal data track the same instance at different points along a timeline. In LOCF, every missing value is replaced with the last observed value; in NOCB, it is replaced with the next observed value. These methods are easy to understand and implement. However, they may introduce bias when data has a visible trend, because they assume the value did not change between the observed point and the missing one.
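Both fills can be sketched in a few lines, assuming the observations are already in time order and `None` marks a missing value. The function names and readings are hypothetical.

```python
def locf(values):
    """Replace each missing value with the most recent observed value."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled

def nocb(values):
    """Replace each missing value with the next observed value."""
    # Carrying backward is carrying forward on the reversed series.
    return locf(values[::-1])[::-1]

# Hypothetical repeated measurements in time order.
readings = [None, 5.0, None, None, 6.5, None]

print(locf(readings))
print(nocb(readings))
```

A gap at the very start cannot be filled by LOCF, and a gap at the very end cannot be filled by NOCB, so those positions stay `None` in this sketch.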

Linear Interpolation

Linear interpolation is often used to approximate a value of some function by using two known values of that function at other points. The interpolated value can also be understood as a weighted average of the two known values. The weights are inversely related to the distance from the end points to the unknown point. The closer point has more influence than the farther point.

When dealing with missing data, you should use this method in a time series that exhibits a trend line, but it’s not appropriate for seasonal data.
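The weighted average can be sketched as below, assuming equally spaced observations and `None` for a missing value. The function name and the sample series are hypothetical.

```python
def interpolate_linear(values):
    """Fill interior gaps on a straight line between the nearest known points."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            # Weight of the right end point grows as i moves toward it.
            weight = (i - left) / (right - left)
            filled[i] = values[left] * (1 - weight) + values[right] * weight
    return filled

# Hypothetical series with a steady upward trend.
series = [10.0, None, None, 16.0, None, 20.0]
print(interpolate_linear(series))
```

Gaps before the first or after the last known value have only one neighbor, so this sketch leaves them unfilled.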

Seasonal Adjustment with Linear Interpolation

When dealing with data that exhibits both trend and seasonality characteristics, use seasonal adjustment with linear interpolation. First you would perform the seasonal adjustment by computing a centered moving average or taking the average of multiple averages – say, two one-year averages – that are offset by one period relative to another. You can then complete data smoothing with linear interpolation as discussed above.
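The idea can be sketched in a simplified form. Note the substitution: instead of a centered moving average, this example estimates the seasonal effect as the average of the observed values at each position in the cycle, minus the overall average. It assumes a known `period`, equally spaced observations and at least one observed value at every position in the cycle. All names and data are hypothetical.

```python
from statistics import mean

def interpolate_linear(values):
    """Fill interior gaps on a straight line between the nearest known points."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            weight = (i - left) / (right - left)
            filled[i] = values[left] * (1 - weight) + values[right] * weight
    return filled

def seasonal_interpolate(values, period):
    """Remove a simple seasonal effect, interpolate, then add the effect back."""
    overall = mean(v for v in values if v is not None)
    # Seasonal effect for each position in the cycle (simplified estimate).
    effect = [
        mean(v for v in values[pos::period] if v is not None) - overall
        for pos in range(period)
    ]
    adjusted = [
        None if v is None else v - effect[i % period]
        for i, v in enumerate(values)
    ]
    smoothed = interpolate_linear(adjusted)
    return [
        values[i] if values[i] is not None
        else (None if s is None else s + effect[i % period])
        for i, s in enumerate(smoothed)
    ]

# Hypothetical series that alternates low and high every period.
series = [10.0, 20.0, 10.0, 20.0, 10.0, None, 10.0, 20.0]
print(seasonal_interpolate(series, period=2))
print(interpolate_linear(series))
```

On this series, plain linear interpolation fills the gap with the low value of its two neighbors, while the seasonally adjusted version recovers a value close to the high point of the cycle.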

Multiple Imputation

Multiple imputation is considered a good approach for data sets with a large amount of missing data. Instead of substituting a single value for each missing data point, the missing values are exchanged for values that encompass the natural variability and uncertainty of the right values. Using the imputed data, the process is repeated to make multiple imputed data sets. Each set is then analyzed using the standard analytical procedures, and the multiple analysis results are combined to produce an overall result.

The various imputations incorporate natural variability into the missing values, which creates a valid statistical inference. Multiple imputations can produce statistically valid results even when there is a small sample size or a large amount of missing data.
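The impute, analyze and combine cycle can be sketched as follows. This is a deliberately simplified illustration. It fills each gap with a random draw from a normal distribution fitted to the observed values, whereas fuller implementations also account for uncertainty in the fitted parameters and usually predict from other variables. The results are combined using Rubin's rules, a standard way to pool multiply imputed estimates. The function name, the parameters `m` and `seed`, and the data are hypothetical.

```python
import random
from statistics import mean, stdev, variance

def multiple_imputation_mean(values, m=20, seed=0):
    """Estimate the mean of a variable from m imputed copies of the data."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mu, sigma = mean(observed), stdev(observed)
    n = len(values)

    estimates, within = [], []
    for _ in range(m):
        # Impute: draw a plausible value for every gap.
        completed = [rng.gauss(mu, sigma) if v is None else v for v in values]
        # Analyze: estimate the mean and its squared standard error.
        estimates.append(mean(completed))
        within.append(variance(completed) / n)

    # Combine (Rubin's rules): average the estimates, then add the
    # within-imputation and between-imputation variance.
    pooled = mean(estimates)
    total_variance = mean(within) + (1 + 1 / m) * variance(estimates)
    return pooled, total_variance

# Hypothetical measurements; None marks a missing value.
values = [4.1, None, 5.3, 4.8, None, 6.0, 5.1, None, 4.4, 5.6]
pooled, total_variance = multiple_imputation_mean(values)
print(f"pooled mean: {pooled:.2f}, total variance: {total_variance:.3f}")
```

The between-imputation term is what single imputation leaves out: it widens the uncertainty to reflect that the imputed values are guesses.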

K Nearest Neighbors 

In this method, data scientists find the k observations closest to the one with the missing value and combine their values to impute an estimate. The data scientist must select the number of nearest neighbors and the distance metric. KNN can impute the most frequent value among the neighbors for a categorical variable or the mean among the nearest neighbors for a numeric variable.
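A minimal sketch for a numeric variable, assuming Euclidean distance, `None` for a missing value, and that only the target column has gaps. The function name and the data are hypothetical.

```python
from math import dist
from statistics import mean

def knn_impute(rows, target, k=2):
    """Fill missing values in column `target` with the mean of the
    k nearest rows that have a value in that column."""
    features = [i for i in range(len(rows[0])) if i != target]
    donors = [row for row in rows if row[target] is not None]
    filled = []
    for row in rows:
        row = list(row)
        if row[target] is None:
            point = [row[f] for f in features]
            # Rank donors by Euclidean distance on the other columns.
            nearest = sorted(
                donors, key=lambda d: dist(point, [d[f] for f in features])
            )[:k]
            row[target] = mean(d[target] for d in nearest)
        filled.append(row)
    return filled

# Hypothetical rows of (feature 1, feature 2, target).
rows = [
    [1.0, 1.0, 10.0],
    [1.2, 0.9, 12.0],
    [8.0, 9.0, 50.0],
    [7.5, 9.5, 54.0],
    [1.1, 1.0, None],
]

filled = knn_impute(rows, target=2, k=2)
print(filled[4])
```

Because distances depend on the units of each feature, features are usually put on a comparable scale before the neighbors are chosen.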

Learn More About Data Science

When working as a data scientist, you often will be faced with imperfect data sets. Analyzing data with missing information is an important part of work as a data scientist. Advancing your career in data science can help you learn to tackle these issues and more. 

MastersInDataScience.org is owned and operated by 2U, Inc.
© 2U, Inc. 2022
