What Is Exploratory Data Analysis?

Exploratory data analysis, often shortened to EDA, is the process of inspecting, summarizing, and visualizing data before making formal conclusions or building a model. It helps analysts understand what is in a dataset, what might be missing or unreliable, and which patterns or relationships are worth investigating further.

Data may live in spreadsheets, databases, data warehouses, data lakes, survey exports, application logs, or other systems. In a tabular dataset, rows often represent individual records and columns often represent variables, but real-world data is rarely neat at first glance. EDA gives analysts a structured way to check distributions, find outliers, spot data-quality issues, and decide what to do next.

What Is Exploratory Data Analysis? EDA Definition

Exploratory data analysis is an approach to data analysis that uses summary statistics, visualizations, and careful questioning to understand a dataset before moving into modeling, reporting, or decision-making. It is not limited to large datasets. EDA can be useful for a small survey file, a monthly sales spreadsheet, a clinical research dataset, or a large collection of application logs.

The National Institute of Standards and Technology (NIST) describes EDA as an approach to data analysis, not a model, that uses a variety of mostly graphical techniques to:

Maximize insights into a dataset.
Uncover underlying structures.
Extract important variables.
Detect outliers and anomalies.
Test underlying assumptions.
Develop parsimonious models.
Determine optimal factor settings.

In practical terms, EDA helps analysts avoid jumping too quickly to a conclusion. Instead of assuming a dataset is complete, accurate, or ready for a model, the analyst looks for evidence of missing values, duplicate records, unusual values, skewed distributions, relationships between variables, and signs that additional cleaning or domain context is needed.

EDA is commonly used to:

Understand the structure, size, and fields in a dataset.
Summarize individual variables and trends over time.
Check data quality, including missing values, duplicates, inconsistent labels, and outliers.
Evaluate assumptions before statistical analysis or machine learning.
Explore relationships between variables without treating early findings as final proof.

Why Is Exploratory Data Analysis Important?

EDA matters because most data projects depend on decisions made before a model, dashboard, or report is finalized. Analysts use EDA to understand whether the available data can answer the question being asked and whether any limitations need to be documented.

For students comparing data science programs, EDA is also a helpful example of how data science blends statistics, programming, visualization, and domain knowledge. A model may receive more attention, but exploratory work often determines whether it is built on a sound foundation.

How Does Exploratory Data Analysis Work?

EDA does not follow one fixed checklist, but many projects include a similar sequence of steps:

Clarify the question. Identify what the analysis is supposed to help explain or decide.
Inspect the dataset. Review the number of rows and columns, field names, data types, date ranges, and units of measurement.
Check data quality. Look for missing values, duplicate records, impossible values, inconsistent categories, and other errors.
Summarize variables. Use counts, averages, medians, ranges, percentiles, and frequency tables to understand what each variable contains.
Visualize distributions. Use histograms, box plots, line charts, or density plots to see spread, skew, seasonality, and outliers.
Explore relationships. Use scatterplots, grouped summaries, cross-tabs, heatmaps, or correlation matrices to compare variables.
Document assumptions. Note what the data can and cannot support before moving into modeling, reporting, or recommendations.
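The inspection, data-quality, and summary steps above can be sketched in a few lines of pandas. This is a minimal illustration, not a standard procedure: the `orders` table, its column names, and its values are all hypothetical and were invented to show one duplicate row, two missing values, an inconsistent label, and an impossible negative amount.

```python
import pandas as pd

# Hypothetical order records, invented for illustration only.
orders = pd.DataFrame(
    {
        "order_id": [1, 2, 2, 3, 4, 5],
        "region": ["East", "West", "West", "east", "North", None],
        "amount": [120.0, 80.0, 80.0, None, 95.0, -10.0],
    }
)

# Inspect the dataset: size and data types.
n_rows, n_cols = orders.shape
dtypes = orders.dtypes

# Check data quality: missing values, duplicate rows, impossible values.
missing_per_column = orders.isna().sum()
duplicate_rows = int(orders.duplicated().sum())
negative_amounts = int((orders["amount"] < 0).sum())

# Inconsistent categories often show up as near-identical labels
# (here "East" and "east" are counted separately).
region_counts = orders["region"].value_counts(dropna=False)

# Summarize a variable: count, mean, quartiles, minimum, and maximum.
amount_summary = orders["amount"].describe()

print(f"{n_rows} rows, {n_cols} columns")
print(dtypes)
print(missing_per_column)
print(f"duplicate rows: {duplicate_rows}, negative amounts: {negative_amounts}")
print(region_counts)
print(amount_summary)
```

In a real project, each finding would be recorded along with the decision made about it, such as whether a duplicate is a true repeat or a data-entry error.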

Examples of Exploratory Data Analysis

Exploratory data analysis can be used in many fields, from retail and customer analytics to clinical research and electronic health record analysis. The examples below show how researchers and analysts have used EDA to inspect datasets, identify patterns, understand subgroups, and prepare for later analysis.

EDA in retail sales analysis

In a 2025 conference paper, “Revealing Customer Insights in Retail Store Sales to Drive Growth Through Exploratory Data Analysis,” researchers used EDA on a public retail store dataset to create an annual sales report and help the business better understand its customers. The study used Python and Jupyter Notebook, along with statistical measures and data visualization libraries, to examine sales performance and customer behavior across demographic groups.

The authors describe the dataset as containing customer profile information, basic sales information, and overall business performance data. Their goal was to use exploratory analysis and visualization to support a clearer view of customer behavior and sales performance before planning for the next year.

EDA in a clinical study group

A PLOS ONE study, “Exploratory data analysis of a clinical study group: Development of a procedure for exploring multidimensional data,” used EDA to examine a clinical dataset of 515 elderly patients from the PolSenior project. Each patient was described by more than 40 biochemical and socio-geographical attributes.

The researchers used robust data normalization, outlier detection with both Mahalanobis distance and robust Mahalanobis distance, hierarchical clustering, and principal component analysis. The analysis showed two clusters separated along sex hormone attributes, so the researchers analyzed male and female patients separately. In the male set, the optimal partitioning produced five subgroups, two of which were related to diseased patients: those with diabetes and those with hypogonadism. The female set appeared more homogeneous than the male set, and the researchers reported no evidence of pathological patient subgroups in it.
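For readers unfamiliar with Mahalanobis distance, the sketch below shows the general idea with NumPy: each record's distance from the center of the data is measured in units that account for the spread and correlation of the variables. It is not the study's code. The data is synthetic, the function name `mahalanobis_distances` and the cutoff of 3.5 are assumptions chosen for illustration, and the sketch uses the ordinary sample mean and covariance rather than the robust estimates that a robust Mahalanobis distance would use.

```python
import numpy as np

# Synthetic measurements (rows = records, columns = two attributes).
# Values are made up for illustration and are not from the study.
rng = np.random.default_rng(0)
data = rng.normal(loc=[5.0, 100.0], scale=[1.0, 15.0], size=(200, 2))
data[0] = [12.0, 30.0]  # plant one clearly unusual record


def mahalanobis_distances(x: np.ndarray) -> np.ndarray:
    """Distance of each row from the sample mean, scaled by the covariance."""
    centered = x - x.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(x, rowvar=False))
    # Quadratic form (x - mean)^T S^-1 (x - mean), computed for every row.
    squared = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    return np.sqrt(squared)


distances = mahalanobis_distances(data)

# Hypothetical cutoff: flag rows unusually far from the center.
threshold = 3.5
flagged = np.flatnonzero(distances > threshold)
print("flagged row indices:", flagged)
print("largest distance is at row", int(distances.argmax()))
```

Because an extreme record also inflates the sample mean and covariance it is measured against, the classical distance can understate how unusual that record is, which is one reason analysts may prefer a robust variant.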

EDA in telecom customer churn analysis

In a 2025 Frontiers in Artificial Intelligence paper titled “A predictive analytics approach to improve telecom’s customer retention strategy,” researchers examined customer churn in telecommunications. The paper focuses on developing a predictive model to identify customers at risk of leaving a telecom provider, using data analysis and machine learning techniques to support customer retention decisions.

This is a useful EDA example because churn analysis often begins with exploring customer data before modeling. In this study, the researchers describe the goal as gaining deeper insight into customer behavior and helping telecom companies improve service offerings and reduce churn rates. The example can help readers see how exploratory analysis may fit into a broader predictive analytics workflow, where analysts first examine data patterns before building or testing models.

EDA in electronic health records and epidemiology data

A 2024 Nature Medicine article, “An open-source framework for end-to-end analysis of electronic health record data,” introduced ehrapy, a modular open-source Python framework designed for exploratory analysis of heterogeneous epidemiology and electronic health record data.

The authors describe ehrapy as supporting analytical steps from data extraction and quality control to low-dimensional representations and statistical analysis. They demonstrate the framework in six examples, including stratifying patients with unspecified pneumonia into more specific phenotypes, examining biomarkers associated with survival differences among those groups, analyzing cardiovascular risks across different data modalities, reconstructing disease-state trajectories in patients with SARS-CoV-2 using imaging data, and showing how exploratory analysis can detect and mitigate bias in electronic health record data.

Common EDA Techniques and Tools

EDA methods are often grouped in several ways. Some are graphical, using charts or visualizations to help analysts spot patterns. Others are non-graphical, meaning they rely on summaries, tables, counts, or statistical measures. EDA can also be univariate, focusing on one variable at a time, or multivariate, focusing on relationships among two or more variables.

Graphical EDA Techniques

Graphical exploratory data analysis uses visual tools to display data and make patterns easier to see. Common examples include:

Box plots: A box plot summarizes the distribution of a variable using values such as the median, quartiles, and potential outliers. Analysts may use box plots to compare distributions across groups, such as monthly water usage across different neighborhoods or customer spending across product categories.
Heatmaps: A heatmap uses color to represent values in a table or matrix. Heatmaps may be used to compare activity by time of day, day of week, or location, or to show where values are unusually high or low.
Histograms: A histogram groups numeric values into intervals, or bins, to show how often values fall within each range. Analysts may use histograms to examine distributions such as age, income, transaction size, lab values, or product delivery times.
Line graphs: A line graph shows how values change across an ordered sequence, often over time. Line graphs can be useful for exploring sales trends, website traffic, disease rates, inventory levels, or model performance over time.
Bar charts: A bar chart compares values across categories. Bar charts can be useful for exploring counts, averages, or rates by group, such as sales by region, survey responses by category, or customer churn by plan type.
Scatterplots: A scatterplot displays the relationship between two numeric variables. Analysts may use scatterplots to look for possible relationships, clusters, or unusual observations. A scatterplot can show whether two variables appear related, but it does not prove that one causes the other.

Pictograms and infographics can help communicate findings to a broad audience, but they are usually better for presentation than for detailed exploratory analysis. During EDA, analysts typically need charts that preserve enough detail to support careful review.
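To make the box plot and histogram descriptions above more concrete, the sketch below computes the numbers those charts are built from instead of drawing them. The `delivery_days` values are hypothetical. The 1.5 × IQR rule for marking potential outliers is a common box plot convention that is assumed here, and the bin edges are arbitrary choices for the example.

```python
import numpy as np

# Hypothetical delivery times in days, invented for illustration.
delivery_days = np.array([2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 21])

# Box plot ingredients: quartiles and the interquartile range (IQR).
q1, median, q3 = np.percentile(delivery_days, [25, 50, 75])
iqr = q3 - q1

# Assumed convention: values more than 1.5 * IQR beyond the quartiles
# are drawn as individual points, that is, potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = (delivery_days < lower_fence) | (delivery_days > upper_fence)
potential_outliers = delivery_days[is_outlier]

# Histogram ingredients: how many values fall into each bin.
counts, bin_edges = np.histogram(delivery_days, bins=[0, 5, 10, 15, 20, 25])

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print("potential outliers:", potential_outliers)
print("histogram counts:", counts)
```

A flagged value such as the 21-day delivery is a prompt for investigation, not an automatic deletion: it may be a data-entry error or a genuine delay worth understanding.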

Non-Graphical EDA Techniques

Non-graphical EDA uses calculations, tables, and summaries to understand a dataset before formal modeling or reporting. Common examples include:

Summary statistics: Measures such as mean, median, minimum, maximum, standard deviation, and percentiles can help analysts understand the center, spread, and range of a variable.
Missing-value checks: Analysts often review the extent of missing data, which fields are affected, and whether missing values follow a pattern.
Frequency tables: Counts and percentages can show how often categories appear, such as product types, customer segments, diagnoses, or survey responses.
Cross-tabulations: Cross-tabs compare two categorical variables, such as customer churn by subscription plan or survey response by age group.
Correlation checks: Correlation can help analysts examine whether two numeric variables move together, though correlation should not be treated as proof of causation.
Duplicate and data-quality checks: EDA often includes looking for duplicate records, inconsistent labels, impossible values, unusual date ranges, or formatting problems.

These non-graphical steps are especially important because visualizations can be misleading when the underlying data contains errors, missing values, duplicates, or inconsistent definitions.
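Several of the non-graphical checks listed above map onto short pandas calls. The sketch below builds a frequency table, a cross-tabulation, and a correlation check on a small `customers` table. The table, its column names, and its values are hypothetical and far too small to support any real conclusion.

```python
import pandas as pd

# Hypothetical customer records, invented for illustration only.
customers = pd.DataFrame(
    {
        "plan": ["basic", "basic", "premium", "premium", "basic", "premium"],
        "churned": ["yes", "no", "no", "no", "yes", "yes"],
        "monthly_fee": [20, 20, 50, 50, 25, 55],
        "support_calls": [5, 1, 0, 1, 4, 6],
    }
)

# Frequency table: counts and shares for one categorical variable.
plan_counts = customers["plan"].value_counts()
plan_shares = customers["plan"].value_counts(normalize=True)

# Cross-tabulation: two categorical variables compared in one table.
churn_by_plan = pd.crosstab(customers["plan"], customers["churned"])

# Correlation check: Pearson correlation between two numeric columns.
fee_calls_corr = customers["monthly_fee"].corr(customers["support_calls"])

print(plan_counts)
print(plan_shares)
print(churn_by_plan)
print(f"correlation between fee and support calls: {fee_calls_corr:.2f}")
```

As with the techniques themselves, these outputs describe patterns in the data; they do not establish why churn happens.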

Common EDA Tools

The tools used for EDA depend on the dataset, the analyst's workflow, and the level of technical work required. Spreadsheets can be useful for smaller datasets and quick reviews. SQL is often used to query databases, filter records, join tables, and create summary tables. Python and R are common choices for more reproducible or code-based analysis.

For example, pandas is an open-source Python library that provides data structures and data analysis tools, and ggplot2 is a widely used R package for creating data visualizations based on the Grammar of Graphics. Business intelligence and dashboarding tools may also support exploratory work. Apache Superset, for example, describes itself as an open-source platform for data exploration and visualization.

The right tool depends on the task. A spreadsheet may be enough for a small dataset. SQL may be the most direct way to explore a database. Python or R may be better for repeatable analysis, statistical work, machine learning preparation, or larger workflows that need to be documented and reused.

Information last updated: June 2026