What Is Undersampling?

Undersampling is a technique for balancing uneven datasets by retaining all data from the minority class while reducing the size of the majority class. It is one of several techniques data scientists can use to extract more accurate information from originally imbalanced datasets. Though it has disadvantages, such as the loss of potentially important information, it remains a common and important skill for data scientists.

Undersampling vs. Oversampling for Imbalanced Datasets

Many organizations that collect data end up with imbalanced datasets with one section of the data, a class, having significantly more events than another. The difference between two or more classes is a class imbalance, and imbalanced classifications can be slight or severe. For example, the difference between two classes could be 3:1 or the difference could be 1,000:1. When there are numerous classes within a dataset, not just two, there can be considerable differences in the distribution among them.

Data scientists and machine learning models struggle to obtain accurate information from imbalanced data. An analysis of imbalanced datasets can lead to biased results, particularly in machine learning. Without figuring out an appropriate way to balance the information they have, data scientists may not gain much insight into the issue at hand or make actionable recommendations to their employers.

Two techniques data scientists can use to balance datasets are oversampling and undersampling.

Oversampling is appropriate when data scientists lack sufficient information. One class is abundant, or the majority, and the other is rare, or the minority. In oversampling, the scientist increases the number of rare events. The scientist uses a technique to create artificial events. One technique to create artificial events is the synthetic minority oversampling technique (SMOTE).

Undersampling is appropriate when there is plenty of data for an accurate analysis. The data scientist uses all rare events and reduces the number of abundant events to create two equally sized classes. Typically, scientists randomly delete events from the majority class to achieve the same number of events as inthe minority class.

These methods can be used separately or together because one is not better than the other. Which method a data scientist uses depends on the dataset and analysis.

Undersampling Techniques

There are several undersampling techniques data scientists can use.

Random undersampling involves randomly deleting events from the majority class. Instead of getting rid of events at random, data scientists can use some type of reasoning for keeping or getting rid of some data.

There are undersampling methods focused on which events data scientists should keep.

Data scientists can use one of three near-miss undersampling techniques. In the first technique, they keep events from the majority class that have the smallest average distance to the three closest events from the minority class on a scatter plot. In the second, they use events from the majority class that have the smallest average distance to the three furthest events from the minority class on a scatter plot. In the third, they keep a given number of majority-class events for each minority-class event that is closest on the scatter plot.

Some situations call for undersampling using condensed nearest neighbors (CNN). This technique uses a subset of events. The scientist takes the events in a dataset and adds them to the “store” if they cannot be classified correctly based on the current contents of the store. The store includes all of the events in the minority class and only events from the majority class that cannot be classified correctly.

Other methods are focused on which events to delete.

The Tomek Links method, created by Ivan Tomek in 1976, applies two modifications to CNN. One modification is finding events in pairs, one event from the majority class and the other from the minority class, which have the smallest distance from each other on a scatter plot. These are Tomek Links, known as events on the borderline or noise. This method does not remove much of the majority class, which is why it is often combined with other techniques.

Another method is edited nearest neighbors (ENN). In this technique, scientists examine the three nearest neighbors of each event in a scatter plot. When an event is part of the majority class and is misclassified based on its three nearest neighbors, it is removed. When an event is part of the minority class and misclassified based on its three nearest neighbors, its nearest neighbors in the majority class are removed. This technique can also be combined with other undersampling methods.

Data scientists do not have to rely on a single method, and many have explored combinations of these techniques.

One-sided selection (OSS) combines Tomeks Links and CNN. Data scientists first identify and remove Tomeks Links on the class boundary of the majority class. This removes noisy and borderline majority events. Then they use CNN to delete redundant events from the majority class that are not near the decision boundary.

Neighborhood cleaning rule (NCR) combines CNN and ENN. A data scientist removes duplicate events through CNN. Then, they remove noisy or ambiguous events through ENN. This technique is considered less concerned with balancing the two classes, at the expense of the highest-quality data.

Example of Undersampling in Machine Learning

Consider fraud detection in banking. Banking events, such as debit or credit card transactions or money transfers, are predominantly valid. Only a small percentage of events are fraudulent, resulting in an imbalanced dataset. For the bank or credit card company’s machine learning algorithm to draw conclusions about fraudulent transactions, it has to account for this difference. Failing to resample the imbalanced classes would lead algorithms to skew toward the majority. Any conclusions the business draws from the imbalanced dataset are a waste of time, and any actions it takes based on the analysis will not be effective.

If an organization is trying to improve its ability to identify fraudulent events, imbalanced data could skew its labeling toward incorrectly classifying transactions. The business would end up with many false negatives (a transaction labeled lawful when it is fraudulent) and false positives (a transaction deemed fraudulent when it is valid). Both issues affect the bank or credit card company’s customers and could impact its bottom line.

For banks or credit card companies to gain real insight from imbalanced datasets, they use resampling. Undersampling is one resampling method that can be used alone or in conjunction with oversampling. Through an undersampling technique, businesses remove certain events from the majority class, which consists of non-fraudulent transactions. The goal is to create a balanced dataset that reflects the real world and can most accurately detect fraudulent transactions.

The benefit of using machine learning algorithms to detect banking or credit card fraud is that over time, machine learning recognizes new patterns. The models can adapt, which is particularly important as individuals develop new ways to defraud others. But for a business’s machine learning to become more accurate over time, it must have a strong foundation of a balanced dataset. Any machine learning algorithm is only as good as its data, and imbalanced data will inevitably lead to inaccurate results.

Undersampling Advantages and Disadvantages

The main advantage of undersampling is that data scientists can correct for imbalanced data, reducing the risk that their analysis or machine learning algorithm will skew toward the majority. Without resampling, scientists might achieve 90% accuracy with a classification model. On closer inspection, though, they will find that the results are heavily skewed toward the majority class. This is known as the accuracy paradox.

Simply put, the minority events are harder to predict because there are fewer of them. An algorithm has less information to learn from. But resampling via undersampling can correct this issue and make the minority class equal to the majority class for data analysis.

Other advantages of undersampling include lower storage requirements and faster analysis times. Less data means businesses need less storage and time to gain valuable insights.

During undersampling, data scientists or the machine learning algorithm remove data from the majority class. As a result, scientists can lose potentially important information. Think about the difference in the amount of data in the majority vs. minority classes. The ratio could be 500:1, 1,000:1, 100,000:1, or 1,000,000:1. Removing enough majority events to make the majority class the same size as, or similar to, the minority class results in a significant loss of data.

Loss of potentially important data is particularly pronounced with random undersampling, where events are removed without regard for what they are or how useful they might be to the analysis. Data scientists may address this disadvantage by using a thoughtful and informative undersampling technique. They may also help mitigate the loss of potentially important data by combining undersampling and oversampling techniques. This way, they do not just reduce the majority class but also increase the minority class to reach a balanced dataset.

Another disadvantage of undersampling is that the sample of the majority class chosen could be biased. The sample may not accurately reflect the real world, and the analysis results may be inaccurate.

Because of these disadvantages, some scientists might prefer oversampling. It doesn’t lead to any loss of information, and in some cases, may perform better than undersampling. But oversampling isn’t perfect either. Because oversampling often involves replicating minority events, it can lead to overfitting. To balance these issues, certain scenarios might require a combination of both to obtain the most lifelike dataset and accurate results.

Closing thoughts

Undersampling is a useful method for balancing imbalanced datasets. But it is imperative that data scientists consider the information that may be lost through undersampling. Developing a dataset that behaves like the natural world takes time and care. Data scientists consider the various undersampling techniques, as well as combination undersampling and oversampling methods. They experiment with these methods to determine the most effective way to resample their data and obtain the most accurate results for their employers.

Data science and machine learning are growing fields. According to the Bureau of Labor Statistics, employment for data scientists is expected to increase 34% between 2024 and 2034. The 2020 median pay for data scientists was $112,590 per year. Data scientists work in a variety of industries and locations. If you’re interested in becoming a data scientist, you might want to consider researching data science bootcamp or data science master’s degree to find the best educational path for you.

Information last updated: January 2026