You might have heard the word bootstrapping used in business or finance parlance to describe the way that a startup was self-funded and built from the ground up by its founders. That meaning of bootstrapping stems from the phrase “pull yourself up by your bootstraps,” meaning to succeed on your own, without help from anyone else. It harkens back to frontier concepts of self-reliance.
However, according to the linguist Benjamin Zimmer, the true origin of the term bootstrapping was as an ironic statement used to imply that someone had claimed to do something that was actually impossible. The claim made was that a man pulled himself up and over a fence by the force of pulling upward on his own bootstraps. From a scientific perspective, it is pretty clear that you can’t actually lift yourself up in the air and over a tall object by simply yanking on your laces.
Luckily, in the context of statistics and data science, bootstrapping means something more specific and possible. Bootstrapping is a method of inferring results for a population from results found on a collection of smaller random samples of that population, using replacement during the sampling process. This relates back to the original phrase because it reflects the notion that the analysis relies only on resamples of the sample itself to make calculations on, in order to draw conclusions for the larger population.
This method of bootstrapping does require a considerable amount of computational power, but most computers can easily handle that as long as sample sizes and iterations are kept to reasonable proportions. The main benefit is that bootstrapping saves a lot of time during the phase of conducting research when it is too difficult, time-consuming or costly to survey the entire population being looked at.
Replacement and Sampling
In order to better understand bootstrapping, it is helpful to understand what’s meant by replacement and the impact that replacement has on probability. Replacement means that every time an item is drawn from the pool, that same item remains a part of the sample pool that will be drawn from in the next instance. This rule continues to apply for all subsequent samples. If you were to remove the first sample from the sampling pool without placing it back in and then drew the second sample, the items in the second sample would not have had the same probability of being drawn as the items in the first, because the pool would now be smaller. By removing data with each random sampling, the pool available to subsequent samples would continue to shrink.
A relatable example of how sampling with and without replacement works is the game of drawing names out of a hat for raffle prizes. If the person pulling out the names uses replacement by putting a winner’s name back in the hat before drawing names for the next prize, although it is highly unlikely, replacement can allow that same person to win all of the prizes.
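The raffle can be sketched in a few lines of Python using only the standard library. The names and the seed below are made up for illustration; `random.choices` draws with replacement and `random.sample` draws without it.

```python
import random

names = ["Ana", "Ben", "Caro", "Dev", "Eli"]  # hypothetical raffle entries
rng = random.Random(0)  # fixed seed so the draws are repeatable

# With replacement: each name goes back in the hat, so repeats are possible.
with_replacement = rng.choices(names, k=3)

# Without replacement: a drawn name stays out, so every winner is distinct.
without_replacement = rng.sample(names, k=3)

print(with_replacement)
print(without_replacement)
```

With replacement, every draw gives each name the same 1-in-5 chance. Without replacement, the chance for the remaining names rises to 1 in 4 on the second draw and 1 in 3 on the third.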
In data science, the process is slightly more complex than just drawing names out of a hat. As you draw samples, you calculate a statistic on each one and then find the mean of that statistic across all samples. Once you have the statistic for every bootstrapped sample, you can plot the values to understand the shape of their distribution, estimate bias and variance, run hypothesis tests and build confidence intervals. Because each bootstrapped sample is a random draw from data that was itself randomly sampled from the population, we can make inferences about the entire population.
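As a rough illustration of computing a statistic on each bootstrapped sample and then summarizing the results, here is a minimal sketch using only the standard library. The data values, the choice of the median as the statistic, and the helper name `bootstrap_statistics` are all assumptions made for this example.

```python
import random
import statistics

rng = random.Random(42)
# Hypothetical observed sample (made-up values for illustration only).
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9, 3.7, 2.5]

def bootstrap_statistics(values, stat, num_runs, rng):
    """Return stat computed on num_runs resamples drawn with replacement."""
    n = len(values)
    return [stat(rng.choices(values, k=n)) for _ in range(num_runs)]

boot_medians = bootstrap_statistics(data, statistics.median, 1000, rng)

observed = statistics.median(data)
# Bootstrap estimate of bias: mean of the resampled statistic minus the observed one.
bias = statistics.mean(boot_medians) - observed
# Bootstrap standard error: spread of the statistic across resamples.
std_error = statistics.stdev(boot_medians)
print(f"median={observed:.2f} bias={bias:.3f} se={std_error:.3f}")
```

The list `boot_medians` is the bootstrap distribution; plotting it as a histogram is a common way to inspect its shape.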
Bootstrap Sampling in Machine Learning
Building and improving machine learning models make up a good portion of the work that data scientists do.
In a similar way to how bootstrapping is used to infer population results from averaged statistical measures calculated on multiple bags of random samples with replacement, it can be used to infer population results of machine learning models trained on random samples with replacement.
When a machine learning model is built based on bootstrapped data, the model is trained on the bootstrapped data and then tested on the out-of-bag (OOB) data. The OOB data is the portion of the original dataset that was not selected in the bootstrapped sample used for training. Because the model has not seen this data before, testing on it gives a fair assessment of the model’s quality. If the model performs well on this OOB test data, that indicates that it should also perform similarly well on new data that it’s later exposed to.
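The train-on-the-bag, test-on-OOB idea can be sketched with a deliberately tiny model. Everything here is an assumption for illustration: the synthetic one-feature dataset, the nearest-centroid "model" and the variable names. A real project would substitute its own data and estimator.

```python
import random
import statistics

rng = random.Random(7)

# Made-up 1-D dataset: class 0 centred near 0, class 1 centred near 3.
data = [(rng.gauss(0, 1), 0) for _ in range(100)]
data += [(rng.gauss(3, 1), 1) for _ in range(100)]
n = len(data)

# Draw one bootstrapped training set by sampling row indices with replacement.
train_idx = [rng.randrange(n) for _ in range(n)]
chosen = set(train_idx)
oob_idx = [i for i in range(n) if i not in chosen]  # rows this bag never drew

# "Train": the model is just the mean feature value of each class in the bag.
train = [data[i] for i in train_idx]
centroids = {
    label: statistics.mean([x for x, y in train if y == label])
    for label in (0, 1)
}

def predict(x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

# Test on the out-of-bag rows only, which the model never saw in training.
correct = sum(predict(data[i][0]) == data[i][1] for i in oob_idx)
oob_accuracy = correct / len(oob_idx)
print(f"OOB rows: {len(oob_idx)} of {n}, OOB accuracy: {oob_accuracy:.2f}")
```

Repeating this over many bags and averaging the OOB scores gives a steadier estimate than a single bag.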
How often are your data points never chosen? The sizes of your train set and your test set follow from the probability of an item being chosen in a random sample. When a full bag has been drawn, that is, as many draws as there are items in your dataset, approximately two-thirds (0.632) of the items will have been drawn at least once. What remains unchosen, or OOB, is therefore 1 - 0.632 = 0.368, or about one-third of your dataset. You might also see this fraction written as 1/e.
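The 1/e figure comes from the chance that one particular item is missed on every one of n draws with replacement, which is (1 - 1/n)^n and approaches 1/e as n grows. The short check below is only a sketch; the function name, the dataset sizes and the seed are arbitrary choices.

```python
import math
import random

def prob_never_drawn(n):
    """Chance that one given row is missed by all n draws with replacement."""
    return (1 - 1 / n) ** n

# The formula settles toward 1/e as the dataset grows.
for n in (10, 100, 10_000):
    print(n, round(prob_never_drawn(n), 4))

# Empirical check: draw one bootstrapped sample and count the untouched rows.
rng = random.Random(1)
n = 10_000
drawn = {rng.randrange(n) for _ in range(n)}
oob_fraction = 1 - len(drawn) / n
print(round(oob_fraction, 3), round(1 / math.e, 3))
```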
Each bootstrapped sample has the same number of data points as your dataset. For the question of how many times to bootstrap, 1,000 times is often appropriate, and in some cases more can help to find a high level of certainty about the reliability of your statistics. Bootstrapping can also be accomplished with as few as 50 samples.
Implementing Bootstrapping in Python
There are many ways to implement bootstrapping in Python. The resample function from the scikit-learn library can be used. Bootstrapped is a Python library designed specifically for this purpose, and bootstrapping can also be done in Python using pandas. Here is an example of how you can bootstrap a population sample and measure your confidence interval using pandas in Python. The formatted code can be viewed on gist.
def bootstrapped_conf_interval(data, metric, num_runs=1000, conf=.95):
    """Calculate confidence interval for model performance (metric).

    data: list of data points in a format that is acceptable to get_metric
    metric: options include 'accuracy', 'precision', 'recall' and 'f1-score'
    conf: how certain you want to be about the range of possible values of your metric
    num_runs: int designating how many bootstrapped samples of data you would like
    returns: confidence tuple with floats as entries, ex: (.2, .3)
    """
    results =
    # get num_runs bootstrapped samples of unlabeled and labeled data
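For readers who want something runnable, the following is a separate, minimal sketch of a percentile bootstrap confidence interval with pandas. It is not the function from the gist: the name `bootstrap_ci`, the `stat` callable (used here in place of the metric strings and `get_metric`), the `seed` parameter and the sample scores are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def bootstrap_ci(values, stat=np.mean, num_runs=1000, conf=0.95, seed=0):
    """Percentile confidence interval for stat, estimated by bootstrapping."""
    series = pd.Series(values)
    rng = np.random.RandomState(seed)  # one shared generator across all runs
    results = []
    for _ in range(num_runs):
        # frac=1 with replace=True gives a resample the same size as the data.
        sample = series.sample(frac=1, replace=True, random_state=rng)
        results.append(stat(sample))
    alpha = (1 - conf) / 2
    lower, upper = np.percentile(results, [100 * alpha, 100 * (1 - alpha)])
    return float(lower), float(upper)

# Hypothetical per-fold accuracy scores, made up for this example.
scores = [0.71, 0.64, 0.80, 0.75, 0.69, 0.77, 0.73, 0.66, 0.79, 0.72]
lower, upper = bootstrap_ci(scores)
print(f"95% interval for the mean: ({lower:.3f}, {upper:.3f})")
```

Passing a different callable as `stat`, such as `np.median`, reuses the same resampling loop for another statistic.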
Let’s say you wanted to know how many shoes you have. How would you find out? You would probably just count them—finding out the answer to this question does not require bootstrapping. But what if you wanted to know the average number of shoes that everyone in your office owns?
You work for a big technology company, so your office is large. After speaking to HR, you’re able to conclude that there are 1,000 people total in the building and 200 people on your floor. This is starting to feel a little more feasible to you, and you’re thinking that you could theoretically ask each person how many shoes they have. This may still be impractical and time-consuming, though. Instead, you decide to bootstrap it.
On the first day you survey 50 people (without replacement) and record how many shoes each person has. This is your dataset. Instead of repeating this procedure every day, you take those 50 data points and create a whole lot of bootstrapped samples from them. Each bag (one bootstrapped sample set) has 50 randomly chosen samples, with replacement. Using replacement means that some of them may be counted more than once and some are never counted at all. For each of these bags, you measure the mean number of shoes owned.
After you have done this 100 times, you have 100 estimates of the average number of shoes your co-workers own. You can use them to calculate confidence intervals—and be able to conclude something like, “It is 95% likely that the average number of shoes owned by people in this company is between 6 and 10.”
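The shoe survey can be walked through in code. The 50 survey answers below are randomly generated stand-ins, not real data, and the seed and value range are arbitrary assumptions; only the procedure (100 bags of 50 draws with replacement, then a percentile interval) follows the example.

```python
import random
import statistics

rng = random.Random(2024)

# Stand-in survey answers: 50 made-up shoe counts, not real data.
survey = [rng.randint(2, 15) for _ in range(50)]

# Build 100 bags of 50 answers each, drawn with replacement,
# and keep the mean number of shoes in each bag.
bag_means = [
    statistics.mean(rng.choices(survey, k=len(survey))) for _ in range(100)
]

# quantiles(..., n=40) returns cut points every 2.5%, so the first and last
# cut points are the 2.5th and 97.5th percentiles of the bag means.
cuts = statistics.quantiles(bag_means, n=40)
lower, upper = cuts[0], cuts[-1]
print(f"Approximate 95% interval for the mean: {lower:.1f} to {upper:.1f}")
```

With only 100 bags the interval endpoints are fairly noisy; increasing the number of bags toward 1,000 usually stabilizes them.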
Advantages and Disadvantages of Bootstrapping
Michael R. Chernick, author of the book Bootstrap Methods: A Practitioner’s Guide (second edition, published by Wiley in 2007) said this about bootstrapping:
“A reason for using it is that sometimes you can’t rely on parametric assumptions, and in some situations the bootstrap works better than other non-parametric methods. It can be applied to a wide variety of problems including nonlinear regression, classification, confidence interval estimation, bias estimation, adjustment of p-values and time series analysis to name a few.”
Advantages:
Does not require a large sample size. It can be used on small datasets.
Handles outliers well. According to Udemy Tech Blog on Medium, bootstrapping “allows us to handle outliers without arbitrary cutoffs. We can either rely on the unlikeliness of the outlier being sampled or we could use Bayesian bootstrapping to weight the outlier lower than other values without excluding it altogether.”
Disadvantages:
Can require a long computation time.
Results it provides cannot be understood to be correct with 100% certainty. There will be a margin of error.
Can fail when the underlying distribution does not have finite moments, such as a finite variance.
Bootstrapping is one of the many methods and techniques that data scientists use. Particularly useful for assessing the quality of a machine learning model, bootstrapping is a method of inferring results for a population from results found on a collection of smaller random samples of the population, using replacement during the sampling process.