Logistic regression is a supervised learning algorithm used to predict a dependent categorical target variable. In essence, if you have a large set of data that you want to categorize, logistic regression may be able to help.
For example, if you were given a dog and an orange and you wanted to find out whether each of these items was an animal or not, the desired result would be for the dog to end up classified as an animal, and for the orange to be categorized as not an animal. Animal is your target; it is dependent on your data in order to be able to classify the item correctly. In this example, there are only two possible answers (binary logistic regression), animal or not an animal. However, it is also possible to set up your logistic regression with more than two possible categories (multinomial logistic regression).
To dive a little deeper into how your model might attempt to classify these two items directly, let’s consider what else the model would need to know about the items in order to decide where they belong. Other similar aspects of these items would need to be looked at when considering how to classify each item or data point. Aspects, or features, may include color, size, weight, shape, height, volume or amount of limbs. In this way, knowing that an orange’s shape was a circle may help the algorithm to conclude that the orange was not an animal. Similarly, knowing that the orange had zero limbs would help as well.
Logistic regression requires that the dependent variable, in this case whether the item was an animal or not, be categorical. The outcome is either animal or not an animal—there is no range in between. A problem that has a continuous outcome, such as predicting the grade of a student or the fuel tank range of a car, is not a good candidate to use logistic regression. Other options like linear regression may be more appropriate.
While many could easily identify whether an orange is an animal or not—based on previous knowledge of fruit, animals, etc.—the mathematical formula that calculates logistic regression does not have access to this sort of outside information. For this reason, the answers it provides are not definitive; they are probabilistic. The results are calculated based on likelihoods rather than absolute certainties.
There are three main types of logistic regression: binary, multinomial and ordinal. They differ in execution and theory. Binary regression deals with two possible values, essentially: yes or no. Multinomial logistic regression deals with three or more values. And ordinal logistic regression deals with three or more classes in a predetermined order.
Binary logistic regression
Binary logistic regression was mentioned earlier in the case of classifying an object as an animal or not an animal—it’s an either/or solution. There are just two possible outcome answers. This concept is typically represented as a 0 or a 1 in coding. Examples include:
Whether or not to lend to a bank customer (outcomes are yes or no).
Assessing cancer risk (outcomes are high or low).
Will a team win tomorrow’s game (outcomes are yes or no).
Multinomial logistic regression
Multinomial logistic regression is a model where there are multiple classes that an item can be classified as. There is a set of three or more predefined classes set up prior to running the model. Examples include:
Classifying texts into what language they come from.
Predicting whether a student will go to college, trade school or into the workforce.
Does your cat prefer wet food, dry food or human food?
Ordinal logistic regression
Ordinal logistic regression is also a model where there are multiple classes that an item can be classified as; however, in this case an ordering of classes is required. Classes do not need to be proportionate. The distance between each class can vary. Examples include:
Ranking restaurants on a scale of 0 to 5 stars.
Predicting the podium results of an Olympic event.
Assessing a choice of candidates, specifically in places that institute ranked-choice voting.
What Is Logistic Regression Used For?
Here is a more realistic and detailed scenario for when logistic regression might be used:
Logistic regression may be used when predicting whether bank customers are likely to default on their loans. This is a calculation a bank makes when deciding if it will or will not lend to a customer and assessing the maximum amount the bank will lend to those it has already deemed to be creditworthy. In order to make this calculation, the bank will look at several factors. Lend is the target in this logistic regression, and based on the likelihood of default that is calculated, a lender will choose whether to take the risk of lending to each customer.
These factors, also known as features or independent variables, might include credit score, income level, age, job status, marital status, gender, the neighborhood of current residence and educational history.
Logistic regression is also often used for medical research and by insurance companies. In order to calculate cancer risks, researchers would look at certain patient habits and genetic predispositions as predictive factors. To assess whether or not a patient is at a high risk of developing cancer, factors such as age, race, weight, smoking status, drinking status, exercise habits, overall medical history, family history of cancer and place of residence and workplace, accounting for environmental factors, would be considered.
Logistic regression is used in many other fields and is a common tool of data scientists.
As data scientists, one pitfall in statistical analysis to be sure to avoid when selecting which factors to choose for your logistic regression is a high level of correlation between features. If you find, for example, that sourdough bakers who knead their bread more than 9 times out of 10 also allow their loaves to ferment for 24 hours, then there would be no need to include both of these features since they occur at the exact same frequency.
Making Predictions With Logistic Regression
As many of the earlier examples suggest, logistic regression is employed in data science as a supervised machine learning classification model. It can be useful in predicting category trends to within a high range of accuracy. With the example of high risk of cancer versus not high risk of cancer, that prediction could be broken down into more granular categories depending on the researcher’s requirements. As an ordinal logistic regression, it could be changed to high risk of cancer, moderate risk of cancer and low risk of cancer. In this case, low risk of cancer might be set to encapsulate data points that are below 33% risk of cancer, for moderate it might be data points falling in between a 33% and 66% chance of cancer risk, while high risk would then be for cases above 66% risk.
Logistic regression assumptions
Remove highly correlated inputs.
Consider removing outliers in your training set because logistic regression will not give significant weight to them during its calculations.
Does not favor sparse (consisting of a lot of zero values) data.
Logistic regression is a classification model, unlike linear regression.
Logistic Regression vs. Linear Regression
Returning to the example of animal or not animal versus looking at the range or spectrum of possible eye colors is a good starting point in understanding the difference between linear and logistic regression.
While logistic regression is categorical, linear regression is continuous, like lines themselves. According to the Cambridge dictionary, the definition of linear is “consisting of or having to do with lines.” With linear regression, we can make ongoing comparisons and look at questions like how close various blue eye colors are to one specific shade of view. In other words, your target could be sea blue. If it were, abstractly speaking, you would then run your regression against all the other shades of blue and measure their distance in shade or tone from your target sea blue color. Comparing logistic and linear regressions is ultimately a difference in how you sort the data.
Using logistic regression in machine learning, you might look at finding an understanding of which factors will reliably predict students’ test scores for the majority of students in your test sample. Specifically, how likely is test prep to improve SAT scores by a certain percentage. If the linear regression finds on its training set that most people who study for one hour daily boost their scores by 100 points while most people who study for two hours daily boost their score by 200 points and three hours equals 300 points and so on, then it will make the prediction that a certain length of study will increase student scores by a particular number of points. This prediction is derived by drawing a line of best fit through a collection of data points. Some points will exist above or below the line while others will sit directly on top of it. Logistic regression will provide a rate of increase of score based as it exists in relationship to increased study time.
What can be concluded from this logistic regression model’s prediction is that most students who study the above amounts of time will see the corresponding improvements in their scores. However, it’s important to remember that there will be slight variations in results for most students, and a few students will be complete outliers. One student may study for one hour daily and see a 500-point improvement in their score while another student might study for three hours daily and actually see no improvement in their score.
Logistic regression is an algorithm used by professionals in many industries to classify data for several different purposes. From bankers to medical researchers and statisticians to school boards, many who have an interest in being able to better understand their data and better predict trends among their constituents will find logistic regression helpful. It allows scientists and institutions alike to make predictions about future data in advance of that data being available. It works on a majority principle and will not correctly predict outcomes for all items, people or subjects considered. Still, it is quite successful at predicting high odds of accuracy for much of its considered subject group.
If these concepts and capabilities are appealing to you, find out more about the paths available to launch your career in data science and related degree programs by visiting Master’s in Data Science.