Linear regression is a supervised learning algorithm that models the relationship between an input (X) variable and an output (Y) variable using labeled data. It’s used for quantifying the relationship between the two variables and predicting future results based on past relationships.
For example, a data science student could build a model to predict the grades earned in a class based on the hours that individual students study. The student inputs a portion of a set of known results as training data, then trains the algorithm by refining its parameters until it delivers results that correspond to the known dataset. The result should be a linear regression equation that can predict future students’ results based on the hours they study.
The equation creates a line, hence the term linear, that best fits the X and Y variables provided. The distance between a point on the graph and the regression line is known as the prediction error. The goal is to create a line that has as few errors as possible.
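The idea of prediction error can be made concrete with a few lines of code. This is a minimal sketch: the data points and the candidate line below are made up for illustration, and least-squares fitting simply searches for the line that makes the sum of squared errors as small as possible.

```python
# Illustrative sketch: measuring prediction error for one candidate line.
# The (x, y) points and the slope/intercept are made-up example values.
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]

slope, intercept = 2.0, 0.0  # a candidate regression line: y' = 2x + 0

# Prediction error (residual) for each point: observed y minus predicted y'
residuals = [y - (slope * x + intercept) for x, y in points]

# Least-squares fitting chooses the line that minimizes this quantity
sse = sum(r ** 2 for r in residuals)

print([round(r, 3) for r in residuals])  # [0.1, -0.1, 0.2, -0.2]
print(round(sse, 3))                     # 0.1
```

A different slope or intercept would produce larger residuals and a larger sum of squared errors; the fitted line is the one for which no other line does better.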
You may also hear the term “logistic regression.” It’s another type of machine learning algorithm, used for binary classification problems on datasets presented in a linear format. It is used when the dependent variable has two categorical options, which must be mutually exclusive. There are usually multiple independent variables, which makes it useful for analyzing complex questions with an “either-or” construction.
Simple Linear Regression
Completing a simple linear regression on a set of data results in a line on a plot representing the relationship between the independent variable X and the dependent variable Y. The simple linear regression predicts the value of the dependent variable based on the independent variable.
For example, compare the time of day and temperature during the morning: the temperature rises steadily as the sun climbs. Over that window, the relationship can be depicted as a straight line on a graph showing how the two variables relate.
Linear regression is a type of supervised learning algorithm in which the data scientist trains the algorithm using a set of training data with correct outputs. You continue to refine the algorithm until it returns results that meet your expectations. The training data allows you to adjust the equation to return results that fit with the known outcomes.
The Linear Regression Equation
The goal of the linear equation is to end up with the line that best fits the data. That means the total prediction error is as small as possible, depicted on the graph as the shortest distance between each data point and the regression line.
The linear regression equation takes the same form as the slope-intercept equation of a line (y = mx + b) that you may have learned previously in algebra or AP statistics.
To begin, determine if there is a relationship between the two variables. Look at the data in x-y format (i.e., two columns of data: independent and dependent variables). Create a scatterplot with the data. Then you can judge if the data roughly fits a line before you attempt the linear regression equation. The equation will help you find the best-fitting line through the data points on the scatterplot.
In simple linear regression, the predictions of Y when plotted as a function of X form a straight line. If the underlying relationship is not linear, a straight line will pass poorly through the plotted points, and a nonlinear model may be a better fit.
The basic formula for a regression line is Y’ = bX + A, where Y’ is the predicted score, b is the slope of the line, and A is the Y-intercept.
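The slope b and intercept A in that formula can be computed directly from the data with the standard least-squares formulas. This is a minimal sketch using made-up numbers; b is the covariance of X and Y divided by the variance of X, and A is chosen so the line passes through the point of means.

```python
# Fitting Y' = bX + A by least squares; the data below is made up.
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Slope: sum of (x - x_mean)(y - y_mean) over sum of (x - x_mean)^2
num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
den = sum((x - x_mean) ** 2 for x in xs)
b = num / den

# Intercept: the fitted line always passes through (x_mean, y_mean)
A = y_mean - b * x_mean

def predict(x):
    """Predicted score Y' for a given X."""
    return b * x + A

print(round(b, 2), round(A, 2))  # 1.97 0.09
```

With the fitted b and A in hand, predicting a new Y value is a single multiplication and addition.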
The R-squared value (the square of the correlation coefficient) will guide you in determining whether the model fits properly. The R-squared value ranges from 0 to 1.0, denoting no linear relationship at the low end (0) and a perfect fit at the high end (1.0).
Linear regression equation in Excel
In a statistics class, you may learn how to calculate linear regressions by hand. In the professional world, linear regression is typically done using software. One of the most common tools is Microsoft Excel. Stephanie Glen offers these basic steps for finding the regression equation and R-squared value in a simple regression analysis in Excel 2013:
- Select the data
- Click the Insert tab and choose a scatter chart
- Insert a plain scatter chart
- Right-click a data point on the chart
- Select “Add Trendline”
- Open the trendline options
- Check “Display Equation on chart”
- Check “Display R-squared value on chart”
Excel also offers statistical calculations for linear regression analyses via its free Analysis ToolPak add-in. A tutorial on how to enable it in Excel can be found on this Microsoft 365 support page.
Linear Regression Example
Linear regression is a useful tool for determining which variables have an impact on factors of interest to an organization.
For a real-world example, let’s look at a dataset of high school and university GPA scores for a set of 105 computer science majors from the Online Stat Book. We can start with the assumption that higher high school GPA scores would correlate with higher university GPA performance. With a linear regression equation, we could predict students’ university GPA based on their high school results.
As we suspected, the scatterplot created with the data shows a strong positive relationship between the two scores. The R-squared value is 0.78, a strong indicator of correlation. This relationship is confirmed visually on the chart, as the data points for university GPAs are clustered tightly around the regression line fitted from high school GPA.
Using the linear equation derived from this dataset, we can predict a student with a high school GPA of 3 would have a university GPA of 3.12.
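The prediction step itself is a one-line calculation. Note that the slope and intercept below are hypothetical values chosen only so the example reproduces the 3.12 prediction quoted above; the actual coefficients would come from fitting the Online Stat Book dataset.

```python
# Hypothetical coefficients for illustration only -- chosen so a high
# school GPA of 3.0 maps to the 3.12 university GPA quoted in the text.
b, A = 0.675, 1.095  # slope and intercept of Y' = bX + A

def predict_university_gpa(high_school_gpa):
    """Predict a university GPA from a high school GPA using the line."""
    return b * high_school_gpa + A

print(round(predict_university_gpa(3.0), 2))  # 3.12
```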
In the world of business, managers use regression analysis of past performance to predict future events. For example, a company’s sales manager believes the company sells more products when it rains, so managers gather sales figures and rainfall totals for the past three years. The y-axis represents the number of sales (the dependent variable), and the x-axis represents total rainfall (the independent variable). The data does show that sales increase when it rains: the regression line indicates that for every inch of rain, the company has experienced five additional sales. However, further analysis is likely necessary to determine with a high degree of certainty which factors actually increase sales for the company.
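The rainfall example can be sketched in a few lines. The slope of five extra sales per inch of rain comes from the example above; the 100-sale baseline (the intercept, i.e., expected sales on a dry day) is a made-up assumption for illustration.

```python
# Sketch of the rainfall example. The slope is taken from the text;
# the baseline (intercept) is a hypothetical value.
SALES_PER_INCH = 5     # five additional sales per inch of rain
BASELINE_SALES = 100   # assumed sales on a day with no rain

def expected_sales(rain_inches):
    """Regression-line estimate of sales for a given rainfall."""
    return BASELINE_SALES + SALES_PER_INCH * rain_inches

print(expected_sales(0))  # 100
print(expected_sales(2))  # 110
```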
Linear Regression Assumptions
It is important for data scientists who use linear regression to understand some of the underlying assumptions of the method. Otherwise, they may draw incorrect conclusions and create faulty predictions that don’t reflect real-world performance.
Four principal assumptions about the data justify the use of linear regression models for prediction or inference of the outcome:
- Linearity and additivity: The expected value of the dependent variable is a straight-line function of the independent variable. The effects of different independent variables are additive to the expected value of the dependent variable.
- Statistical independence: There is no correlation between consecutive errors when using time series data. The observations are independent of each other.
- Homoscedasticity: The errors have a constant variance over time, across the range of predictions, and across the range of any independent variable.
- Normality: For a fixed value of X, Y values are distributed normally.
Data scientists use these assumptions to evaluate models and determine if any data observations will cause problems with the analysis. If the data do not support any of these assumptions, then forecasts rendered from the model may be biased, misleading or, at the very least, inefficient.
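A first pass at checking these assumptions often starts with the residuals. The sketch below uses made-up data and a line assumed to be already fitted; it checks that residuals are centered on zero and that their spread stays roughly constant along X (a rough homoscedasticity check). Real workflows would also use residual plots and formal tests.

```python
import statistics

# Rough residual diagnostics; the data and fitted line are made up.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.2, 3.9, 6.1, 8.0, 9.9, 12.1, 13.8, 16.2]
b, A = 2.0, 0.0  # assume this line was already fitted to the data

residuals = [y - (b * x + A) for x, y in zip(xs, ys)]

# Residuals should be centered on zero...
mean_resid = statistics.mean(residuals)

# ...and their spread should not grow or shrink along X (homoscedasticity):
first_half = statistics.pvariance(residuals[:4])
second_half = statistics.pvariance(residuals[4:])

print(round(mean_resid, 3))
print(round(first_half, 3), round(second_half, 3))
```

A residual mean far from zero, or a variance that clearly grows from one end of X to the other, would signal that one of the assumptions above is violated.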
Advantages and Disadvantages of Linear Regression
Linear regression models are simple to understand and useful for smaller datasets that aren’t overly complex. For small datasets, they can be calculated by hand.
Simple linear regression is useful for finding a relationship between two continuous variables. The formula reveals a statistical relationship, not a deterministic one. In other words, it can express correlation but not causation: it shows how closely the two values are linked, but not whether one variable caused the other. For example, there’s a high correlation between hours studied and grades on a test, but the model can’t explain why students study a given number of hours or why a particular outcome occurs.
Linear regression models also have some disadvantages. They don’t work efficiently with complicated datasets and are difficult to design for nonlinear data. That’s why data scientists recommend starting with exploratory data analysis to examine the data for linear distribution. If there is not an apparent linear distribution in the chart, other methods should be used.
Multiple Linear Regression
There are two types of linear regression: simple and multiple.
So far, we have focused on becoming familiar with simple linear regression. A simple linear regression relies on a single input variable and its relationship with an output variable. However, a more accurate model might consider multiple inputs rather than one.
Take the GPA example from above. To determine a college student’s GPA, the student’s high school GPA was used as the sole input variable. What if we considered using the number of credits a student takes as another input? Or their age? Or financial assistance?
A combination of multiple inputs like this would lend itself to a multiple linear regression model. The multiple or multivariable linear regression algorithm determines the relationship between multiple input variables and an output variable.
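The same least-squares idea extends to multiple inputs by solving for one coefficient per input plus an intercept. The sketch below assumes NumPy is available and uses made-up values for two hypothetical inputs (high school GPA and credit hours) predicting university GPA.

```python
import numpy as np

# Multiple linear regression sketch; all data below is made up.
# Columns: high school GPA, credit hours taken.
X = np.array([
    [3.0, 12],
    [3.5, 15],
    [2.8, 12],
    [3.9, 16],
    [3.2, 14],
], dtype=float)
y = np.array([3.1, 3.6, 2.9, 3.9, 3.3])  # university GPA (output)

# Add a column of ones so the model also fits an intercept
X1 = np.column_stack([np.ones(len(X)), X])

# Least-squares solution to y ≈ X1 @ coef
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
intercept, b_gpa, b_credits = coef

predicted = X1 @ coef  # fitted values for each student
```

Each coefficient describes the effect of its input while the other inputs are held constant, which is exactly the additivity assumption discussed earlier.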
Multiple linear regression is subject to the same assumptions as simple linear regression, along with additional requirements, such as little or no multicollinearity among the input variables.
Machine learning uses data to make predictions about future events and outcomes. From recommending what streaming movies to watch next to identifying the most efficient routes for trucks, linear regression models may play a role in shaping future technology. Launch your career in data science by visiting Master’s in Data Science to find out more about related degree programs.