ARIMA is an acronym for “autoregressive integrated moving average.” It’s a model used in statistics and econometrics to analyze events that happen over a period of time. The model is used to understand past data or predict future data in a series. It’s used when a metric is recorded at regular intervals, from fractions of a second to daily, weekly or monthly periods. ARIMA models are identified and fitted using an approach known as the Box-Jenkins method.
Read on to learn more about ARIMA modeling and how it is used in data science.
Introduction to ARIMA
There are two prominent methods of time series prediction: univariate and multivariate.
- Univariate uses only the previous values in the time series to predict future values.
- Multivariate also uses external variables in addition to the series of values to create the forecast.
The ARIMA model predicts a given time series based on its own past values. It can be used for any nonseasonal series of numbers that exhibits patterns and is not a series of random events. For example, sales data from a clothing store would be a time series because it was collected over a period of time. A key characteristic is that the data is collected at constant, regular intervals. A modified version of the model can be created to handle seasonal data.
For a period of multiple seasons, the data must be corrected to account for differences between the seasons. For example, holidays fall on different days of the year, causing a seasonal effect to the data. Sales may be artificially higher or lower depending on where the holiday falls in the calendar. The data scientist must be able to seasonally adjust the data to provide an accurate prediction for future sales.
The ARIMA model is becoming a popular tool for data scientists to employ for forecasting future demand, such as sales forecasts, manufacturing plans or stock prices. In forecasting stock prices, for example, the model reflects the differences between the values in a series rather than measuring the actual values.
Autoregressive Integrated Moving Average Model
To understand ARIMA, it’s helpful to examine the name. The “AR” stands for autoregression, a model in which a changing variable is regressed on its own prior, or lagged, values. In other words, it predicts future values based on past values.
The “I” stands for integrated, which refers to differencing: each value in the series is replaced by the difference between it and the previous value. The goal is to achieve stationary data that is not subject to trend or seasonality. That means the statistical properties of the data series, such as mean, variance and autocorrelation, are constant over time. Data scientists use an Augmented Dickey-Fuller (ADF) test to determine whether the data is stationary.
Finally, “MA” represents the moving average, which is the dependency between an observed value and a residual error from a moving average model applied to previous observations.
An ARIMA model has three component functions: AR (p), the number of lag observations or autoregressive terms in the model; I (d), the number of times the raw observations are differenced; and MA (q), the size of the moving average window. An ARIMA model order is written as (p,d,q), with values for the number of times each function occurs in running the model. Values of zero are acceptable.
The ARIMA model uses differenced data to make the data stationary, meaning its statistical properties are consistent over time. Differencing removes the effect of trends and seasonality, which are common in market and economic data.
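First-order differencing simply replaces each value with its change from the previous value; a minimal pure-Python sketch:

```python
def difference(series, interval=1):
    """Return the differenced series: each value minus the value `interval` steps back."""
    return [series[i] - series[i - interval] for i in range(interval, len(series))]

# A series with a steady upward trend...
trended = [10, 12, 15, 19, 24, 30]
# ...becomes a series of period-to-period changes, with the level trend removed
print(difference(trended))   # [2, 3, 4, 5, 6]
```

For seasonal data, the same function with `interval` set to the season length (e.g. 12 for monthly data with a yearly pattern) performs seasonal differencing.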
Seasonality occurs when data exhibits predictable, repeating patterns. It is critical to control for seasonality because it could impact the accuracy of the results.
ARIMA models can be built using seasonal and nonseasonal formats. A seasonal model must take into account the number of events in each season in addition to the autoregressive, differencing and average terms for each season.
ARIMA models can be built in an array of software tools, including Python. Before deciding on an ARIMA model, the data scientist must confirm that the process in question fits the model. If the data is an appropriate fit for the ARIMA model, the data scientist builds the model and trains it on a dataset before inputting live data to develop and plot a forecast.
Shampoo Sales Dataset
Data scientist Jason Brownlee provided a lesson on developing an ARIMA model using data from shampoo sales on GitHub. The dataset described sales of shampoo over a three-year period. The unit is a sales count, and there were 36 observations recorded.
The dataset has a clear upward trend, so it must undergo differencing to become stationary. In the example, the dataset is from actual sales rather than a simplified training set, so it contains trend and seasonal effects that must be accounted for in the model.
A training dataset of known values can be split into a training set and a test set to fit the model. The data scientist can train the model and generate predictions to compare against the test set of values.
Rolling Forecast ARIMA Model
A rolling forecast is used to compare time series models. A rolling forecast updates for each period. For example, a 10-day rolling sales forecast updates based on the previous day’s sales.
A rolling forecast ARIMA model is a way to deal with data that is not stationary. The data scientist can plot the rolling mean and the rolling standard deviation to judge whether the data series is stationary, and also apply the ADF test described above to check for stationarity.
The rolling forecast takes into account the dependence on observations in prior time periods to run the differencing calculations. The data scientist can create a history of all observations, including the training data and new observations.
The data scientist can also calculate a mean squared error (MSE) score to evaluate the accuracy of a model and compare it with other ARIMA models. The MSE measures the average of the squares of the errors, i.e., the average squared difference between the estimated values and the actual value. The lower the MSE, the better the model. By using the MSE results, the data scientist can refine the model to a better fit or higher level of accuracy given the available dataset.
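The MSE calculation itself is simple enough to write by hand:

```python
def mean_squared_error(actuals, predictions):
    """Average of the squared differences between actual and predicted values."""
    errors = [(a - p) ** 2 for a, p in zip(actuals, predictions)]
    return sum(errors) / len(errors)

# Squared errors are 0.25, 0.0 and 1.0, so the mean is 1.25 / 3
print(mean_squared_error([3.0, 5.0, 7.0], [2.5, 5.0, 8.0]))
```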
Configuring an ARIMA Model
As mentioned previously, the data scientist must first determine if the data is suited to using the ARIMA model. Experts recommend a dataset of at least 50 observations, and 100 is preferred.
There are three basic steps to configure an ARIMA model, according to the Engineering Statistics Handbook:
Identification: Determine whether the data series is stationary and whether there is significant seasonality that should be included in the model. Seasonality can be identified through an autocorrelation plot, a seasonal subseries plot or a spectral plot. The plots and summary statistics help the data scientist understand the amount of differencing and the size of the lag that will be required.
The Autocorrelation Function (ACF) measures the correlation between the observation at the current point in time and the observations at previous points in time, and is typically used to choose the number of MA(q) terms. The Partial Autocorrelation Function (PACF) measures the same correlation after removing the effects of the intervening lags, and is used to choose the number of AR(p) terms. The differencing order d reflects how many times the series must be differenced to become stationary. ACF and PACF plots of the residuals are also used to check for structure remaining in the errors.
Estimation: Estimate the parameters of the model using statistical software programs that perform approaches such as nonlinear least squares and maximum likelihood estimation.
Validation: Check the results of the model against assumptions. If there is a high level of seasonality in the data, the ARIMA model may not be the best option. A more sophisticated seasonal ARIMA model may be required. Validation is an iterative process, and the best models are built through a series of trial and error runs.
The process should be repeated until the desired level of fit is achieved on the test datasets.
Two common errors found in the validation or diagnostic stage are overfitting and residual errors. Overfitting is a sign the model is more complex than necessary and has incorporated random noise from the dataset. Residual errors in forecasting can highlight bias in the model that could lead to inaccurate forecasts.
The ability to forecast a time series is becoming a valuable skill in the public and private sectors.
For data scientists, the ARIMA model is a vital tool for providing accurate forecasts across a wide range of disciplines. For example, a manufacturing company uses an ARIMA model to drive business planning, procurement and production goals. Errors in the forecast could cause significant disruption in the supply chain and production activities of the company. Accurate predictions can help lower costs and meet customer expectations with greater efficiency.
The ARIMA model can also be used for climate studies, such as predicting greenhouse gas concentrations or monitoring the weather patterns that could lead to severe storms, according to the Engineering Statistics Handbook.
The ARIMA model is an incredibly valuable tool for aspiring data scientists. If you choose to study a master’s in data science, you will likely encounter the ARIMA model.