What Is Gradient Descent in Deep Learning?

A crucial part of machine learning is refining and improving the optimization algorithm at its core. One of the easiest optimization algorithms to understand and implement is the gradient descent algorithm.

The gradient descent algorithm is a first-order iterative optimization algorithm that finds a local minimum of a function. In other words, it helps to find the lowest point of a function when that point can’t be calculated analytically, such as with linear algebra.


To learn more about how gradient descent is used in deep learning, continue reading.

Gradient Descent for Machine Learning

Data scientists implement a gradient descent algorithm in machine learning to minimize a cost function. The algorithm is typically run first with training data, and the errors in the model’s predictions are used to update the model’s parameters. This helps to reduce errors on future test data or once the model is live.

For linear regression, the parameters are the coefficients, while in a neural network, the parameters are the weights. The goal is to find the model parameters that minimize the loss, that is, the error between the model’s predictions and the training data.
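
As a minimal sketch of what a cost function looks like, the example below scores a one-coefficient linear model with mean squared error. The function names, the choice of mean squared error, and the toy data are assumptions made for illustration; the article does not prescribe a particular cost function.

```python
def predict(x, weight, bias):
    """Linear model with one coefficient (weight) and an intercept (bias)."""
    return weight * x + bias


def mean_squared_error(xs, ys, weight, bias):
    """Average squared difference between predictions and targets."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += (predict(x, weight, bias) - y) ** 2
    return total / len(xs)


# Toy data generated from y = 2x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# Poorly chosen parameters give a high cost; the true ones give zero.
poor_cost = mean_squared_error(xs, ys, weight=0.0, bias=0.0)
good_cost = mean_squared_error(xs, ys, weight=2.0, bias=1.0)
```

Gradient descent is the search procedure that moves the parameters from a high-cost setting like the first one toward a low-cost setting like the second.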

The term gradient descent refers to the changes to the model that move it down a slope, or gradient, of the cost function toward the lowest possible error value. At each iteration, the algorithm takes a step in the direction of steepest descent, defined by the negative of the gradient.
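
In symbols, a single step can be written as below. The notation is an assumption chosen for illustration: \theta stands for the model parameters, J(\theta) for the cost function, and \alpha for the step size (the learning rate).

```latex
\theta_{t+1} = \theta_t - \alpha \, \nabla_{\theta} J(\theta_t)
```

The minus sign is what makes this a descent: the gradient points uphill, so the parameters move the opposite way.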

The size of the steps is controlled by the learning rate. A higher learning rate produces larger steps and faster progress, but it may overshoot the minimum and be less precise. A low learning rate is more precise but needs more iterations, which is time-consuming on large datasets.

Gradient descent is most appropriate when the parameters can’t be calculated analytically, for example with linear algebra, and must instead be searched for by an optimization algorithm. Even when an analytical solution exists, gradient descent can be cheaper and faster on large problems.
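
The effect of the learning rate can be seen on a very simple function. The sketch below minimizes f(x) = x**2, whose minimum is at 0, with three different learning rates. The function, the starting point, and the specific rates are assumptions picked for illustration.

```python
def descend(learning_rate, start=5.0, steps=50):
    """Minimize f(x) = x**2 by gradient descent and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2.0 * x                # derivative of x**2
        x = x - learning_rate * gradient  # step against the gradient
    return x


small = descend(0.001)  # tiny steps: after 50 steps still far from 0
medium = descend(0.1)   # moderate steps: ends very close to 0
large = descend(1.1)    # oversized steps: overshoots more each time and diverges
```

With the smallest rate the result has barely moved from the starting point, while the largest rate makes every step jump past the minimum to a point farther away than before.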

Gradient Descent Intuition

Intuition helps in understanding gradient descent. The process has to begin somewhere, so initial values of the coefficients are chosen as a starting point on the gradient, often zeros or small random values. Then the cost of the coefficients is evaluated, and new ones are selected with lower costs to begin the process again.

The model learns by beginning at the starting point and taking repeated steps toward the lowest point. Repeating these steps leads toward the parameters that minimize loss and maximize accuracy on the data.

Gradient Descent Procedure

The descent procedure starts with initial values for the coefficient or coefficients of the function. The cost of the coefficients is evaluated by plugging them into the cost function.

Next, the derivative, or slope, of the cost function is calculated at the current point. The slope indicates the direction in which to move the coefficient values to get a lower cost on the next iteration. A specified learning rate parameter (alpha) controls how much the coefficients can change on each update. The procedure repeats until the cost of the coefficients is as close to 0.0 as possible or stops improving.
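
The procedure can be sketched for a one-coefficient linear regression as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: mean squared error as the cost, zero starting values, a hypothetical `tolerance` stopping threshold, and toy data drawn from y = 2x + 1.

```python
def gradient_descent(xs, ys, alpha=0.05, max_iterations=2000, tolerance=1e-12):
    """Fit y = weight * x + bias by gradient descent on mean squared error."""
    weight, bias = 0.0, 0.0  # initial coefficient values
    n = len(xs)
    for _ in range(max_iterations):
        # Evaluate the cost of the current coefficients.
        errors = [(weight * x + bias) - y for x, y in zip(xs, ys)]
        cost = sum(e * e for e in errors) / n
        if cost < tolerance:  # cost is close enough to 0.0, so stop
            break
        # Slope of the cost with respect to each coefficient.
        grad_weight = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_bias = 2.0 / n * sum(errors)
        # Move each coefficient against its slope, scaled by alpha.
        weight -= alpha * grad_weight
        bias -= alpha * grad_bias
    return weight, bias


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
weight, bias = gradient_descent(xs, ys)
```

On this toy data the returned coefficients end up close to 2 and 1, the values used to generate it.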

Gradient descent configurations differ in the number of training examples used to calculate the error before each update. Data scientists choose between these configurations to update their model, and each has its trade-offs in accuracy and efficiency.

There are two main types of gradient descent configurations, batch and stochastic, along with a mini-batch variation that sits between them. Each type has its pros and cons, and the data scientist must understand the differences to be able to select the best approach for the problem at hand.

Batch Gradient Descent for Machine Learning

In a batch gradient descent, the algorithm calculates the error for each example in the training dataset, but the model is only updated after all the training samples have been run through the algorithm. The model is updated in a group or batch.

Each cycle through the complete training dataset is called a training epoch. In the batch gradient descent, the model is updated at the conclusion of each training epoch.
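
A hedged sketch of the batch configuration is shown below. The inner loop only accumulates the error gradient over every example, and the model is updated once at the end of each epoch. The mean squared error cost, the function name, and the toy data are assumptions for illustration.

```python
def batch_gradient_descent(xs, ys, alpha=0.05, epochs=1000):
    """One model update per epoch, using the gradient over all examples."""
    weight, bias = 0.0, 0.0
    updates = 0
    n = len(xs)
    for _ in range(epochs):
        grad_weight_sum, grad_bias_sum = 0.0, 0.0
        # Pass over every example, only accumulating the error gradient.
        for x, y in zip(xs, ys):
            error = (weight * x + bias) - y
            grad_weight_sum += 2.0 * error * x
            grad_bias_sum += 2.0 * error
        # Single update at the conclusion of the epoch.
        weight -= alpha * grad_weight_sum / n
        bias -= alpha * grad_bias_sum / n
        updates += 1
    return weight, bias, updates


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
weight, bias, updates = batch_gradient_descent(xs, ys)
```

The number of updates equals the number of epochs, regardless of how many training examples there are.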

Each pass is used to evaluate how closely the model’s estimates of the target function fit the training dataset. Because it makes far fewer updates, the batch approach is computationally more efficient than the stochastic method.

The lower update frequency means the error gradient is more stable and may offer a more stable convergence on some problems. It’s useful in parallel processing implementations because the calculation of prediction errors and the model updates occur separately. However, with large datasets, running the entire batch can be slow. Also, the stable error gradient may lead to premature convergence on a less-than-optimal set of parameters.

There is a variation called mini-batch gradient descent that divides the training dataset into smaller batches, often between 10 and 1,000 examples selected at random, and updates the model after each one. It’s a compromise between the efficiency of the full batch version and the robustness of the stochastic approach. It is faster because smaller batches are run, and not all training data has to be loaded into memory at once. However, errors must still be accumulated across the examples in each mini-batch, as in batch gradient descent.

The mini-batch gradient descent is the most common form used for machine learning.
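
The mini-batch variation can be sketched as below: the examples are shuffled each epoch, split into small groups, and the model is updated after each group. The batch size of 4 is far smaller than typical sizes and is used only because the toy dataset has 12 points; the function name, the fixed random seed, and the noise-free data are likewise assumptions for illustration.

```python
import random


def minibatch_gradient_descent(xs, ys, batch_size=4, alpha=0.05,
                               epochs=500, seed=0):
    """Update the model after each randomly selected mini-batch."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    weight, bias = 0.0, 0.0
    indices = list(range(len(xs)))
    updates = 0
    for _ in range(epochs):
        rng.shuffle(indices)  # new random batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Accumulate the error gradient across this mini-batch only.
            errors = [(weight * xs[i] + bias) - ys[i] for i in batch]
            grad_weight = 2.0 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_bias = 2.0 * sum(errors) / len(batch)
            weight -= alpha * grad_weight
            bias -= alpha * grad_bias
            updates += 1
    return weight, bias, updates


# 12 noise-free points from y = 2x + 1 (illustrative only).
xs = [i / 4 for i in range(12)]
ys = [2.0 * x + 1.0 for x in xs]
weight, bias, updates = minibatch_gradient_descent(xs, ys)
```

With 12 examples and a batch size of 4, each epoch produces three updates instead of one.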

Stochastic Gradient Descent for Machine Learning

In comparison to the batch approach, the stochastic version calculates the error and updates the model for a single random example in the training dataset at a time. Stochastic gradient descent is often referred to as an online machine learning algorithm. Each iteration uses a single sample and requires a prediction for that sample. Stochastic gradient descent is often used when there is a lot of data.

Stochastic gradient descent is more computationally intensive per epoch because the error is calculated and the model is updated after each instance. It can lead to faster learning for some problems due to the increase in update frequency. The frequent updates also give faster insights into the model’s performance and rate of improvement.

Because the model is updated at every step, it can reach a reasonably good set of parameters before it has completed a full pass over the data. However, despite these benefits, the updates are noisy: the error from a single example is only a rough estimate of the error over the whole dataset, so the cost can jump around from step to step. This makes it hard for the algorithm to settle on the minimum error for the model.
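
A minimal sketch of the stochastic configuration follows, with one update per training example, visited in random order. The function name, the fixed random seed, and the noise-free toy data are assumptions for illustration; on real, noisy data the parameters would keep fluctuating around the minimum rather than settling exactly.

```python
import random


def stochastic_gradient_descent(xs, ys, alpha=0.05, epochs=200, seed=0):
    """Update the model after every single training example."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    weight, bias = 0.0, 0.0
    indices = list(range(len(xs)))
    updates = 0
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in random order
        for i in indices:
            # Error and update from this one example only.
            error = (weight * xs[i] + bias) - ys[i]
            weight -= alpha * 2.0 * error * xs[i]
            bias -= alpha * 2.0 * error
            updates += 1
    return weight, bias, updates


# 12 noise-free points from y = 2x + 1 (illustrative only).
xs = [i / 4 for i in range(12)]
ys = [2.0 * x + 1.0 for x in xs]
weight, bias, updates = stochastic_gradient_descent(xs, ys)
```

Here the number of updates is the number of epochs times the number of examples, which is where the higher update frequency comes from.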

Tips and Tricks for Gradient Descent


To use the gradient descent algorithm for machine learning, take advantage of some tips and tricks:

  • Plot Cost vs Time: Collect and plot the cost values calculated by the algorithm for each iteration. If the gradient descent is running well, you will see a decrease in cost in each iteration. If it does not decrease, experiment with a reduced learning rate.
  • Learning Rate: Use a small real value such as 0.1, 0.001 or 0.0001. Try different values for the problem and see which works best.
  • Plot Mean Cost: In the stochastic approach, updates for each training dataset instance may result in a noisy plot of cost over time. Use an average of 10, 100 or 1,000 updates for a better view of the learning trend.
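
The plotting tips above can be sketched without committing to a plotting library: the example records the cost at every stochastic update and then averages it over groups of 10 updates. The helper names, the squared-error cost, and the toy data are assumptions for illustration; either list could be passed to any plotting tool.

```python
import random


def average_over_windows(values, window):
    """Mean of each consecutive, non-overlapping group of `window` values."""
    return [sum(values[i:i + window]) / window
            for i in range(0, len(values) - window + 1, window)]


def sgd_with_cost_history(xs, ys, alpha=0.01, epochs=50, seed=0):
    """Run stochastic gradient descent and record the cost at every update."""
    rng = random.Random(seed)
    weight, bias = 0.0, 0.0
    indices = list(range(len(xs)))
    costs = []
    for _ in range(epochs):
        rng.shuffle(indices)
        for i in indices:
            error = (weight * xs[i] + bias) - ys[i]
            costs.append(error ** 2)  # cost on this example, before the update
            weight -= alpha * 2.0 * error * xs[i]
            bias -= alpha * 2.0 * error
    return costs


xs = [i / 4 for i in range(12)]
ys = [2.0 * x + 1.0 for x in xs]  # generated from y = 2x + 1

costs = sgd_with_cost_history(xs, ys)                # one value per update
smoothed = average_over_windows(costs, window=10)    # one value per 10 updates
```

The raw `costs` list varies from one update to the next, while `smoothed` gives a clearer view of the overall downward trend.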
