
Author: Nicolas Hug

There are a lot of resources online about gradient boosting, but not many of them explain how gradient boosting relates to gradient descent. This post is an attempt to explain gradient boosting as a (kinda weird) gradient descent.

I’ll assume zero previous knowledge of gradient boosting here, but this post requires a minimal working knowledge of gradient descent.

Let’s get started!

For a given sample \(\mathbf{x}_i\), a gradient boosting regressor yields predictions with the following form:

\[\hat{y}_i = \sum_{m = 1}^{\text{n_iter}} h_m(\mathbf{x}_i),\]

where each \(h_m\) is an instance of a base estimator (often called weak learner, since it usually does not need to be extremely accurate). Since the base estimator is almost always a decision tree, I’ll abusively use the term GBDT (Gradient Boosting Decision Trees) to refer to gradient boosting in general.
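To make the formula concrete, here is a minimal sketch of that prediction rule in Python. The name `gbdt_predict` is my own, `base_estimators` is assumed to be a list of already-fitted regressors with a scikit-learn-style `predict` method, and details such as the learning rate that real implementations apply to each term are left out to match the formula above:

```python
import numpy as np

def gbdt_predict(base_estimators, X):
    # Prediction for each sample: the sum of the base estimators'
    # individual predictions, exactly as in the formula above.
    return np.sum([h_m.predict(X) for h_m in base_estimators], axis=0)
```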

Each of the base estimators \(h_m\) isn’t trying to predict the target \(y_i\). Instead, the base estimators are trying to predict gradients. This sum \(\sum_{m = 1}^{\text{n_iter}} h_m(\mathbf{x}_i)\) is actually performing a gradient descent.
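Here is a minimal sketch of that idea for the least-squares loss, where the negative gradient of the loss with respect to the current predictions happens to be the residual \(y_i - \hat{y}_i\). The function name `fit_gbdt` and its parameters are placeholders of my own, and the shrinkage/learning rate used by real implementations is omitted for brevity:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbdt(X, y, n_iter=100):
    predictions = np.zeros(len(y))   # start from a prediction of 0 for every sample
    base_estimators = []
    for m in range(n_iter):
        # For the least-squares loss, the negative gradient of the loss
        # with respect to the current predictions is just the residual.
        negative_gradient = y - predictions
        h_m = DecisionTreeRegressor(max_depth=3)
        h_m.fit(X, negative_gradient)   # the tree predicts gradients, not y
        predictions += h_m.predict(X)   # one "descent" step on the predictions
        base_estimators.append(h_m)
    return base_estimators
```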

Specifically, it’s a gradient descent in a functional space. This is in contrast to what we’re used to in many other machine learning algorithms (e.g. neural networks or linear regression), where gradient descent is instead performed in the parameter space. Let’s review that briefly.
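For contrast, here is a quick sketch of the familiar parameter-space version, using linear regression with a squared loss (the names and hyperparameters are arbitrary). The quantity updated at each step is a vector of parameters, not a function:

```python
import numpy as np

def parameter_space_gradient_descent(X, y, learning_rate=0.1, n_iter=1000):
    # Ordinary gradient descent: the object being updated is a parameter
    # vector theta, not a function.
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        gradient = X.T @ (X @ theta - y) / len(y)  # gradient of 1/2 * mean squared error
        theta -= learning_rate * gradient          # update the *parameters*
    return theta
```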

Read the full blog post on Nicolas’ blog: Understanding Gradient Boosting as a gradient descent