
LightGBM Demystified: Understanding the Math Behind the Algorithm

  • Writer: Nagesh Singh Chauhan
  • Jun 5, 2024
  • 9 min read

Introduction


While working on a demand forecasting project, I had the opportunity to utilize LightGBM, a state-of-the-art machine learning algorithm that has significantly transformed the way we approach predictive modelling tasks. I was fascinated by the speed at which it trains on thousands of data points while maintaining accuracy that other algorithms struggle to achieve. LightGBM, short for Light Gradient Boosting Machine, is an open-source, distributed, high-performance gradient boosting framework developed by Microsoft.


In the world of data science and machine learning, making accurate predictions is a critical task that can significantly impact business operations and decision-making. Traditional methods often struggle with the intricacies and volume of modern data, leading to suboptimal performance and scalability issues. This is where LightGBM shines, offering a robust solution that combines speed, accuracy, and flexibility.


LightGBM trains faster than Flash :D


In this blog, we will delve into the fundamentals of LightGBM, explore its core features, and understand why it is considered a game-changer in the field of machine learning. Whether you are a seasoned data scientist or a newcomer to the field, this introduction to LightGBM will provide valuable insights into how you can leverage this powerful tool for your own predictive modelling projects.


What is LightGBM?


As noted above, LightGBM is an open-source, distributed, high-performance gradient boosting framework developed by Microsoft. It is designed to be efficient, scalable, and accurate, and it uses decision trees as its base learners while keeping memory usage low.


LightGBM was developed to overcome the computational inefficiencies of traditional Gradient Boosting Machines (GBM), which involve processing all data instances across all features, resulting in substantial computation. To address this, LightGBM introduces two key techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).


Gradient-based One-Side Sampling (GOSS)


Gradient-based One-Side Sampling (GOSS) is a technique used to enhance the efficiency of LightGBM by selectively retaining data instances with large gradients. The rationale behind this is that instances with larger gradients contribute more to the optimization process, while instances with smaller gradients have less impact.


GOSS addresses this computational issue by using gradients as a signal of how much information gain each instance can contribute.


  • Small Gradient: Indicates that the algorithm has already been trained well on this instance, resulting in a small error.

  • Large Gradient: Indicates a large error, meaning this instance will provide more information gain if focused on.


A naive approach would be to ignore instances with small gradients and concentrate only on those with large gradients. However, doing so would alter the data distribution, which could negatively impact the model’s accuracy.


GOSS provides a method to sample data based on gradients while preserving the data distribution.


How it Works:


  • Sorting: Data instances are sorted according to the absolute value of their gradients.

  • Selection: The top a × 100% of instances with the largest gradients are selected.

  • Random Sampling: From the remaining instances, a random sample of size b × 100% is selected.

  • Re-weighting: The random sample of small-gradient instances is re-weighted by a constant factor of (1 − a)/b during the information gain calculation.




Mathematical Explanation:


Let A be the set containing the top a × 100% instances with the largest gradients, and let B be the random sample of size b × 100% drawn from the remaining instances. For a feature j and a candidate split point d, GOSS estimates the variance gain as

$$\tilde{V}_j(d) = \frac{1}{n}\left[\frac{\big(\sum_{x_i \in A_l} g_i + \frac{1-a}{b}\sum_{x_i \in B_l} g_i\big)^2}{n_l^j(d)} + \frac{\big(\sum_{x_i \in A_r} g_i + \frac{1-a}{b}\sum_{x_i \in B_r} g_i\big)^2}{n_r^j(d)}\right]$$

where g_i is the gradient of instance x_i, A_l, B_l (A_r, B_r) are the subsets of A and B falling to the left (right) of the split, and n_l^j(d), n_r^j(d) are the corresponding instance counts. The factor (1 − a)/b rescales the sampled small-gradient instances so that they still represent the full set of small-gradient instances.


GOSS ultimately ensures that the model focuses more on instances causing greater loss (i.e., under-trained instances) while maintaining the overall data distribution, thereby not significantly affecting the accuracy of the learned model.
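
To make the sampling and re-weighting steps concrete, here is a minimal NumPy sketch of the GOSS selection logic described above. It is an illustrative, standalone implementation (the function name and structure are my own, not LightGBM internals), assuming per-instance gradients have already been computed.

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, rng=None):
    """Illustrative GOSS sampling: keep the top-a fraction of instances by
    |gradient|, randomly sample a b fraction of the rest, and re-weight the
    sampled small-gradient instances by (1 - a) / b."""
    rng = np.random.default_rng(rng)
    n = len(gradients)
    top_k = int(a * n)        # number of large-gradient instances to keep
    rand_k = int(b * n)       # number of small-gradient instances to sample

    # Sort instance indices by absolute gradient, largest first.
    order = np.argsort(np.abs(gradients))[::-1]
    top_idx = order[:top_k]                    # always kept
    rest_idx = order[top_k:]                   # candidates for random sampling
    sampled_idx = rng.choice(rest_idx, size=rand_k, replace=False)

    # Weight 1 for large-gradient instances, (1 - a) / b for the sampled rest,
    # so the small-gradient part still represents the original distribution.
    used_idx = np.concatenate([top_idx, sampled_idx])
    weights = np.concatenate([np.ones(top_k), np.full(rand_k, (1.0 - a) / b)])
    return used_idx, weights

# Example: 1,000 instances with random gradients.
grads = np.random.default_rng(0).normal(size=1_000)
idx, w = goss_sample(grads, a=0.2, b=0.1, rng=0)
print(len(idx), w.min(), w.max())   # 300 1.0 8.0
```

In LightGBM itself, GOSS is enabled by setting boosting='goss' (or data_sample_strategy='goss' in newer releases), with top_rate and other_rate playing the roles of a and b here.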


Exclusive Feature Bundling (EFB)


Datasets with many features often have sparse features, meaning lots of zero values. These sparse features are usually mutually exclusive, meaning they don’t have non-zero values at the same time. For example, in one-hot encoded text data, only one column indicating a specific word is non-zero, and all other columns are zero in a particular row.


Exclusive Feature Bundling (EFB) is a technique that combines these mutually exclusive features into a single feature, reducing the dimensionality. EFB uses a greedy algorithm to bundle these features together. This reduction in dimensionality speeds up the training time of Gradient Boosting Decision Trees (GBDT) without significantly affecting accuracy, as the complexity of creating feature histograms depends on the number of bundles rather than the number of features (and the number of bundles is much less than the number of features).


One challenge with EFB is finding the optimal bundles. Researchers at Microsoft addressed this by converting the bundling problem into a graph coloring problem. In this problem, features are represented as vertices, and edges are added between features that are not mutually exclusive. A greedy algorithm is then used to create bundles.


The algorithm also allows bundling features that rarely have non-zero values simultaneously, known as almost mutually exclusive features.


Another challenge is merging features in a way that allows the original feature values to be extracted from the bundled feature. For instance, in a bundle of three features, we need to identify the values of these three features using the value of the bundled feature.


The histogram-based algorithm in LightGBM creates discrete bins for continuous values. To merge features effectively, exclusive values of features in a bundle are placed in different bins. This is achieved by adding offsets to the original feature values, ensuring each feature’s values can still be distinguished within the bundle.


Mathematical Explanation:


Suppose a bundle contains two features: feature A takes values in [0, 10) and feature B takes values in [0, 20). An offset of 10 is added to feature B’s values, so B effectively takes values in [10, 30) and the merged feature ranges over [0, 30). A value of the bundled feature in [0, 10) can only have come from A, while a value in [10, 30) must have come from B (recovered by subtracting the offset), so the original feature values can always be reconstructed from the bundle.
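
As a toy illustration of this offset trick (not LightGBM’s actual bundling code), the sketch below merges two mutually exclusive features into one bundled column and then recovers the original values.

```python
import numpy as np

# Two sparse, mutually exclusive features: whenever one is non-zero,
# the other is zero (e.g., one-hot style columns).
feat_a = np.array([3, 0, 7, 0, 0])   # values in [0, 10)
feat_b = np.array([0, 5, 0, 0, 12])  # values in [0, 20)

# Bundle them by shifting feature B's non-zero values by an offset of 10,
# so the ranges [0, 10) and [10, 30) never overlap inside the bundle.
OFFSET = 10
bundle = np.where(feat_b != 0, feat_b + OFFSET, feat_a)
print(bundle)                        # [ 3 15  7  0 22]

# Recover the originals: values below the offset belong to A,
# values at or above it belong to B (after subtracting the offset).
recovered_a = np.where(bundle < OFFSET, bundle, 0)
recovered_b = np.where(bundle >= OFFSET, bundle - OFFSET, 0)
assert np.array_equal(recovered_a, feat_a) and np.array_equal(recovered_b, feat_b)
```

LightGBM applies the same idea to histogram bin boundaries rather than to raw feature values.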


By combining these two techniques, LightGBM achieves significant improvements in both efficiency and scalability:


  • GOSS ensures that the most informative instances are prioritized during training, thereby speeding up the convergence and reducing memory usage.

  • EFB reduces the number of features by bundling mutually exclusive ones, decreasing the dimensionality and computational cost.


How does LightGBM save time in splitting samples?


In LightGBM, the best split is defined based on maximizing the gain, which measures the reduction in the loss function after a split. LightGBM employs a histogram-based approach to find the best split efficiently. Here’s a detailed explanation of how the best split is defined and calculated in LightGBM:


Histogram-based Approach


1. Binning: Continuous feature values are discretized into a fixed number of bins (histograms). This step reduces the number of possible splits and speeds up the computation.

2. Histogram Construction: For each feature, construct a histogram where each bin contains the sum of the gradient and the Hessian for the samples that fall into that bin.

3. Split Finding: Evaluate the potential splits using the histograms instead of the raw data.
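
The sketch below shows, under simplified assumptions (equal-width bins and a squared-error loss where every Hessian equals 1), how gradient and Hessian histograms can be built for a single feature. It is illustrative only and does not reflect LightGBM’s internal binning, which is smarter than plain equal-width bins.

```python
import numpy as np

def build_histogram(feature, gradients, hessians, n_bins=16):
    """Discretize one feature into equal-width bins and accumulate the
    gradient and Hessian sums per bin (illustrative version)."""
    edges = np.linspace(feature.min(), feature.max(), n_bins + 1)
    # Digitizing against the interior edges yields bin indices in [0, n_bins - 1].
    bins = np.digitize(feature, edges[1:-1])
    grad_hist = np.bincount(bins, weights=gradients, minlength=n_bins)
    hess_hist = np.bincount(bins, weights=hessians, minlength=n_bins)
    return grad_hist, hess_hist, edges

rng = np.random.default_rng(0)
x = rng.normal(size=1_000)    # one continuous feature
g = rng.normal(size=1_000)    # per-instance gradients
h = np.ones(1_000)            # Hessians (constant 1 for squared-error loss)
grad_hist, hess_hist, edges = build_histogram(x, g, h)
print(grad_hist.shape, hess_hist.sum())   # (16,) 1000.0
```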


Gain Calculation


The gain measures the improvement in the loss function resulting from a split. Let G_L, H_L be the sums of gradients and Hessians over the bins that fall to the left of a candidate split, and G_R, H_R the corresponding sums to the right. The gain is calculated as

$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$$

where λ is the L2 regularization term (lambda_l2) and γ is the minimum gain required to make a split (min_split_gain).



The histogram algorithm converts each column of feature values into a histogram: it creates k bins over the value range and places each feature value into its corresponding bin, so that the number of bins is far smaller than the number of distinct feature values. This reduces memory usage and computational complexity. Credits


Steps to Determine the Best Split


1. Initialize:

• Compute the total gradient sum (G) and Hessian sum (H) for the current node.

2. For Each Feature:

• Discretize the feature values into bins.

• Construct histograms for gradients and Hessians.

• For each possible split (bin), calculate the gain using the formula above.

3. Select the Best Split:

• Identify the split with the highest gain across all features and bins.

• This split is chosen as the best split for the current node.
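
Putting these steps together, here is a minimal sketch of how the best split for one feature could be chosen from its gradient/Hessian histograms using the gain formula above. The function name and the simplified handling of regularization are my own assumptions, not LightGBM internals.

```python
import numpy as np

def best_split_from_histogram(grad_hist, hess_hist, lam=1.0, min_gain=0.0):
    """Scan all bin boundaries of one feature and return (best_bin, best_gain).
    The split puts bins [0..k] on the left and bins [k+1..] on the right."""
    G, H = grad_hist.sum(), hess_hist.sum()     # totals for the current node
    parent_score = G * G / (H + lam)

    best_bin, best_gain = -1, 0.0
    G_L = H_L = 0.0
    for k in range(len(grad_hist) - 1):         # last bin cannot start a right child
        G_L += grad_hist[k]
        H_L += hess_hist[k]
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L * G_L / (H_L + lam)
                      + G_R * G_R / (H_R + lam)
                      - parent_score) - min_gain
        if gain > best_gain:
            best_bin, best_gain = k, gain
    return best_bin, best_gain

# Example using arbitrary histograms for a single feature.
grad_hist = np.array([4.0, 3.0, -1.0, -2.5, -3.5])
hess_hist = np.array([10.0, 8.0, 9.0, 7.0, 6.0])
print(best_split_from_histogram(grad_hist, hess_hist))   # split after bin 1
```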


How does LightGBM grow trees?


LightGBM grows trees using a unique approach called leaf-wise (or best-first) growth, which differs from the traditional level-wise (or breadth-first) approach used by many other gradient boosting frameworks.


Here’s an overview of how LightGBM grows trees:


In the leaf-wise growth strategy, LightGBM always splits the leaf with the highest loss reduction (gain). This approach can lead to deeper, more asymmetric trees, but it generally results in better accuracy for the same number of leaves. XGBoost, by contrast, grows trees level-wise by default.


Leaf-wise vs Level-wise growth strategy. Credits


Steps in Leaf-wise Growth:


1. Start with a Single Root Node:

• Begin with all data points in a single root node.

2. Calculate the Potential Splits:

• For each feature, calculate the potential splits and evaluate their gains.

• Gain is calculated as the reduction in loss function (e.g., Mean Squared Error) when the split is made.

3. Select the Best Split:

• Choose the split that provides the highest gain.

• This involves evaluating all potential splits across all features and selecting the one with the maximum gain.

4. Grow the Tree by Splitting the Best Leaf:

• Split the leaf that results in the highest gain, creating two new leaves.

• Update the tree structure accordingly.

5. Repeat the Process:

• Continue to select and split the leaf with the highest gain until a stopping criterion is met (e.g., maximum depth, minimum number of data points in a leaf, or a specified number of leaves).


Stopping Criteria:


• Maximum tree depth

• Minimum data points in a leaf

• Maximum number of leaves

• Minimum gain to split
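
These stopping criteria map directly onto LightGBM parameters; a small illustrative sketch (values are arbitrary placeholders, not recommendations):

```python
# Each stopping criterion above corresponds to a LightGBM parameter.
params = {
    "objective": "regression",
    "max_depth": 8,               # maximum tree depth (-1 means no limit)
    "min_data_in_leaf": 20,       # minimum number of data points in a leaf
    "num_leaves": 31,             # maximum number of leaves per tree
    "min_gain_to_split": 0.0,     # minimum gain required to split a leaf
}
# This dict would then be passed to lgb.train(params, ...) or used as keyword
# arguments for the scikit-learn wrappers.
```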


Gradient boosting methods in LightGBM


LightGBM offers two main boosting techniques: Gradient Boosting Decision Tree (GBDT) and Dropouts meet Multiple Additive Regression Trees (DART).


Gradient Boosting Decision Tree (GBDT)


GBDT is the standard boosting technique used in LightGBM. It builds an ensemble of decision trees sequentially, where each tree is trained to correct the errors of the previous ones.

Gradient boosting decision tree (GBDT). Credits


Key Features:


  • Sequential Training: Trees are built one after another, with each new tree aiming to reduce the residual errors of the combined ensemble of all previous trees.

  • Gradient-based Optimization: The model uses gradients of the loss function to identify the most significant errors and correct them in subsequent trees.

  • Additive Model: The predictions of all trees are added together to form the final prediction.


Process:


1. Initialization: Start with an initial prediction (e.g., the mean for regression).

2. Calculate Residuals: For each data point, compute the residual (the difference between the actual value and the current prediction).

3. Train a Tree: Fit a new decision tree to predict the residuals.

4. Update Model: Update the model by adding the new tree’s predictions, scaled by a learning rate.

5. Repeat: Continue adding trees until the specified number of iterations is reached or another stopping criterion is met.


Mathematical Explanation:


Starting from an initial prediction $F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$ (the mean of the targets for squared error), each iteration m fits a tree $h_m$ to the pseudo-residuals (negative gradients)

$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}$$

and updates the model as $F_m(x) = F_{m-1}(x) + \nu \, h_m(x)$, where ν is the learning rate. After M iterations, the final prediction is $F_M(x)$.
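
To make the procedure concrete, here is a minimal from-scratch gradient boosting sketch for squared-error regression, using scikit-learn decision trees as weak learners. It is a simplified illustration of the GBDT steps above, not LightGBM’s implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Plain GBDT for squared error: each tree is fitted to the residuals
    (negative gradients) of the current ensemble prediction."""
    init = y.mean()                     # step 1: initial prediction
    pred = np.full_like(y, init, dtype=float)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred            # step 2: residuals = negative gradients for MSE
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)          # step 3: fit a tree to the residuals
        pred += learning_rate * tree.predict(X)   # step 4: shrink and add
        trees.append(tree)
    return init, trees

def gbdt_predict(X, init, trees, learning_rate=0.1):
    return init + learning_rate * sum(t.predict(X) for t in trees)

# Tiny usage example on synthetic data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)
init, trees = gbdt_fit(X, y, n_trees=50)
print(np.mean((gbdt_predict(X, init, trees) - y) ** 2))  # training MSE
```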



Dropouts meet Multiple Additive Regression Trees (DART)


DART is an extension of GBDT designed to mitigate overfitting by incorporating dropout techniques, similar to those used in neural networks. It randomly drops trees from the ensemble during training, which helps to regularize the model.



Key Features:


  • Random Dropout: During each iteration, a random subset of trees is dropped, and the new tree is trained based on the remaining trees.

  • Reduced Overfitting: By dropping trees, DART prevents the model from relying too heavily on any particular tree, leading to better generalization.

  • Additive and Dropout: Combines the principles of additive modeling (like GBDT) and dropout regularization (from neural networks).


Process:


1. Initialization: Start with an initial prediction.

2. Calculate Residuals: Compute the residuals for each data point.

3. Random Dropout: Randomly drop a subset of trees from the current ensemble.

4. Train a Tree: Fit a new decision tree to the residuals of the remaining ensemble.

5. Update Model: Add the new tree’s predictions to the current model.

6. Repeat: Continue the process, randomly dropping trees at each iteration, until the specified number of iterations is reached or another stopping criterion is met.


Mathematical Explanation:


If D is the set of k trees dropped at iteration m, the new tree $h_m$ is fitted to the residuals of the reduced ensemble $\sum_{t \notin D} h_t(x)$ rather than of the full ensemble. Because $h_m$ tries to close the gap left by all k dropped trees, adding it back at full scale would overshoot; in the original DART formulation, the new tree is therefore scaled by a factor of $\frac{1}{k+1}$ and each dropped tree by $\frac{k}{k+1}$, so that the overall contribution of the ensemble remains balanced.
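
In LightGBM, DART is selected through the boosting parameter. A minimal usage sketch (parameter values are illustrative only) might look like this:

```python
import lightgbm as lgb
import numpy as np

# Synthetic regression data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 10))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=1_000)

params = {
    "objective": "regression",
    "boosting": "dart",      # use DART instead of plain GBDT
    "drop_rate": 0.1,        # probability of dropping a tree at each iteration
    "skip_drop": 0.5,        # probability of skipping the dropout step entirely
    "learning_rate": 0.1,
    "num_leaves": 31,
    "verbose": -1,
}
train_set = lgb.Dataset(X, label=y)
booster = lgb.train(params, train_set, num_boost_round=100)
print(booster.predict(X[:5]))
```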



LightGBM Core Parameters


LightGBM core parameters are essential settings that influence the behavior and performance of LightGBM models during training. These parameters control various aspects of the model, such as its structure, optimization process, and objective function. Fine-tuning these core parameters is crucial for adapting the model to specific machine learning tasks and achieving optimal performance. Key parameters include learning rate, number of leaves, maximum depth, regularization terms, and optimization strategies.


Examples of Core Parameters:


1. objective: Specifies the loss function to optimize during training. LightGBM supports various objectives such as regression, binary classification, and multiclass classification.

2. task: Defines the task to be performed, either ‘train’ or ‘prediction’. The default value is ‘train’, but it can be set to ‘prediction’ for model inference.

3. num_leaves: Determines the maximum number of leaves in each tree. Higher values allow the model to capture more complex patterns but may lead to overfitting.

4. learning_rate: Controls how much each new tree contributes to the ensemble (the shrinkage, or step size, of the boosting updates). Lower values result in slower learning but can improve generalization.

5. max_depth: Sets the maximum depth of each tree. Higher values enable the model to capture more intricate interactions but may lead to overfitting.

6. min_data_in_leaf: Specifies the minimum number of data points required to form a leaf node. Higher values help prevent overfitting but may result in underfitting.

7. num_iterations: Specifies the number of iterations (trees) to be performed. The default value is 100.

8. feature_fraction: Controls the fraction of features to consider when building each tree. Randomly selecting a subset of features helps improve model diversity and reduce overfitting.

9. bagging_fraction: Specifies the fraction of data to be sampled (without replacement) for each bagging round during training; it takes effect only when bagging_freq is set to a value greater than 0. It helps improve model robustness and reduce variance.

10. lambda_l1 and lambda_l2: Regularization parameters that control L1 and L2 regularization, respectively. These parameters penalize large coefficients to prevent overfitting.

11. min_split_gain: Defines the minimum gain required to split a node further. It helps control the tree’s growth and prevents unnecessary splits.

12. categorical_feature: Specifies the categorical features used for training the model.
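
Below is a hedged end-to-end example showing how several of these core parameters could be set with LightGBM’s native training API. The synthetic dataset and parameter values are placeholders for illustration, not tuned recommendations.

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, purely for illustration.
X, y = make_classification(n_samples=5_000, n_features=20, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

params = {
    "objective": "binary",        # loss function to optimize
    "num_leaves": 31,             # maximum leaves per tree
    "learning_rate": 0.05,        # shrinkage applied to each tree
    "max_depth": -1,              # -1 means no depth limit
    "min_data_in_leaf": 20,       # minimum samples per leaf
    "feature_fraction": 0.8,      # fraction of features considered per tree
    "bagging_fraction": 0.8,      # fraction of data sampled (without replacement)
    "bagging_freq": 5,            # perform bagging every 5 iterations
    "lambda_l1": 0.0,             # L1 regularization
    "lambda_l2": 0.0,             # L2 regularization
    "min_split_gain": 0.0,        # minimum gain required to split
    "verbose": -1,
}

train_set = lgb.Dataset(X_train, label=y_train)
valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)

booster = lgb.train(
    params,
    train_set,
    num_boost_round=200,          # num_iterations
    valid_sets=[valid_set],
    callbacks=[lgb.early_stopping(stopping_rounds=20)],
)
print("best iteration:", booster.best_iteration)
```

The same parameters can also be passed to the scikit-learn wrappers (LGBMClassifier / LGBMRegressor) as keyword arguments.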


Conclusion


In conclusion, LightGBM stands out as a remarkable advancement in the field of machine learning, particularly for tasks involving large datasets and complex models. Its innovative approach to gradient boosting, utilizing techniques such as Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), sets it apart from traditional algorithms by significantly enhancing efficiency and scalability.


The mathematical foundations of LightGBM reveal its robustness and adaptability. By focusing on leaf-wise tree growth, leveraging histograms for split finding, and incorporating efficient sampling methods, LightGBM achieves superior performance while mitigating the risk of overfitting. These mathematical techniques ensure that LightGBM can handle high-dimensional data and intricate patterns, making it a go-to choice for practitioners in various domains.


Understanding the math behind LightGBM not only provides insights into its internal workings but also empowers data scientists and machine learning engineers to fine-tune their models for optimal performance. As we continue to push the boundaries of predictive modeling and data analysis, tools like LightGBM will remain integral to our toolkit, driving innovations and enhancing our ability to derive meaningful insights from data.

