XGBoost Can’t Extrapolate

A common pattern that I observe among inexperienced Data Scientists is the following – they default to XGBoost or a similar Gradient Boosted Model for their problem without any thought as to whether it’s the right choice for the job.

Given how powerful these methods are, this isn’t the most egregious mistake one can make. However, it’s important to know the strengths and weaknesses of your models and tailor your choice to the needs of the problem. Let me give one example.

Despite being a machine learning powerhouse, XGBoost has a big weakness: extrapolation. Extrapolation is the ability of a model to make predictions for data that falls outside the range of what it was trained on. This can be a big problem when you’re working with time-series data or when your data-generating process changes over time, which is surprisingly common in real-world scenarios as the world is ever changing! This inability to extrapolate is shared by all tree-based models that use constant functions between split points, e.g. Random Forest, XGBoost, LightGBM, etc.

But, don’t just trust me. Let me show you via an example.

The Example

Let’s start with a simulated dataset that highlights the problem. We’ll create a binary classification task with a probabilistic target variable and two input features.

Here’s a quick look at how the data is set up:

  • Target: Binary outcome (1 = success, 0 = failure)
  • Price: A variable that drifts over time
  • Day: Day number (1 to 50)

We split the data into a training set (first 20 days) and a test set (last 30 days).

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb

num_days = 50
observations_per_day = 20
start_p = 0.9
end_p = 0.4

data = []
delta_p = (end_p - start_p) / (num_days - 1)

for day in range(num_days):
    # Simulate the daily success probability, drifting from start_p to end_p
    current_p = start_p + (day * delta_p)
    # Draw the binary outcomes for the day
    observations = np.random.binomial(1, current_p, observations_per_day)
    # Make the price track the drifting probability
    price = current_p * 1000
    for obs in observations:
        data.append([day + 1, obs, price])

# Convert to a dataframe
df = pd.DataFrame(data, columns=['Day', 'Target', 'Price'])
df['Intercept'] = 1

Training a Model

We train an XGBoost classifier to predict the probability of success from the input features. After training, we’ll predict probabilities for both the training and test sets and compare them to the actual daily success rates. Let’s see how the model does when extrapolating.

train_df = df[df['Day'] <= 20].copy()
test_df = df[df['Day'] > 20].copy()

# Features and labels (Price and the constant Intercept column are the inputs; Day is held out)
X_train = train_df.drop(columns=['Target', 'Day'])
y_train = train_df['Target']
X_test = test_df.drop(columns=['Target', 'Day'])
y_test = test_df['Target']

print(X_train.head())

# Train the XGBoost model
model = xgb.XGBClassifier(eval_metric='logloss')
model.fit(X_train, y_train)

# Predict probabilities for the training set
y_train_pred_proba = model.predict_proba(X_train)[:, 1]
train_df['Predicted_Prob'] = y_train_pred_proba

# Predict probabilities for the test set
y_test_pred_proba = model.predict_proba(X_test)[:, 1]
test_df['Predicted_Prob'] = y_test_pred_proba

# Calculate the mean of the target variable and predicted probabilities for each day in both sets
daily_means_train = train_df.groupby('Day')['Target'].mean()
daily_probs_train = train_df.groupby('Day')['Predicted_Prob'].mean()
daily_means_test = test_df.groupby('Day')['Target'].mean()
daily_probs_test = test_df.groupby('Day')['Predicted_Prob'].mean()

Extrapolation Issue

XGBoost excels at fitting data within the range it saw during training, but struggles when predicting for values outside this range. In our case, the model is trained on the first 20 days, and when tested on days 21-50, the predicted probabilities don’t follow the actual trend.

We can visualize this by comparing the predicted probabilities against the actual target values over time. The plot below reveals that the model fits the training data well but starts to diverge when it reaches the test data.

# Plot the mean target variable and predicted probabilities per day
plt.figure(figsize=(10, 6))
plt.plot(daily_means_train.index, daily_means_train.values, marker='o', linestyle='-', color='blue', label='Mean Target (Train)')
plt.plot(daily_probs_train.index, daily_probs_train.values, marker='x', linestyle='--', color='green', label='Mean Predicted Probability (Train)')
plt.plot(daily_means_test.index, daily_means_test.values, marker='o', linestyle='-', color='orange', label='Mean Target (Test)')
plt.plot(daily_probs_test.index, daily_probs_test.values, marker='x', linestyle='--', color='red', label='Mean Predicted Probability (Test)')
plt.title('Mean of Target Variable Per Day and Predicted Probabilities')
plt.xlabel('Day')
plt.ylabel('Mean Target / Predicted Probability')
plt.legend()
plt.grid(True)
plt.show()


XGBoost Can’t Extrapolate Well

Notice how the prediction simply remains flat outside of the training range? In many applications, that kind of extrapolation behaviour would be completely unreasonable.
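
To put a number on it, here’s a quick sketch reusing the variables from the snippets above (and scikit-learn’s log_loss) to compare model fit inside and outside the training range; the test-set score should come out noticeably worse:

from sklearn.metrics import log_loss

# Compare fit inside vs. outside the training range;
# the flat extrapolation should show up as a worse test score
train_loss = log_loss(y_train, y_train_pred_proba)
test_loss = log_loss(y_test, y_test_pred_proba)
print(f"Train log loss: {train_loss:.3f}")
print(f"Test log loss:  {test_loss:.3f}")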

Why XGBoost Struggles with Extrapolation

XGBoost builds a series of decision trees from the training data, creating rules that split the feature space into regions, each with a constant prediction. Outside the training range, the constant from the nearest edge region is simply extended outwards. So, if your training data is non-stationary, this constant extrapolation will fail. As you can imagine, there are many scenarios where that is not appropriate, and many of your own problems may suffer from it to some degree. Time-based trends and shifting market dynamics show up in plenty of ML problems, so make sure to watch out for this trap!
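
To see the mechanics in isolation, here’s a minimal sketch (using scikit-learn’s DecisionTreeRegressor purely for illustration, separate from the example above): a single tree fit on a simple linear trend returns the same constant for every input beyond its training range.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Fit a single regression tree on a linear trend over x in [0, 10)
x_in = np.arange(0, 10, 0.1).reshape(-1, 1)
y_in = x_in.ravel()  # y = x
tree = DecisionTreeRegressor(max_depth=3).fit(x_in, y_in)

# Beyond the training range every input lands in the rightmost leaf,
# so the prediction is the same constant no matter how far out we go
x_out = np.array([[12.0], [20.0], [100.0]])
print(tree.predict(x_out))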

What Can You Do Instead?

There are a variety of ways to mitigate this issue. Many people simply retrain the model regularly, which keeps the latest data in the training set so the model rarely has to extrapolate far. Beyond that, there are model choices that can partially or completely extrapolate, linear models being the simplest example (see the sketch below). On time-series problems, people often add carefully constructed feature sets of lags and rolling windows.
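
As one illustration (a minimal sketch reusing train_df and test_df from above, not a recommendation for every problem), a plain logistic regression on Price keeps responding as Price drifts below the training range, because its prediction is a smooth function of the input rather than a piecewise constant:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A linear model in Price extrapolates directionally: as Price keeps
# falling on the test days, the predicted probability keeps falling too
lr = make_pipeline(StandardScaler(), LogisticRegression())
lr.fit(train_df[['Price']], train_df['Target'])

test_df['LR_Prob'] = lr.predict_proba(test_df[['Price']])[:, 1]
print(test_df.groupby('Day')['LR_Prob'].mean().tail())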

Finally, pick a model whose properties match your problem! Gradient boosting models are the default these days, but you don’t need to use them on every problem.
