Feature importance¶
Feature importance is a technique used in machine learning to quantify the relevance of each feature (or variable) to a model's predictions. In other words, it measures how much each feature contributes to the model's output.
Feature importance can be used for several purposes, such as identifying the most relevant features for a specific prediction, understanding the behavior of a model, and selecting the best set of features for a given task. It can also help to identify potential biases or errors in the data used to train the model. It is important to note that feature importance is not a definitive measure of causality. Just because a feature is identified as important does not necessarily mean that it is causing the outcome. Other factors, such as confounding variables, may also be at play.
The method used to calculate feature importance varies with the type of machine learning model, since different models have different assumptions and characteristics that affect the calculation. For example, tree-based models such as random forest and gradient boosting typically use mean decrease in impurity (MDI) or permutation feature importance.
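As an illustration of the permutation approach, the following sketch uses scikit-learn's `permutation_importance` on synthetic data. All data and variable names here are illustrative, not part of skforecast.

# Permutation feature importance (illustrative sketch, synthetic data)
# ==============================================================================
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(123)
X = rng.normal(size=(200, 3))
y = X[:, 0] + 0.1 * rng.normal(size=200)  # only the first feature matters

model = RandomForestRegressor(random_state=123).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=123)
print(result.importances_mean)  # score drop when each feature is shuffled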
Linear regression models typically use coefficients or standardized coefficients to determine feature importance. The magnitude of a coefficient reflects the strength of the relationship between the feature and the target variable, and its sign reflects the direction.
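A minimal sketch of standardized coefficients with plain scikit-learn (again on synthetic, illustrative data): once the features share one scale, coefficient magnitudes become comparable.

# Standardized coefficients (illustrative sketch, synthetic data)
# ==============================================================================
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(123)
X = rng.normal(size=(100, 3)) * [1.0, 10.0, 100.0]  # features on different scales
y = 2.0 * X[:, 0] + 0.05 * X[:, 2] + rng.normal(size=100)

X_std = StandardScaler().fit_transform(X)
model = LinearRegression().fit(X_std, y)
print(model.coef_)  # magnitudes now comparable across features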
The importance of the predictors included in a forecaster can be obtained using the method `get_feature_importance`. This method accesses the `coef_` and `feature_importances_` attributes of the internal regressor.
Warning

The `get_feature_importance` method will only provide values if the forecaster's regressor has either the `coef_` or `feature_importances_` attribute, which are standard in scikit-learn. If your regressor does not follow this naming convention, please consider opening an [issue on GitHub](https://github.com/JoaquinAmatRodrigo/skforecast/issues) and we will strive to include it in future updates.

See also

SHAP values in skforecast models

Libraries¶
# Libraries
# ==============================================================================
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.ForecasterAutoregDirect import ForecasterAutoregDirect
Data¶
# Download data
# ==============================================================================
url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/data/h2o_exog.csv')
data = pd.read_csv(url, sep=',', header=0, names=['date', 'y', 'exog_1', 'exog_2'])
# Data preprocessing
# ==============================================================================
data['date'] = pd.to_datetime(data['date'], format='%Y/%m/%d')
data = data.set_index('date')
data = data.asfreq('MS')
Extract feature importance from trained forecaster¶
# Create and fit forecaster using a RandomForest regressor
# ==============================================================================
forecaster = ForecasterAutoreg(
    regressor = RandomForestRegressor(random_state=123),
    lags      = 5
)
forecaster.fit(y=data['y'], exog=data[['exog_1', 'exog_2']])
# Predictors importance
# ==============================================================================
forecaster.get_feature_importance()
|   | feature | importance |
|---|---------|------------|
| 0 | lag_1   | 0.530186   |
| 1 | lag_2   | 0.100529   |
| 2 | lag_3   | 0.023620   |
| 3 | lag_4   | 0.070458   |
| 4 | lag_5   | 0.063155   |
| 5 | exog_1  | 0.047043   |
| 6 | exog_2  | 0.165009   |
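The returned DataFrame can be plotted directly with the already-imported matplotlib. A quick sketch, assuming the `feature` and `importance` column names shown above:

# Plot predictor importances (sketch)
# ==============================================================================
importances = forecaster.get_feature_importance()
fig, ax = plt.subplots(figsize=(6, 3))
importances.plot.barh(x='feature', y='importance', ax=ax, legend=False)
ax.set_title('Predictor importance (RandomForestRegressor)')
plt.show()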
# Create and fit forecaster using a linear regressor
# ==============================================================================
forecaster = ForecasterAutoreg(
    regressor = Ridge(random_state=123),
    lags      = 5
)
forecaster.fit(y=data['y'], exog=data[['exog_1', 'exog_2']])
forecaster.get_feature_importance()
|   | feature | importance |
|---|---------|------------|
| 0 | lag_1   | 0.327688   |
| 1 | lag_2   | -0.073593  |
| 2 | lag_3   | -0.152202  |
| 3 | lag_4   | -0.217106  |
| 4 | lag_5   | -0.145800  |
| 5 | exog_1  | 0.379798   |
| 6 | exog_2  | 0.668162   |
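Keep in mind that Ridge coefficients depend on the scale of the inputs: the lags share the scale of the series, but the exogenous variables may not. A minimal sketch, standardizing the exogenous variables by hand before fitting so that their coefficients are comparable:

# Standardize exogenous variables before fitting (sketch)
# ==============================================================================
exog = data[['exog_1', 'exog_2']]
exog_std = (exog - exog.mean()) / exog.std()

forecaster.fit(y=data['y'], exog=exog_std)
forecaster.get_feature_importance()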
To retrieve feature importance from a `ForecasterAutoregDirect` model, it is necessary to specify which of its internal models to use. Since `ForecasterAutoregDirect` fits one model per step, and each model may rank features differently, the `step` argument indicates the model whose feature importance should be extracted.
# Create and fit forecaster
# ==============================================================================
forecaster = ForecasterAutoregDirect(
    regressor = Ridge(random_state=123),
    steps     = 10,
    lags      = 5
)
forecaster.fit(y=data['y'], exog=data[['exog_1', 'exog_2']])
# Predictors importance of model for step 1
# ==============================================================================
forecaster.get_feature_importance(step=1)
|   | feature | importance |
|---|---------|------------|
| 0 | lag_1   | 0.326827   |
| 1 | lag_2   | -0.055386  |
| 2 | lag_3   | -0.155098  |
| 3 | lag_4   | -0.220415  |
| 4 | lag_5   | -0.138252  |
| 5 | exog_1  | 0.386103   |
| 6 | exog_2  | 0.635972   |
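Because each step has its own model, the ranking may change from step to step. A small sketch that gathers the importances of all ten models into one DataFrame, reusing the forecaster fitted above:

# Predictors importance of every step model (sketch)
# ==============================================================================
importances = pd.concat(
    [
        forecaster.get_feature_importance(step=step).assign(step=step)
        for step in range(1, 11)  # steps=10 was set when creating the forecaster
    ],
    ignore_index=True
)
importances.head()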