I’m a big advocate for scikit-learn’s pipelines, and for good reason. They offer several advantages:
- Ensuring reproducibility
- Simplifying model export (e.g., with joblib) for production deployment (see the short example after this list)
- Structuring preprocessing and hyperparameter search to prevent over-optimistic error estimates
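For instance, once a pipeline has been fitted, the whole preprocessing-plus-model object can be persisted and reloaded as a single artifact. This is just a minimal sketch: the `fitted_pipe` variable and the filename are placeholders, and in scikit-learn 0.19 joblib is bundled under `sklearn.externals`.

```python
from sklearn.externals import joblib  # bundled with scikit-learn 0.19; use `import joblib` on newer versions

# Persist a fitted pipeline (placeholder name) as a single artifact...
joblib.dump(fitted_pipe, "auto_price_pipeline.pkl")

# ...and load it back in the serving environment
fitted_pipe = joblib.load("auto_price_pipeline.pkl")
```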
However, one major drawback is the lack of seamless integration with certain scikit-learn modules, particularly feature selection. If you’ve encountered the dreaded `RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes`, you’re not alone.
After extensive research, I’ve found a solution to make feature selection work seamlessly within a scikit-learn pipeline. But before we dive in, here’s some information about my setup:
- Python 3.6.4
- scikit-learn 0.19.1
- pandas 0.22.0
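If you want to double-check that your environment roughly matches, a quick version check works (nothing here beyond the packages already in use):

```python
import sys

import pandas as pd
import sklearn

# Print the interpreter and library versions in use
print(sys.version)
print("scikit-learn:", sklearn.__version__)
print("pandas:", pd.__version__)
```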
Now, let’s jump into the implementation:
```python
from sklearn import feature_selection
from sklearn import preprocessing
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.pipeline import Pipeline

import numpy as np
import pandas as pd

# Assuming pmlb is installed
from pmlb import fetch_data

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style("darkgrid")
```
We’ll use the `195_auto_price` regression dataset from the Penn Machine Learning Benchmarks, consisting of prices for 159 vehicles and 15 numeric features describing them.
```python
# Feature matrix and target vector
X, y = fetch_data('195_auto_price', return_X_y=True)

# Column names, excluding the target
feature_names = (
    fetch_data('195_auto_price', return_X_y=False)
    .drop(labels="target", axis=1)
    .columns
)
```
Next, we’ll create a pipeline that standardizes the features and trains an extremely randomized trees regressor (ExtraTreesRegressor) with 250 trees.
```python
pipe = Pipeline(
    [
        ('std_scaler', preprocessing.StandardScaler()),
        ("ET", ExtraTreesRegressor(random_state=42, n_estimators=250))
    ]
)
```
For feature selection, we’ll use recursive feature elimination with cross-validation (RFECV) to select the number of features that minimizes the mean squared error (MSE) estimated by 10-fold cross-validation.
```python
feature_selector_cv = feature_selection.RFECV(pipe, cv=10, step=1, scoring="neg_mean_squared_error")
feature_selector_cv.fit(X, y)
```
However, fitting this raises the `RuntimeError` from above: the `Pipeline` object doesn’t expose the `coef_` or `feature_importances_` attribute that RFE needs to rank features. To resolve this, we extend the `Pipeline` class with a new `PipelineRFE` class.
```python
class PipelineRFE(Pipeline):

    def fit(self, X, y=None, **fit_params):
        super(PipelineRFE, self).fit(X, y, **fit_params)
        # Surface the final estimator's feature importances so RFE can rank features
        self.feature_importances_ = self.steps[-1][-1].feature_importances_
        return self
```
Now, let’s rerun the code using the `PipelineRFE` object.
```python
pipe = PipelineRFE(
    [
        ('std_scaler', preprocessing.StandardScaler()),
        ("ET", ExtraTreesRegressor(random_state=42, n_estimators=250))
    ]
)

feature_selector_cv = feature_selection.RFECV(pipe, cv=10, step=1, scoring="neg_mean_squared_error")
feature_selector_cv.fit(X, y)
```
Finally, we can analyze the selected features and their corresponding cross-validated RMSE scores.
```python
selected_features = feature_names[feature_selector_cv.support_].tolist()
selected_features
```
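To get the RMSE side of the picture, we can pull the mean cross-validated scores out of the fitted selector. This is a sketch based on the `grid_scores_` attribute that `RFECV` exposes in scikit-learn 0.19 (deprecated in later releases); since we scored with negative MSE, negating and taking the square root yields RMSE:

```python
# grid_scores_[i] holds the mean CV score when keeping (i + 1) features (step=1),
# expressed as negative MSE because of the scoring function we chose
cv_rmse_scores = np.sqrt(-feature_selector_cv.grid_scores_)

rmse_by_n_features = pd.DataFrame({
    "n_features": np.arange(1, len(cv_rmse_scores) + 1),
    "cv_rmse": cv_rmse_scores,
})

# Plot cross-validated RMSE against the number of features kept
ax = sns.pointplot(x="n_features", y="cv_rmse", data=rmse_by_n_features)
ax.set(xlabel="Number of features selected", ylabel="Cross-validated RMSE")
plt.show()
```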
And there you have it! Feature selection with a scikit-learn pipeline made easy. Now you can confidently incorporate feature selection into your machine learning workflows.
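As a parting note, the fitted `RFECV` object is itself usable downstream; `transform` and `predict` are part of its standard API, so for example:

```python
# Keep only the selected columns of the feature matrix
X_selected = feature_selector_cv.transform(X)
print(X_selected.shape)

# Predictions come from the PipelineRFE refit on the selected features
predictions = feature_selector_cv.predict(X)
print(predictions[:5])
```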