Feature Selection with a Scikit-Learn Pipeline

March 25, 2018

I am a big fan of scikit-learn’s pipelines. Why are pipelines useful?

  • Ensure reproducibility
  • Bundle preprocessing and modeling into a single object that can be exported (e.g., with joblib) for production
  • Enforce structure in preprocessing and hyperparameter search to avoid over-optimistic error estimates (see the sketch below)
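To see why that last point matters, here's a minimal sketch (illustrative names, same estimators as below): because the scaler is a pipeline step, cross-validation re-fits it on each training fold, so no statistics from the held-out fold leak into training.

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Because the scaler is a pipeline step, cross_val_score re-fits it on
# each training fold. Scaling the full data set before splitting would
# leak information from the held-out folds into training.
leak_free_pipe = Pipeline([
    ("std_scaler", StandardScaler()),
    ("ET", ExtraTreesRegressor(random_state=42)),
])
# scores = cross_val_score(leak_free_pipe, X, y, cv=10)  # X, y are loaded below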

Unfortunately, a number of sklearn modules are not well integrated with pipelines, and feature selection is a prime example. No doubt you’ve encountered:

RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes

After a lot of digging, I managed to make feature selection work with a small extension to the Pipeline class.

Before we get started, some details about my setup:

  • Python 3.6.4
  • scikit-learn 0.19.1
  • pandas 0.22.0

First, some boilerplate:

from sklearn import feature_selection
from sklearn import preprocessing
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.pipeline import Pipeline
import numpy as np
import pandas as pd

from pmlb import fetch_data

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style("darkgrid")

Let’s use the 195_auto_price regression data set from the Penn Machine Learning Benchmarks (PMLB). The data set consists of prices for 159 vehicles as well as 15 numeric features about the vehicles; see the PMLB documentation for details.

X, y = fetch_data('195_auto_price', return_X_y=True)

feature_names = (
    fetch_data('195_auto_price', return_X_y=False)
    .drop(labels="target", axis=1)
    .columns
)
feature_names

# Index(['symboling', 'normalized-losses', 'wheel-base', 'length', 'width',
#       'height', 'curb-weight', 'engine-size', 'bore', 'stroke',
#       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
#       'highway-mpg'],
#      dtype='object')

Next, we’ll make a simple pipeline that:

  • Standardizes features to zero mean and unit variance
  • Trains an extremely randomized trees regression model with 250 trees (otherwise default hyperparameters)

pipe = Pipeline(
    [
        ('std_scaler', preprocessing.StandardScaler()),
        ("ET", ExtraTreesRegressor(random_state=42, n_estimators=250))
    ]
)

For this exercise, we’ll select features with recursive feature elimination (RFE), choosing the number of features that minimizes the mean squared error (MSE) under 10-fold cross-validation.

feature_selector_cv = feature_selection.RFECV(pipe, cv=10, step=1, scoring="neg_mean_squared_error")
feature_selector_cv.fit(X, y)

# RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes

BOOM! What happened? RFECV requires its estimator to expose a coef_ or feature_importances_ attribute, and the Pipeline object itself has neither, even though its final ExtraTreesRegressor step does. So what do we do?
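You can confirm the diagnosis with a quick check:

# The fitted Pipeline carries no feature_importances_ of its own, even
# though its final ExtraTreesRegressor step does.
pipe.fit(X, y)
print(hasattr(pipe, "feature_importances_"))                    # False
print(hasattr(pipe.named_steps["ET"], "feature_importances_"))  # True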

To address the problem, we extend the Pipeline class to create a new PipelineRFE class. Each time the pipeline is fit, the feature_importances_ attribute of the final ExtraTreesRegressor step is copied onto the PipelineRFE object, where RFECV can find it.

class PipelineRFE(Pipeline):
    """A Pipeline that exposes the final estimator's feature_importances_."""

    def fit(self, X, y=None, **fit_params):
        super().fit(X, y, **fit_params)
        # Promote the final step's importances so that RFECV can see them.
        self.feature_importances_ = self.steps[-1][-1].feature_importances_
        return self
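As a quick sanity check, fitting the extended pipeline (a throwaway sanity_pipe here, just for illustration) shows that it now carries the attribute RFECV looks for:

# After fit, the PipelineRFE instance itself exposes the importances.
sanity_pipe = PipelineRFE(
    [
        ('std_scaler', preprocessing.StandardScaler()),
        ("ET", ExtraTreesRegressor(random_state=42, n_estimators=250))
    ]
)
sanity_pipe.fit(X, y)
print(sanity_pipe.feature_importances_.shape)  # (15,): one value per feature

The assignment has to happen inside fit because RFECV clones and re-fits its estimator repeatedly as it eliminates features; an attribute assigned once by hand would not survive the cloning.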

So now, let’s rerun the same code as above, but this time we’ll create a PipelineRFE object.

pipe = PipelineRFE(
    [
        ('std_scaler', preprocessing.StandardScaler()),
        ("ET", ExtraTreesRegressor(random_state=42, n_estimators=250))
    ]
)

# Reproducibility note: random_state=42 on the ExtraTreesRegressor and
# RFECV's deterministic (non-shuffled) KFold splits make these results
# repeatable (your results should match mine).

feature_selector_cv = feature_selection.RFECV(pipe, cv=10, step=1, scoring="neg_mean_squared_error")
feature_selector_cv.fit(X, y)

feature_selector_cv.n_features_
# 9

Fantastic! It worked, and 9 features were selected as the optimal number.
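RFECV also exposes a ranking_ attribute: selected features get rank 1, and features eliminated earlier get higher ranks.

# Rank 1 marks the selected features; larger ranks were dropped earlier.
print(dict(zip(feature_names, feature_selector_cv.ranking_)))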

Now, let’s take a look at the cross-validated RMSE scores. (If you’re unfamiliar with sklearn, its scorers follow a higher-is-better convention, so errors like MSE are reported negated; to recover the RMSE we flip the sign and take the square root.) You should get the same values as I did because of the random_state set above.

cv_grid_rmse = np.sqrt(-feature_selector_cv.grid_scores_)
cv_grid_rmse

# array([2981.48089461, 2681.34632497, 2468.32368972, 2402.74458326,
#        2313.48908854, 2314.93752784, 2353.19070845, 2401.63239588,
#        2253.1618682 , 2365.70835985, 2462.56697442, 2475.59320584,
#        2436.7582986 , 2418.22783946, 2440.94259292])

We can also plot the cross-validated RMSE scores to see that 9 features is indeed the optimal count.
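Here’s a quick way to produce the plot:

# Plot CV RMSE against the number of features retained; the curve
# bottoms out at 9 features.
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(cv_grid_rmse) + 1), cv_grid_rmse, marker="o")
plt.xlabel("Number of features selected")
plt.ylabel("Cross-validated RMSE")
plt.show()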

[Figure: cross-validated RMSE versus number of selected features, with the minimum at 9 features]

So which features were selected?

selected_features = feature_names[feature_selector_cv.support_].tolist()
selected_features

# ['wheel-base',
#  'length',
#  'width',
#  'curb-weight',
#  'engine-size',
#  'horsepower',
#  'peak-rpm',
#  'city-mpg',
#  'highway-mpg']

None of the features are surprising, and they all appear to be correlated with vehicle prices. For instance, here’s a scatterplot of horsepower and price:

fetch_data('195_auto_price', return_X_y=False).plot.scatter("horsepower", "target", figsize=(20, 10))

[Figure: scatterplot of horsepower versus price]

And now, to make the pipeline copy/paste friendly, here’s the entire code block:

from sklearn import feature_selection
from sklearn import preprocessing
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.pipeline import Pipeline
import numpy as np
import pandas as pd

from pmlb import fetch_data

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style("darkgrid")


class PipelineRFE(Pipeline):
    """A Pipeline that exposes the final estimator's feature_importances_."""

    def fit(self, X, y=None, **fit_params):
        super().fit(X, y, **fit_params)
        # Promote the final step's importances so that RFECV can see them.
        self.feature_importances_ = self.steps[-1][-1].feature_importances_
        return self

pipe = PipelineRFE(
    [
        ('std_scaler', preprocessing.StandardScaler()),
        ("ET", ExtraTreesRegressor(random_state=42, n_estimators=250))
    ]
)

X, y = fetch_data('195_auto_price', return_X_y=True)
feature_names = (
    fetch_data('195_auto_price', return_X_y=False)
    .drop(labels="target", axis=1)
    .columns
)

# Reproducibility note: random_state=42 on the ExtraTreesRegressor and
# RFECV's deterministic (non-shuffled) KFold splits make these results
# repeatable.

feature_selector_cv = feature_selection.RFECV(pipe, cv=10, step=1, scoring="neg_mean_squared_error")
feature_selector_cv.fit(X, y)

selected_features = feature_names[feature_selector_cv.support_].tolist()
print(feature_selector_cv.n_features_)
print(selected_features)