Working with scikit-learn alongside pandas DataFrames has long been a source of friction, because the two libraries don’t integrate seamlessly out of the box. By leveraging scikit-learn’s Pipeline functionality, however, we can close that gap. In this post, I’ll walk you through building a scikit-learn Pipeline that works directly with pandas DataFrames, making your machine learning workflows more efficient and intuitive.

Integrating scikit-learn with Pandas DataFrames

Scikit-learn operates primarily on NumPy arrays, which don’t preserve important DataFrame attributes such as feature names and column data types. This lack of integration can make preprocessing and model building cumbersome, especially when dealing with categorical features and missing values.
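
To see the problem concretely, here is a minimal sketch (the toy frame and its column names are invented for illustration) of how a standard scikit-learn transformer returns a bare NumPy array, discarding the DataFrame’s column names and dtypes:

python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# A small made-up frame; any DataFrame exhibits the same behavior
toy_df = pd.DataFrame({"age": [25, 32, 47], "income": [40000, 52000, 61000]})
scaled = StandardScaler().fit_transform(toy_df)
print(type(scaled))  # <class 'numpy.ndarray'> -- column names and dtypes are gone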

To address these challenges, we’ll build a Pipeline with the following objectives:

  1. Apply a ColumnSelector to filter relevant columns from the DataFrame
  2. Use a TypeSelector to differentiate between numerical, categorical, and boolean features
  3. Construct a preprocessing Pipeline to handle missing values, encode categorical features, and scale numerical features
  4. Combine the preprocessing Pipeline with a classifier for model training and evaluation

Example with the Churn Dataset

For our demonstration, we’ll use the churn binary classification dataset from the Penn Machine Learning Benchmarks. This dataset contains 5000 observations with 15 numeric features, 2 binary features, and 2 categorical features.

Let’s start by loading the dataset and setting appropriate column data types.

python
import pmlb

# Load dataset and set column data types
df = pmlb.fetch_data('churn', return_X_y=False)
# Define feature columns, dropping the target and the phone-number identifier
x_cols = [c for c in df.columns if c not in ["target", "phone number"]]
binary_features = ["international plan", "voice mail plan"]
categorical_features = ["state", "area code"]
# Cast dtypes so the TypeSelector below can pick these columns out
df[binary_features] = df[binary_features].astype("bool")
df[categorical_features] = df[categorical_features].astype("category")
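
As a quick, optional sanity check, we can confirm that the casts above produced the dtypes the TypeSelector will rely on later:

python
# Expect a mix of numeric dtypes plus 'category' and 'bool' entries
print(df[x_cols].dtypes.value_counts())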

Building the Pipeline Components

1. Column Selector

python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select a fixed list of columns from a pandas DataFrame."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        try:
            return X[self.columns]
        except KeyError:
            cols_error = list(set(self.columns) - set(X.columns))
            raise KeyError("The DataFrame does not include the columns: %s" % cols_error)
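
A brief usage sketch (the variable names here are illustrative): the selector returns a DataFrame restricted to the requested columns and fails loudly when any are missing:

python
selector = ColumnSelector(columns=x_cols)
X_subset = selector.fit_transform(df)  # DataFrame containing only the x_cols columns
# Requesting a missing column, e.g. ColumnSelector(columns=["no such column"]),
# would raise a KeyError naming the absent columns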

2. Type Selector

python
class TypeSelector(BaseEstimator, TransformerMixin):
    """Select all DataFrame columns of a given dtype (e.g. np.number, 'category', 'bool')."""

    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.select_dtypes(include=[self.dtype])
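
For example, a TypeSelector built with np.number keeps only the numeric columns, while "category" or "bool" picks out the columns we cast earlier (a short sketch, assuming the df loaded above):

python
import numpy as np

numeric_only = TypeSelector(np.number).fit_transform(df[x_cols])
categorical_only = TypeSelector("category").fit_transform(df[x_cols])
print(numeric_only.shape, categorical_only.shape)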

3. Preprocessing Pipeline

python
preprocess_pipeline = make_pipeline(
    ColumnSelector(columns=x_cols),
    FeatureUnion(transformer_list=[
        ("numeric_features", make_pipeline(
            TypeSelector(np.number),
            Imputer(strategy="median"),
            StandardScaler()
        )),
        ("categorical_features", make_pipeline(
            TypeSelector("category"),
            Imputer(strategy="most_frequent"),
            OneHotEncoder()
        )),
        ("boolean_features", make_pipeline(
            TypeSelector("bool"),
            Imputer(strategy="most_frequent")
        ))
    ])
)
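
Running the preprocessing pipeline on its own makes a useful smoke test: the result should be a fully numeric matrix combining the scaled numerics, the one-hot-encoded categoricals, and the booleans (a sketch, assuming the df loaded earlier):

python
X_processed = preprocess_pipeline.fit_transform(df)
# Row count is preserved; the column count grows with the one-hot encoding
print(X_processed.shape)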

Model Training and Evaluation

python
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Hold out a test set (the split parameters here are illustrative)
X_train, X_test, y_train, y_test = train_test_split(
    df[x_cols], df["target"], test_size=0.2, random_state=42
)

classifier_pipeline = make_pipeline(
    preprocess_pipeline,
    SVC(kernel="rbf", random_state=42)
)

param_grid = {
    "svc__gamma": [0.1 * x for x in range(1, 6)]
}

classifier_model = GridSearchCV(classifier_pipeline, param_grid, cv=10)
classifier_model.fit(X_train, y_train)
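
Once the grid search finishes, we can inspect the selected gamma and evaluate on the held-out data (a minimal sketch using the X_test/y_test split created above):

python
print(classifier_model.best_params_)           # e.g. {'svc__gamma': ...}
print(classifier_model.score(X_test, y_test))  # mean accuracy on the test set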

Conclusion

By building a scikit-learn Pipeline with pandas DataFrame-friendly components, we’ve simplified the integration process and created a streamlined workflow for preprocessing and model building. This approach enhances reproducibility, scalability, and readability of machine learning pipelines, ultimately leading to more efficient model development and deployment.