Working with scikit-learn alongside pandas DataFrames has long been a source of friction, because the two libraries don’t integrate cleanly out of the box. Scikit-learn’s Pipeline functionality can close much of that gap. In this post, I’ll walk you through building a scikit-learn Pipeline that works directly with pandas DataFrames, making your machine learning workflows more efficient and intuitive.
## Integrating scikit-learn with pandas DataFrames
Scikit-learn operates primarily on NumPy arrays, which don’t preserve important DataFrame metadata such as feature names and column data types. This loss of information can make preprocessing and model building cumbersome, especially when dealing with categorical features and missing values.
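To see the problem concretely, here is a minimal sketch (using a toy DataFrame rather than our real data) of how a transformer silently discards that metadata:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# A toy DataFrame with named, typed columns
toy = pd.DataFrame({"age": [25, 32, 47], "income": [40000, 52000, 71000]})

scaled = StandardScaler().fit_transform(toy)
print(type(scaled))  # <class 'numpy.ndarray'> -- column names and dtypes are gone
```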
To address these challenges, we’ll build a Pipeline with the following objectives:
- Apply a ColumnSelector to filter relevant columns from the DataFrame
- Use a TypeSelector to differentiate between numerical, categorical, and boolean features
- Construct a preprocessing Pipeline to handle missing values, encode categorical features, and scale numerical features
- Combine the preprocessing Pipeline with a classifier for model training and evaluation
## Example with the Churn Dataset
For our demonstration, we’ll use the churn binary classification dataset from the Penn Machine Learning Benchmarks (PMLB). This dataset contains 5000 observations with 15 numeric features, 2 binary features, and 2 categorical features.
Let’s start by loading the dataset and setting appropriate column data types.
```python
import pmlb

# Load the dataset and define the feature columns
df = pmlb.fetch_data('churn', return_X_y=False)
x_cols = [c for c in df.columns if c not in ["target", "phone number"]]
binary_features = ["international plan", "voice mail plan"]
categorical_features = ["state", "area code"]

# Cast dtypes so the TypeSelector below can pick columns up by type
df[binary_features] = df[binary_features].astype("bool")
df[categorical_features] = df[categorical_features].astype("category")
```
## Building the Pipeline Components

### 1. Column Selector
```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select a fixed subset of columns from a pandas DataFrame."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn from the data
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        try:
            return X[self.columns]
        except KeyError:
            cols_error = list(set(self.columns) - set(X.columns))
            raise KeyError("The DataFrame does not include the columns: %s" % cols_error)
```
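As a quick sanity check (assuming the churn DataFrame `df` and the column lists defined above), the selector returns a plain DataFrame and fails loudly when a requested column is missing:

```python
# Returns a DataFrame containing only the two plan columns
ColumnSelector(columns=binary_features).fit_transform(df).head()

# Raises a KeyError naming the missing column
ColumnSelector(columns=["no such column"]).fit_transform(df)
```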
### 2. Type Selector
```python
class TypeSelector(BaseEstimator, TransformerMixin):
    """Select DataFrame columns of a given dtype, e.g. np.number, 'category', or 'bool'."""

    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.select_dtypes(include=[self.dtype])
```
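Because we cast the binary and categorical columns when loading the data, each TypeSelector picks out a disjoint slice of the DataFrame; for example (again assuming `df` from above):

```python
TypeSelector("category").fit_transform(df).columns.tolist()
# ['state', 'area code']

TypeSelector("bool").fit_transform(df).columns.tolist()
# ['international plan', 'voice mail plan']
```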
### 3. Preprocessing Pipeline
```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess_pipeline = make_pipeline(
    ColumnSelector(columns=x_cols),
    FeatureUnion(transformer_list=[
        # Numeric columns: impute missing values with the median, then standardize
        ("numeric_features", make_pipeline(
            TypeSelector(np.number),
            SimpleImputer(strategy="median"),
            StandardScaler()
        )),
        # Categorical columns: impute with the mode, then one-hot encode
        ("categorical_features", make_pipeline(
            TypeSelector("category"),
            SimpleImputer(strategy="most_frequent"),
            OneHotEncoder()
        )),
        # Boolean columns: impute with the mode
        ("boolean_features", make_pipeline(
            TypeSelector("bool"),
            SimpleImputer(strategy="most_frequent")
        ))
    ])
)
```
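Before attaching a classifier, it’s worth confirming that the preprocessing stage runs end to end. A quick check (assuming `df` from above; the exact column count depends on the one-hot encoding):

```python
# Fit the preprocessing steps and transform the full dataset
X_processed = preprocess_pipeline.fit_transform(df)

# One row per observation; the column count grows with the one-hot encoding
print(X_processed.shape)
```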
## Model Training and Evaluation
```python
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Hold out a test set for final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    df[x_cols], df["target"], test_size=0.2, random_state=42)

classifier_pipeline = make_pipeline(
    preprocess_pipeline,
    SVC(kernel="rbf", random_state=42)
)

# make_pipeline names the SVC step "svc", so its gamma is addressed as "svc__gamma"
param_grid = {
    "svc__gamma": [0.1 * x for x in range(1, 6)]
}

classifier_model = GridSearchCV(classifier_pipeline, param_grid, cv=10)
classifier_model.fit(X_train, y_train)
```
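Once the grid search has finished, we can inspect the chosen hyperparameters and score the model on the held-out test set created by the split above:

```python
print(classifier_model.best_params_)           # e.g. {'svc__gamma': 0.1}
print(classifier_model.score(X_test, y_test))  # mean accuracy on the held-out data
```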
## Conclusion
By building a scikit-learn Pipeline with pandas DataFrame-friendly components, we’ve simplified the integration process and created a streamlined workflow for preprocessing and model building. This approach enhances reproducibility, scalability, and readability of machine learning pipelines, ultimately leading to more efficient model development and deployment.