Trending articles Archives - Ramhise

Feature Selection with a Scikit-Learn Pipeline

I’m a big advocate for scikit-learn’s pipelines, and for good reason. They offer several advantages:

  • Ensuring reproducibility
  • Simplifying the export of models to JSON for production deployment
  • Structuring preprocessing and hyperparameter search to prevent over-optimistic error estimates

However, one major drawback is the lack of seamless integration with certain scikit-learn modules, particularly feature selection. If you’ve encountered the dreaded RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes, you’re not alone.

After extensive research, I’ve found a solution to make feature selection work seamlessly within a scikit-learn pipeline. But before we dive in, here’s some information about my setup:

  • Python 3.6.4
  • scikit-learn 0.19.1
  • pandas 0.22.0

Now, let’s jump into the implementation:

python
from sklearn import feature_selection
from sklearn import preprocessing
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.pipeline import Pipeline
import numpy as np
import pandas as pd

# Assuming pmlb is installed
from pmlb import fetch_data

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style("darkgrid")

We’ll use the 195_auto_price regression dataset from the Penn Machine Learning Benchmarks, consisting of prices for 159 vehicles and 15 numeric features about the vehicles.

python
X, y = fetch_data('195_auto_price', return_X_y=True)

feature_names = (
    fetch_data('195_auto_price', return_X_y=False)
    .drop(labels="target", axis=1)
    .columns
)

Next, we’ll create a pipeline that standardizes features and trains an extremely randomized tree regression model with 250 trees.

python
pipe = Pipeline(
    [
        ('std_scaler', preprocessing.StandardScaler()),
        ("ET", ExtraTreesRegressor(random_state=42, n_estimators=250))
    ]
)

For feature selection, we’ll use recursive feature elimination (RFE) to select the optimal number of features based on mean squared error (MSE) from 10-fold cross-validation.

python
feature_selector_cv = feature_selection.RFECV(pipe, cv=10, step=1, scoring="neg_mean_squared_error")
feature_selector_cv.fit(X, y)

However, the RuntimeError occurs because the Pipeline object doesn’t expose the coef_ or feature_importances_ attributes that RFE relies on. To resolve this, we extend the Pipeline class with a new PipelineRFE class that surfaces the final estimator’s feature importances.

python
class PipelineRFE(Pipeline):
    """Pipeline that exposes the final estimator's feature_importances_ so RFE/RFECV can use them."""

    def fit(self, X, y=None, **fit_params):
        super(PipelineRFE, self).fit(X, y, **fit_params)
        self.feature_importances_ = self.steps[-1][-1].feature_importances_
        return self

Now, let’s rerun the code using the PipelineRFE object.

python
pipe = PipelineRFE(
    [
        ('std_scaler', preprocessing.StandardScaler()),
        ("ET", ExtraTreesRegressor(random_state=42, n_estimators=250))
    ]
)

# Note: this StratifiedKFold instance is never passed to RFECV below, so it has no effect here;
# with an integer cv and a regression estimator, RFECV falls back to ordinary KFold.
_ = StratifiedKFold(random_state=42)

feature_selector_cv = feature_selection.RFECV(pipe, cv=10, step=1, scoring="neg_mean_squared_error")
feature_selector_cv.fit(X, y)

Finally, we can analyze the selected features and their corresponding cross-validated RMSE scores.

python
selected_features = feature_names[feature_selector_cv.support_].tolist()
selected_features
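
The original code stops at the selected feature names, but the fitted selector also records a cross-validation score for each candidate number of features. The sketch below is an addition of mine and assumes scikit-learn 0.19.x, where RFECV exposes these scores as grid_scores_; with the negated-MSE scoring used above, their square roots give (approximate) cross-validated RMSE values:

python
# grid_scores_[i] holds the CV score (negated MSE) with i + 1 features retained
cv_rmse = np.sqrt(-feature_selector_cv.grid_scores_)

plt.plot(range(1, len(cv_rmse) + 1), cv_rmse)
plt.xlabel("Number of features selected")
plt.ylabel("Cross-validated RMSE")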

And there you have it! Feature selection with a scikit-learn pipeline made easy. Now you can confidently incorporate feature selection into your machine learning workflows.

Adding Dask and Jupyter to a Kubernetes Cluster

Today, we’re diving into setting up Dask and Jupyter on a Kubernetes cluster hosted on AWS. If you haven’t already got a Kubernetes cluster up and running, you might want to check out my previous guide on how to set it up.

Before we start, a YouTube tutorial accompanying this post demonstrates the process of adding Dask and Jupyter to an existing Kubernetes cluster, following the steps below:

Step 1: Install Helm

Helm is like the magic wand for managing Kubernetes packages. We’ll kick off by installing Helm. On Mac OS X, it’s as easy as using brew:

bash
brew update && brew install kubernetes-helm
helm init

Once Helm is initialized, you’ll get a confirmation message stating that Tiller (the server-side component of Helm) has been successfully installed into your Kubernetes Cluster.

Step 2: Install Dask

Now, let’s install Dask using Helm charts. Helm charts are curated application definitions specifically tailored for Helm. First, we need to update the known charts channels and then install the stable version of Dask:

bash
helm repo update
helm install stable/dask

Oops! Looks like we’ve hit a snag. Despite having Dask in the stable Charts channels, the installation failed. The error message hints that we need to grant the serviceaccount API permissions. This involves some Kubernetes RBAC (Role-based access control) configurations.

Thankfully, a StackOverflow post provides us with the solution:

bash
kubectl create serviceaccount --namespace kube-system tiller
kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
kubectl patch deploy --namespace kube-system tiller-deploy -p '{"spec":{"template":{"spec":{"serviceAccount":"tiller"}}}}'
helm init --service-account tiller --upgrade

Let’s give installing Dask another shot:

bash
helm install stable/dask

Voila! Dask is now successfully installed on our Kubernetes cluster. Helm has assigned the deployment the name “running-newt”. You’ll notice various resources such as pods and services prefixed with “running-newt”. The deployment includes a dask-scheduler, a dask-jupyter, and three dask-worker processes by default.

Also, take note of the default Jupyter password: “dask”. We’ll need it to log in to our Jupyter server later.

Step 3: Obtain AWS DNS Entry

Before we can access our deployed Jupyter server, we need to determine the URL. Let’s list all services in the namespace:

bash
kubectl get services

The EXTERNAL-IP column displays hexadecimal values, representing AWS ELB (Elastic Load Balancer) entries. Match the EXTERNAL-IP to the appropriate load balancer in your AWS console (EC2 -> Load Balancers) to obtain the exposed DNS entry.
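
If you prefer the command line to the AWS console, the load balancer’s DNS name can usually be read straight from the Jupyter service. The command below is a sketch; substitute the Jupyter service name that kubectl get services reports for your release (the name shown here is hypothetical):

bash
# Replace the service name with the Jupyter service listed by `kubectl get services`
kubectl get service running-newt-jupyter -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'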

Step 4: Access Jupyter Server

Now, fire up your browser and head over to the Jupyter server using the obtained DNS entry. You’ll be prompted to enter the Jupyter password, which, as we remember, is “dask”. And there you have it – you’re all set to explore Dask and Jupyter on your Kubernetes cluster!

Interpreting Machine Learning Algorithms

Understanding and interpreting machine learning algorithms can be a challenging task, especially when dealing with nonlinear and non-monotonic response functions. These types of functions can exhibit changes in both positive and negative directions, and their rates of change may vary unpredictably with alterations in independent variables. In such cases, the traditional interpretability measures often boil down to relative variable importance measures, offering limited insights into the inner workings of the model.

However, introducing monotonicity constraints can transform these complex models into more interpretable ones: constraining the response to move in a single direction with each input can turn a non-monotonic black box into a highly interpretable model, potentially even one that meets regulatory requirements.
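
As a concrete illustration of what a monotonicity constraint looks like in practice, here is a minimal sketch using XGBoost (my example, not the original post’s; it assumes a reasonably recent xgboost with monotone-constraint support, and the data and parameter values are purely illustrative):

python
import numpy as np
import xgboost as xgb

rng = np.random.RandomState(42)
X = rng.uniform(size=(500, 2))
# The response increases with the first feature and decreases with the second
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=500)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "reg:squarederror",
    "max_depth": 3,
    "eta": 0.1,
    # +1: the fit must be non-decreasing in feature 0; -1: non-increasing in feature 1
    "monotone_constraints": "(1,-1)",
}
booster = xgb.train(params, dtrain, num_boost_round=200)

The fitted booster’s response is then guaranteed to move in a single direction with each constrained feature, which is exactly what makes the model easier to explain.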

Variable importance measures, while commonly used, often fall short in providing detailed insights into the directionality of a variable’s impact on the response function. Instead, they merely indicate the magnitude of a variable’s relationship relative to others in the model.

One observation particularly resonates with many data scientists and machine learning practitioners: understanding a model’s implementation details and validation scores might not be enough to inspire trust in its results among end users. While technical descriptions and standard assessments like cross-validation and error measures may satisfy some audiences, many practitioners need additional techniques to foster trust in, and comprehension of, machine learning models and their outcomes.

In essence, interpreting machine learning algorithms requires going beyond conventional practices. It involves exploring novel techniques and approaches to enhance understanding and build confidence in the models’ predictions and insights.

Setting Up a Kubernetes Cluster on AWS in 5 Minutes

Creating a Kubernetes cluster on AWS may seem like a daunting task, but with the right guidance, it can be accomplished in just a few minutes. Kubernetes, often described as magic, offers a powerful platform for managing containerized applications at scale. In this simplified guide, we’ll walk through the process of setting up a Kubernetes cluster on AWS.

Before we begin, make sure you have an AWS account and the AWS Command Line Interface installed. You’ll also need to configure the AWS CLI with your access key ID and secret access key.

bash
$ aws configure

Now, let’s install the necessary Kubernetes CLI utilities, kops and kubectl. If you’re on Mac OS X, you can use Homebrew for installation:

bash
brew update && brew install kops kubectl

With the utilities installed, we can proceed to set up the Kubernetes cluster. First, create an S3 bucket to store the state of the cluster:

bash
$ aws s3api create-bucket --bucket your-bucket-name --region your-region
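
One caveat that trips people up: outside of us-east-1, the S3 API also expects the region as a location constraint (the region name below is illustrative):

bash
$ aws s3api create-bucket --bucket your-bucket-name --region eu-west-1 \
    --create-bucket-configuration LocationConstraint=eu-west-1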

Enable versioning for the bucket to facilitate reverting or recovering previous states:

bash
$ aws s3api put-bucket-versioning --bucket your-bucket-name --versioning-configuration Status=Enabled

Next, set up two environment variables, KOPS_CLUSTER_NAME and KOPS_STATE_STORE, to define the cluster name and the S3 bucket location for storing state:

bash
export KOPS_CLUSTER_NAME=your-cluster-name
export KOPS_STATE_STORE=s3://your-bucket-name

Now, generate the cluster configuration:

bash
$ kops create cluster --node-count=2 --node-size=t2.medium --zones=your-zone

This command creates the cluster configuration and writes it to the specified S3 bucket. You can edit the cluster configuration if needed:

bash
$ kops edit cluster

Once you’re satisfied with the configuration, build the cluster:

bash
$ kops update cluster --name ${KOPS_CLUSTER_NAME} --yes

After a few minutes, validate the cluster to ensure that the master and nodes have launched successfully:

bash
$ kops validate cluster

Finally, verify that the Kubernetes nodes are up and running:

bash
$ kubectl get nodes

Congratulations! You now have a fully functional Kubernetes cluster running on AWS. To further explore the capabilities of Kubernetes, consider deploying applications such as the Kubernetes Dashboard for managing your cluster with ease. Enjoy your journey into the world of Kubernetes!
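
A closing note that isn’t covered above: the cluster keeps running (and accruing AWS charges) until you delete it, and kops can tear down everything it created in one command:

bash
$ kops delete cluster --name ${KOPS_CLUSTER_NAME} --yes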

I Was on a Machine Learning for Geosciences Podcast

I recently had the pleasure of being a guest on a machine learning podcast called Undersampled Radio, and it was a blast! Hosted by Gram Ganssle and Matt Hall, the podcast delved into various topics surrounding the intersection of machine learning and the geosciences, with a particular focus on the oil and gas industry, where I work at Novi Labs.

During the episode, we covered a range of intriguing subjects:

  1. Introduction: Getting to know each other and setting the stage for the conversation.
  2. Austin Deep Learning: Exploring the machine learning scene in Austin, Texas, where the podcast is based.
  3. Overview of Novi Labs: Discussing the role of Novi Labs in leveraging machine learning for the oil and gas sector.
  4. Predicting Oil and Gas Production: Delving into the complexities and challenges of predicting production in the oil and gas industry using machine learning techniques.
  5. Do we need experts?: Considering the role of domain expertise in conjunction with machine learning algorithms.
  6. AI vs Physics Models: Comparing the strengths and weaknesses of artificial intelligence models with traditional physics-based models.
  7. Karpatne paper: Machine Learning for the Geosciences: Reflecting on the insights and implications of the Karpatne paper regarding the application of machine learning in geosciences.
  8. Answering scientific questions with machine learning: Exploring how machine learning can contribute to answering fundamental scientific questions in geosciences.
  9. What to study in school for machine learning: Offering advice for individuals interested in pursuing a career in machine learning, particularly in the geosciences field.
  10. Puzzle: Engaging in a thought-provoking puzzle or challenge.
  11. What we’re currently reading: Sharing recommendations for interesting books or articles related to machine learning and geosciences.

Overall, it was an enriching and enjoyable experience, and I’m grateful to the hosts for their hospitality and thought-provoking questions. If you’re interested in exploring the fascinating world of machine learning in the geosciences, I highly recommend giving Undersampled Radio a listen!

Autoencoders with Keras

Autoencoders have become an intriguing tool for data compression, and implementing them in Keras is surprisingly straightforward. In this post, I’ll delve into autoencoders, borrowing insights from the Keras blog by Francois Chollet.

Autoencoders, unlike traditional compression methods like JPEG or MPEG, learn a specific lossy compression based on the data examples provided, rather than relying on broad assumptions about images, sound, or video. They consist of three main components:

  1. Encoding function
  2. Decoding function
  3. Loss function

The encoding and decoding functions are typically neural networks, and they need to be differentiable with respect to the loss function to optimize the parameters effectively.

So, what are autoencoders good for?

  1. Data denoising
  2. Dimension reduction
  3. Data visualization

For data denoising, autoencoders offer a nonlinear alternative to methods like PCA, which is linear. Additionally, dimension reduction is a natural outcome of the lossy compression process, aiding in denoising and pre-training for other machine learning algorithms.

Let’s explore the basics of autoencoders using Keras with the following models:

  1. Simple Autoencoder
  2. Deep Autoencoder
  3. Convolutional Autoencoder
  4. A second Convolutional Autoencoder for denoising images

First, let’s set up our environment and load the MNIST dataset for experimentation:

python
from IPython.display import Image, SVG
import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np
import keras
from keras.datasets import mnist
from keras.models import Model, Sequential
from keras.layers import Input, Dense, Conv2D, MaxPooling2D, UpSampling2D, Flatten, Reshape
from keras import regularizers

# Load and scale the MNIST dataset
(x_train, _), (x_test, _) = mnist.load_data()
max_value = float(x_train.max())
x_train = x_train.astype('float32') / max_value
x_test = x_test.astype('float32') / max_value
x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))
x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:])))

Now, let’s dive into the different types of autoencoders. We’ll start with a Simple Autoencoder.
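
Here is a minimal sketch of such a simple autoencoder, along the lines of the Keras blog; the encoding dimension, optimizer, and training settings are illustrative choices rather than the original post’s exact code:

python
# A single fully-connected encoder and decoder trained to reconstruct the flattened digits
input_dim = x_train.shape[1]   # 784 for flattened MNIST
encoding_dim = 32              # size of the compressed representation

input_img = Input(shape=(input_dim,))
encoded = Dense(encoding_dim, activation='relu')(input_img)
decoded = Dense(input_dim, activation='sigmoid')(encoded)

autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

autoencoder.fit(x_train, x_train,
                epochs=20,
                batch_size=256,
                shuffle=True,
                validation_data=(x_test, x_test))

# Reconstructions of the held-out digits
reconstructed = autoencoder.predict(x_test)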

Building Scikit-Learn Pipelines With Pandas DataFrames

Working with scikit-learn alongside pandas DataFrames has often been a source of frustration due to the lack of seamless integration between the two. However, by leveraging scikit-learn’s Pipeline functionality, we can simplify this process significantly. In this post, I’ll walk you through building a scikit-learn Pipeline that seamlessly integrates with pandas DataFrames, making your machine learning workflows more efficient and intuitive.

Integrating scikit-learn with Pandas DataFrames

Scikit-learn operates primarily on numpy arrays, which don’t preserve important DataFrame attributes such as feature names and column data types. This lack of integration can make preprocessing and model building cumbersome, especially when dealing with categorical features and missing values.

To address these challenges, we’ll build a Pipeline with the following objectives:

  1. Apply a ColumnSelector to filter relevant columns from the DataFrame
  2. Use a TypeSelector to differentiate between numerical, categorical, and boolean features
  3. Construct a preprocessing Pipeline to handle missing values, encode categorical features, and scale numerical features
  4. Combine the preprocessing Pipeline with a classifier for model training and evaluation

Example with Churn Dataset

For our demonstration, we’ll use the churn binary classification dataset from the Penn Machine Learning Benchmarks. This dataset contains 5000 observations with 15 numeric features, 2 binary features, and 2 categorical features.

Let’s start by loading the dataset and setting appropriate column data types.

python
import pmlb

# Load dataset and set column data types
df = pmlb.fetch_data('churn', return_X_y=False)
# Cast the binary and categorical columns so the TypeSelector below can route them by dtype
binary_features = ["international plan", "voice mail plan"]
categorical_features = ["state", "area code"]
df[binary_features] = df[binary_features].astype("bool")
df[categorical_features] = df[categorical_features].astype("category")
# Define feature columns
x_cols = [c for c in df.columns if c not in ["target", "phone number"]]

Building the Pipeline Components

1. Column Selector

python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select a fixed subset of columns from a pandas DataFrame."""
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        try:
            return X[self.columns]
        except KeyError:
            cols_error = list(set(self.columns) - set(X.columns))
            raise KeyError("The DataFrame does not include the columns: %s" % cols_error)

2. Type Selector

python
class TypeSelector(BaseEstimator, TransformerMixin):
    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.select_dtypes(include=[self.dtype])
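
As a quick illustration (not in the original post), these selectors also work on their own; for example, pulling out just the numeric columns of the churn DataFrame:

python
import numpy as np

# Returns a DataFrame containing only the numeric columns
numeric_df = TypeSelector(np.number).fit_transform(df)
numeric_df.dtypes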

3. Preprocessing Pipeline

python
import numpy as np
from sklearn.pipeline import make_pipeline, FeatureUnion
from sklearn.preprocessing import Imputer, StandardScaler, OneHotEncoder

preprocess_pipeline = make_pipeline(
    ColumnSelector(columns=x_cols),
    FeatureUnion(transformer_list=[
        ("numeric_features", make_pipeline(
            TypeSelector(np.number),
            Imputer(strategy="median"),
            StandardScaler()
        )),
        ("categorical_features", make_pipeline(
            TypeSelector("category"),
            Imputer(strategy="most_frequent"),
            OneHotEncoder()
        )),
        ("boolean_features", make_pipeline(
            TypeSelector("bool"),
            Imputer(strategy="most_frequent")
        ))
    ])
)

Model Training and Evaluation

python
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# The original train/test split isn't shown here; an 80/20 split is assumed
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(labels="target", axis=1), df["target"], test_size=0.2, random_state=42
)

classifier_pipeline = make_pipeline(
    preprocess_pipeline,
    SVC(kernel="rbf", random_state=42)
)

param_grid = {
    "svc__gamma": [0.1 * x for x in range(1, 6)]
}

classifier_model = GridSearchCV(classifier_pipeline, param_grid, cv=10)
classifier_model.fit(X_train, y_train)
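
To sanity-check the fitted search (an illustrative step not shown above), we can inspect the chosen hyperparameters and score the best pipeline on the held-out test set:

python
print(classifier_model.best_params_)
print(classifier_model.score(X_test, y_test))  # mean accuracy of the best pipeline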

Conclusion

By building a scikit-learn Pipeline with pandas DataFrame-friendly components, we’ve simplified the integration process and created a streamlined workflow for preprocessing and model building. This approach enhances reproducibility, scalability, and readability of machine learning pipelines, ultimately leading to more efficient model development and deployment.

Bye Bye, Jekyll. Hello, Hugo

I’ve decided to bid farewell to Jekyll after encountering one too many Liquid errors and dealing with the Ruby dependency headache. In its place, I’m embracing Hugo—a swift transition that eliminates any excuses hindering my posting schedule.

Here are some key highlights about Hugo:

  • Developed using Go.
  • User-friendly interface.
  • Lightning-fast performance, with local site rebuilds completing in a mere ~100 milliseconds.
  • Immediate feedback during the writing process due to its speed.

Transitioning to Hugo from Jekyll was a breeze, and most of the process went smoothly without any major hiccups. Initially, this post served as a way to conduct a smoke test and organize my thoughts on the migration.

Creating a basic Hugo site is straightforward. With Homebrew installed on my Mac, installing Hugo is as simple as:

bash
brew install hugo

Next, let’s set up a minimalist blog and draft our inaugural post:

bash
hugo new site blog-ramhiser
cd blog-ramhiser
hugo new posts/my-first-post.md
echo 'MY FIRST POST' >> content/posts/my-first-post.md

I’m particularly fond of the cactus theme for Hugo because of its sleek appearance. Installing Hugo themes as git submodules is hassle-free. After adding the cactus theme, I copied its example config.toml configuration:

bash
git init
git submodule add git@github.com:digitalcraftsman/hugo-cactus-theme.git themes/hugo-cactus-theme
cp themes/hugo-cactus-theme/exampleSite/config.toml .

To preview the blog, simply run:

bash
hugo server

Now, you’ll have a stylish and minimalist site up and running.

Importing from Jekyll to Hugo is seamless and doesn’t require any third-party tools or plugins. It’s a matter of executing a command:

bash
hugo import jekyll ~/jekyll_blog/ ~/hugo_blog/

Moving on from GitHub Pages to Netlify was a decision spurred by recent issues with GitHub Pages, likely stemming from Jekyll. The process of migrating to Netlify was swift and painless, thanks to the helpful documentation:

  1. Create a new GitHub repository for the Hugo blog.
  2. Add a netlify.toml config file specifying Hugo version 0.31.1.
  3. Create a Netlify account.
  4. Set up a new Netlify site linked to the GitHub repo.
  5. Update DNS entries on Namecheap to point to Netlify’s servers.
  6. Request a Let’s Encrypt TLS certificate via Netlify (which only took a few minutes).
  7. Implement HTTP to HTTPS redirection for enhanced security.

With these steps completed, I’m ready to embark on a new blogging journey with Hugo and Netlify at my side.
