Ramhise: Blog on statistics and machine learning (https://ramhiser.com/)

Feature Selection with a Scikit-Learn Pipeline
https://ramhiser.com/post/2018-03-25-feature-selection-with-scikit-learn-pipeline/

I’m a big advocate for scikit-learn’s pipelines, and for good reason. They offer several advantages:

  • Ensuring reproducibility
  • Simplifying the export of models to JSON for production deployment
  • Structuring preprocessing and hyperparameter search to prevent over-optimistic error estimates

However, one major drawback is the lack of seamless integration with certain scikit-learn modules, particularly feature selection. If you’ve encountered the dreaded RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes, you’re not alone.

After extensive research, I’ve found a solution to make feature selection work seamlessly within a scikit-learn pipeline. But before we dive in, here’s some information about my setup:

  • Python 3.6.4
  • scikit-learn 0.19.1
  • pandas 0.22.0

Now, let’s jump into the implementation:

python
from sklearn import feature_selection
from sklearn import preprocessing
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.pipeline import Pipeline
import numpy as np
import pandas as pd

# Assuming pmlb is installed
from pmlb import fetch_data

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style("darkgrid")

We’ll use the 195_auto_price regression dataset from the Penn Machine Learning Benchmarks, consisting of prices for 159 vehicles and 15 numeric features about the vehicles.

python
X, y = fetch_data('195_auto_price', return_X_y=True)

feature_names = (
    fetch_data('195_auto_price', return_X_y=False)
    .drop(labels="target", axis=1)
    .columns
)

Next, we’ll create a pipeline that standardizes features and trains an extremely randomized tree regression model with 250 trees.

python
pipe = Pipeline(
    [
        ('std_scaler', preprocessing.StandardScaler()),
        ("ET", ExtraTreesRegressor(random_state=42, n_estimators=250))
    ]
)

For feature selection, we’ll use recursive feature elimination (RFE) to select the optimal number of features based on mean squared error (MSE) from 10-fold cross-validation.

python
feature_selector_cv = feature_selection.RFECV(pipe, cv=10, step=1, scoring="neg_mean_squared_error")
feature_selector_cv.fit(X, y)

However, this raises the RuntimeError above: RFECV looks for a coef_ or feature_importances_ attribute on the estimator it is given, and a Pipeline object exposes neither. To resolve this, we extend the Pipeline class with a new PipelineRFE class that forwards the final estimator’s feature_importances_.

python
class PipelineRFE(Pipeline):

    def fit(self, X, y=None, **fit_params):
        super(PipelineRFE, self).fit(X, y, **fit_params)
        self.feature_importances_ = self.steps[-1][-1].feature_importances_
        return self

Now, let’s rerun the code using the PipelineRFE object.

python
pipe = PipelineRFE(
    [
        ('std_scaler', preprocessing.StandardScaler()),
        ("ET", ExtraTreesRegressor(random_state=42, n_estimators=250))
    ]
)

feature_selector_cv = feature_selection.RFECV(pipe, cv=10, step=1, scoring="neg_mean_squared_error")
feature_selector_cv.fit(X, y)

Finally, we can analyze the selected features and their corresponding cross-validated RMSE scores.

python
selected_features = feature_names[feature_selector_cv.support_].tolist()
selected_features
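
To also look at the cross-validated RMSE mentioned above: RFECV stores a cross-validation score for each candidate number of features, and because we scored with neg_mean_squared_error, negating and taking the square root recovers an RMSE per subset size. A minimal sketch, assuming the 0.19-era grid_scores_ attribute (later scikit-learn releases moved these scores elsewhere):

python
# grid_scores_[i] is the CV score with (i + 1) features retained (since step=1)
cv_rmse = np.sqrt(-feature_selector_cv.grid_scores_)

plt.plot(range(1, len(cv_rmse) + 1), cv_rmse)
plt.xlabel("Number of features selected")
plt.ylabel("Cross-validated RMSE")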

And there you have it! Feature selection with a scikit-learn pipeline made easy. Now you can confidently incorporate feature selection into your machine learning workflows.

Adding Dask and Jupyter to a Kubernetes Cluster
https://ramhiser.com/post/2018-05-28-adding-dask-and-jupyter-to-kubernetes-cluster/

Today, we’re diving into setting up Dask and Jupyter on a Kubernetes cluster hosted on AWS. If you haven’t already got a Kubernetes cluster up and running, you might want to check out my previous guide on how to set it up.

Step 1: Install Helm

Helm is like the magic wand for managing Kubernetes packages. We’ll kick off by installing Helm. On Mac OS X, it’s as easy as using brew:

bash
brew update && brew install kubernetes-helm
helm init

Once Helm is initialized, you’ll get a confirmation message stating that Tiller (the server-side component of Helm) has been successfully installed into your Kubernetes Cluster.

Step 2: Install Dask

Now, let’s install Dask using Helm charts. Helm charts are curated application definitions specifically tailored for Helm. First, we need to update the known charts channels and then install the stable version of Dask:

bash
helm repo update
helm install stable/dask

Oops! Looks like we’ve hit a snag. Despite having Dask in the stable Charts channels, the installation failed. The error message hints that we need to grant the serviceaccount API permissions. This involves some Kubernetes RBAC (Role-based access control) configurations.

Thankfully, a StackOverflow post provides us with the solution:

bash
kubectl create serviceaccount --namespace kube-system tiller
kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
kubectl patch deploy --namespace kube-system tiller-deploy -p '{"spec":{"template":{"spec":{"serviceAccount":"tiller"}}}}'
helm init --service-account tiller --upgrade

Let’s give installing Dask another shot:

bash
helm install stable/dask

Voila! Dask is now successfully installed on our Kubernetes cluster. Helm has assigned the deployment the name “running-newt”. You’ll notice various resources such as pods and services prefixed with “running-newt”. The deployment includes a dask-scheduler, a dask-jupyter, and three dask-worker processes by default.

Also, take note of the default Jupyter password: “dask”. We’ll need it to log in to our Jupyter server later.

Step 3: Obtain AWS DNS Entry

Before we can access our deployed Jupyter server, we need to determine the URL. Let’s list all services in the namespace:

bash
kubectl get services

Rather than IP addresses, the EXTERNAL-IP column displays long hexadecimal-looking hostnames; these correspond to AWS ELB (Elastic Load Balancer) entries. Match the EXTERNAL-IP to the appropriate load balancer in your AWS console (EC2 -> Load Balancers) to obtain the exposed DNS entry.
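
Alternatively, kubectl can print the load balancer hostname directly via a JSONPath query. A hedged sketch; the service name below is only a guess based on the "running-newt" release above, so substitute whatever Jupyter service name kubectl get services reports:

bash
# Substitute the Jupyter service name shown by `kubectl get services`
# (for the example release it looks something like running-newt-dask-jupyter)
kubectl get service running-newt-dask-jupyter \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'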

Step 4: Access Jupyter Server

Now, fire up your browser and head over to the Jupyter server using the obtained DNS entry. You’ll be prompted to enter the Jupyter password, which, as we remember, is “dask”. And there you have it – you’re all set to explore Dask and Jupyter on your Kubernetes cluster!

Interpreting Machine Learning Algorithms
https://ramhiser.com/post/2018-05-26-interpreting-machine-learning-algorithms/

Understanding and interpreting machine learning algorithms can be a challenging task, especially when dealing with nonlinear and non-monotonic response functions. These types of functions can exhibit changes in both positive and negative directions, and their rates of change may vary unpredictably with alterations in independent variables. In such cases, the traditional interpretability measures often boil down to relative variable importance measures, offering limited insights into the inner workings of the model.

However, introducing monotonicity constraints can transform these complex, non-monotonic models into highly interpretable ones, potentially even interpretable enough to meet regulatory requirements.
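
As a concrete illustration of the idea, here is a minimal sketch of imposing monotonicity constraints with XGBoost's monotone_constraints parameter. The toy data, hyperparameters, and the choice of XGBoost are illustrative assumptions on my part rather than anything this post prescribes:

python
import numpy as np
import xgboost as xgb

# Toy data: y increases with x0 and decreases with x1 (plus noise)
rng = np.random.RandomState(42)
X = rng.uniform(size=(500, 2))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "reg:squarederror",
    # Force the learned response to be increasing in x0 and decreasing in x1
    "monotone_constraints": "(1,-1)",
    "max_depth": 3,
    "eta": 0.1,
}
model = xgb.train(params, dtrain, num_boost_round=200)

With the constraint in place, the fitted response is guaranteed to move in a single direction with respect to each constrained feature, which makes directional statements about the model defensible.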

Variable importance measures, while commonly used, often fall short in providing detailed insights into the directionality of a variable’s impact on the response function. Instead, they merely indicate the magnitude of a variable’s relationship relative to others in the model.
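
One common remedy for this missing directionality is a partial dependence plot, which traces how the predicted response moves as a single variable changes. A minimal sketch with scikit-learn's gradient boosting; the toy dataset, model, and feature indices are illustrative, and in the 0.19-era releases used elsewhere on this blog the helper lives in sklearn.ensemble.partial_dependence (newer releases moved it to sklearn.inspection):

python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble.partial_dependence import plot_partial_dependence

# Toy nonlinear regression problem
X, y = make_friedman1(n_samples=500, random_state=42)

gbr = GradientBoostingRegressor(random_state=42).fit(X, y)

# Each panel shows the direction and shape of a feature's marginal effect,
# which a bare importance score cannot convey
fig, axes = plot_partial_dependence(gbr, X, features=[0, 1, 3])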

One quote particularly resonates with many data scientists and machine learning practitioners: the realization that understanding a model’s implementation details and validation scores might not suffice to inspire trust in its results among end-users. While technical descriptions and standard assessments like cross-validation and error measures may suffice for some, many practitioners require additional techniques to foster trust and comprehension in machine learning models and their outcomes.

In essence, interpreting machine learning algorithms requires going beyond conventional practices. It involves exploring novel techniques and approaches to enhance understanding and build confidence in the models’ predictions and insights.

Setting Up a Kubernetes Cluster on AWS in 5 Minutes
https://ramhiser.com/post/2018-05-20-setting-up-a-kubernetes-cluster-on-aws-in-5-minutes/

Creating a Kubernetes cluster on AWS may seem like a daunting task, but with the right guidance, it can be accomplished in just a few minutes. Kubernetes, often described as magic, offers a powerful platform for managing containerized applications at scale. In this simplified guide, we’ll walk through the process of setting up a Kubernetes cluster on AWS.

Before we begin, make sure you have an AWS account and the AWS Command Line Interface installed. You’ll also need to configure the AWS CLI with your access key ID and secret access key.

bash
$ aws configure

Now, let’s install the necessary Kubernetes CLI utilities, kops and kubectl. If you’re on Mac OS X, you can use Homebrew for installation:

bash
brew update && brew install kops kubectl

With the utilities installed, we can proceed to set up the Kubernetes cluster. First, create an S3 bucket to store the state of the cluster:

bash
$ aws s3api create-bucket --bucket your-bucket-name --region your-region

Enable versioning for the bucket to facilitate reverting or recovering previous states:

bash
$ aws s3api put-bucket-versioning --bucket your-bucket-name --versioning-configuration Status=Enabled

Next, set up two environment variables, KOPS_CLUSTER_NAME and KOPS_STATE_STORE, to define the cluster name and the S3 bucket location for storing state:

bash
export KOPS_CLUSTER_NAME=your-cluster-name
export KOPS_STATE_STORE=s3://your-bucket-name

Now, generate the cluster configuration:

bash
$ kops create cluster --node-count=2 --node-size=t2.medium --zones=your-zone

This command creates the cluster configuration and writes it to the specified S3 bucket. You can edit the cluster configuration if needed:

bash
$ kops edit cluster

Once you’re satisfied with the configuration, build the cluster:

bash
$ kops update cluster --name ${KOPS_CLUSTER_NAME} --yes

After a few minutes, validate the cluster to ensure that the master and nodes have launched successfully:

bash
$ kops validate cluster

Finally, verify that the Kubernetes nodes are up and running:

bash
$ kubectl get nodes

Congratulations! You now have a fully functional Kubernetes cluster running on AWS. To further explore the capabilities of Kubernetes, consider deploying applications such as the Kubernetes Dashboard for managing your cluster with ease. Enjoy your journey into the world of Kubernetes!

I Was on a Machine Learning for Geosciences Podcast
https://ramhiser.com/post/2018-05-17-i-was-on-a-machine-learning-for-geosciences-podcast/

I recently had the pleasure of being a guest on a machine learning podcast called Undersampled Radio, and it was a blast! Hosted by Gram Ganssle and Matt Hall, the podcast delved into various topics surrounding the intersection of machine learning and the geosciences, with a particular focus on the oil and gas industry, where I work at Novi Labs.

During the episode, we covered a range of intriguing subjects:

  1. Introduction: Getting to know each other and setting the stage for the conversation.
  2. Austin Deep Learning: Exploring the machine learning scene in Austin, Texas, where the podcast is based.
  3. Overview of Novi Labs: Discussing the role of Novi Labs in leveraging machine learning for the oil and gas sector.
  4. Predicting Oil and Gas Production: Delving into the complexities and challenges of predicting production in the oil and gas industry using machine learning techniques.
  5. Do we need experts?: Considering the role of domain expertise in conjunction with machine learning algorithms.
  6. AI vs Physics Models: Comparing the strengths and weaknesses of artificial intelligence models with traditional physics-based models.
  7. Karpatne paper: Machine Learning for the Geosciences: Reflecting on the insights and implications of the Karpatne paper regarding the application of machine learning in geosciences.
  8. Answering scientific questions with machine learning: Exploring how machine learning can contribute to answering fundamental scientific questions in geosciences.
  9. What to study in school for machine learning: Offering advice for individuals interested in pursuing a career in machine learning, particularly in the geosciences field.
  10. Puzzle: Engaging in a thought-provoking puzzle or challenge.
  11. What we’re currently reading: Sharing recommendations for interesting books or articles related to machine learning and geosciences.

Overall, it was an enriching and enjoyable experience, and I’m grateful to the hosts for their hospitality and thought-provoking questions. If you’re interested in exploring the fascinating world of machine learning in the geosciences, I highly recommend giving Undersampled Radio a listen!

Autoencoders with Keras
https://ramhiser.com/post/2018-05-14-autoencoders-with-keras/

Autoencoders have become an intriguing tool for data compression, and implementing them in Keras is surprisingly straightforward. In this post, I’ll delve into autoencoders, borrowing insights from the Keras blog by Francois Chollet.

Autoencoders, unlike traditional compression methods like JPEG or MPEG, learn a specific lossy compression based on the data examples provided, rather than relying on broad assumptions about images, sound, or video. They consist of three main components:

  1. Encoding function
  2. Decoding function
  3. Loss function

The encoding and decoding functions are typically neural networks, and they need to be differentiable so that the loss can be minimized with respect to their parameters, for example via stochastic gradient descent.

So, what are autoencoders good for?

  1. Data denoising
  2. Dimension reduction
  3. Data visualization

For dimension reduction, autoencoders offer a nonlinear alternative to linear methods like PCA. The reduction is a natural outcome of the lossy compression process, and it also aids in denoising and in pre-training other machine learning algorithms.

Let’s explore the basics of autoencoders using Keras with the following models:

  1. Simple Autoencoder
  2. Deep Autoencoder
  3. Convolutional Autoencoder
  4. A second Convolutional Autoencoder for denoising images

First, let’s set up our environment and load the MNIST dataset for experimentation:

python
from IPython.display import Image, SVG
import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np
import keras
from keras.datasets import mnist
from keras.models import Model, Sequential
from keras.layers import Input, Dense, Conv2D, MaxPooling2D, UpSampling2D, Flatten, Reshape
from keras import regularizers

# Load and scale the MNIST dataset
(x_train, _), (x_test, _) = mnist.load_data()
max_value = float(x_train.max())
x_train = x_train.astype('float32') / max_value
x_test = x_test.astype('float32') / max_value
x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))
x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:])))

Now, let’s dive into the different types of autoencoders. We’ll start with a Simple Autoencoder.
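
Since this excerpt ends before the code, here is a minimal sketch of the simple autoencoder in the spirit of the Keras blog post cited above: a single fully connected hidden layer as the encoding, trained to reconstruct its own input. The 32-dimensional code size, optimizer, and training settings are illustrative choices:

python
input_dim = x_train.shape[1]   # 784 for the flattened MNIST digits
encoding_dim = 32              # compression factor of roughly 24.5

# Encoder: input -> 32-dimensional code; decoder: code -> reconstructed input
input_img = Input(shape=(input_dim,))
encoded = Dense(encoding_dim, activation='relu')(input_img)
decoded = Dense(input_dim, activation='sigmoid')(encoded)

autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

autoencoder.fit(x_train, x_train,
                epochs=50,
                batch_size=256,
                shuffle=True,
                validation_data=(x_test, x_test))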

Building Scikit-Learn Pipelines With Pandas DataFrames
https://ramhiser.com/post/2018-04-16-building-scikit-learn-pipeline-with-pandas-dataframe/

Working with scikit-learn alongside pandas DataFrames has often been a source of frustration due to the lack of seamless integration between the two. However, by leveraging scikit-learn’s Pipeline functionality, we can simplify this process significantly. In this post, I’ll walk you through building a scikit-learn Pipeline that seamlessly integrates with pandas DataFrames, making your machine learning workflows more efficient and intuitive.

Integrating scikit-learn with Pandas DataFrames

Scikit-learn operates primarily on numpy matrices, which don’t preserve important DataFrame attributes such as feature names and column data types. This lack of integration can make preprocessing and model building cumbersome, especially when dealing with categorical features and missing values.

To address these challenges, we’ll build a Pipeline with the following objectives:

  1. Apply a ColumnSelector to filter relevant columns from the DataFrame
  2. Use a TypeSelector to differentiate between numerical, categorical, and boolean features
  3. Construct a preprocessing Pipeline to handle missing values, encode categorical features, and scale numerical features
  4. Combine the preprocessing Pipeline with a classifier for model training and evaluation

Example with Churn Dataset

For our demonstration, we’ll use the churn binary classification dataset from the Penn Machine Learning Benchmarks. This dataset contains 5000 observations with 15 numeric features, 2 binary features, and 2 categorical features.
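
The snippets below don't repeat their imports, so for completeness here is the set they assume (note that in scikit-learn 0.19, the version used elsewhere on this blog, Imputer lives in sklearn.preprocessing; later releases replaced it with SimpleImputer):

python
import numpy as np
import pandas as pd
import pmlb

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline, FeatureUnion
from sklearn.preprocessing import Imputer, StandardScaler, OneHotEncoder
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split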

Let’s start by loading the dataset and setting appropriate column data types.

python
# Load the dataset; pmlb returns a pandas DataFrame when return_X_y=False
df = pmlb.fetch_data('churn', return_X_y=False)

# Define feature columns, excluding the target and the uninformative phone number
x_cols = [c for c in df if c not in ["target", "phone number"]]
binary_features = ["international plan", "voice mail plan"]
categorical_features = ["state", "area code"]

# Cast columns so the TypeSelector below can tell the feature types apart
df[categorical_features] = df[categorical_features].astype("category")
df[binary_features] = df[binary_features].astype("bool")

Building the Pipeline Components

1. Column Selector

python
class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        try:
            return X[self.columns]
        except KeyError:
            cols_error = list(set(self.columns) - set(X.columns))
            raise KeyError("The DataFrame does not include the columns: %s" % cols_error)

2. Type Selector

python
class TypeSelector(BaseEstimator, TransformerMixin):
    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.select_dtypes(include=[self.dtype])

3. Preprocessing Pipeline

python
preprocess_pipeline = make_pipeline(
    ColumnSelector(columns=x_cols),
    FeatureUnion(transformer_list=[
        ("numeric_features", make_pipeline(
            TypeSelector(np.number),
            Imputer(strategy="median"),
            StandardScaler()
        )),
        ("categorical_features", make_pipeline(
            TypeSelector("category"),
            Imputer(strategy="most_frequent"),
            OneHotEncoder()
        )),
        ("boolean_features", make_pipeline(
            TypeSelector("bool"),
            Imputer(strategy="most_frequent")
        ))
    ])
)

Model Training and Evaluation

python
# Hold out a test set (an illustrative 80/20 stratified split); note that the
# pipeline consumes the pandas DataFrame directly rather than a numpy array
X_train, X_test, y_train, y_test = train_test_split(
    df[x_cols], df["target"], test_size=0.2, stratify=df["target"], random_state=42
)

classifier_pipeline = make_pipeline(
    preprocess_pipeline,
    SVC(kernel="rbf", random_state=42)
)

param_grid = {
    "svc__gamma": [0.1 * x for x in range(1, 6)]
}

classifier_model = GridSearchCV(classifier_pipeline, param_grid, cv=10)
classifier_model.fit(X_train, y_train)
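
To close the loop on evaluation, the fitted GridSearchCV object exposes the usual scoring attributes. A brief check, assuming the held-out split defined above:

python
# Best hyper-parameters and their mean cross-validated accuracy
print(classifier_model.best_params_)
print(classifier_model.best_score_)

# Accuracy on the held-out test set; the full pipeline, including preprocessing,
# is refit on the training data and applied here
print(classifier_model.score(X_test, y_test))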

Conclusion

By building a scikit-learn Pipeline with pandas DataFrame-friendly components, we’ve simplified the integration process and created a streamlined workflow for preprocessing and model building. This approach enhances reproducibility, scalability, and readability of machine learning pipelines, ultimately leading to more efficient model development and deployment.

High-Dimensional Microarray Data Sets in R for Machine Learning
https://ramhiser.com/blog/2012/12/29/high-dimensional-microarray-data-sets-in-r-for-machine-learning/

In my pursuit of machine learning research, I often delve into small-sample, high-dimensional bioinformatics datasets. A significant portion of my work focuses on exploring new methodologies tailored to these datasets. For example, I’ve published a paper discussing this very topic.

Many studies in the field of machine learning rely heavily on two prominent datasets: the Alon colon cancer dataset and the Golub leukemia dataset. Despite their popularity, both datasets were introduced in papers published back in 1999. This indicates a potential mismatch between existing methodologies and the advancements in data collection technology. Moreover, the Golub dataset, while widely used, isn’t ideal as a benchmark due to its well-separated nature, leading to nearly perfect classification by most methods.

To address this gap, I embarked on a mission to discover alternative datasets that could serve as valuable resources for researchers like myself. What initially started as a small-scale project quickly evolved into something more substantial. As a result, I’ve curated a collection of datasets and packaged them conveniently for easy access and analysis. This effort culminated in the creation of the datamicroarray package, which is now available on my GitHub account.

Each dataset included in the package comes with a script for downloading, cleaning, and storing the data as a named list. For detailed instructions on data storage and usage, refer to the README file provided with the package. Currently, the datamicroarray package comprises 20 datasets specifically tailored for assessing machine learning algorithms and models in the context of small-sample, high-dimensional data.

Additionally, I’ve supplemented the package with a comprehensive wiki hosted on the GitHub repository. This wiki serves as a valuable resource, offering detailed descriptions of each dataset along with additional information, including links to the original papers for reference.

One challenge I’ve encountered is the large file size of the R package, primarily due to storing an RData file for each dataset. To mitigate this issue, I’m actively exploring alternative approaches for dynamically downloading data. I welcome any suggestions or contributions from the community in this regard. Additionally, I must acknowledge that some data descriptions within the package are incomplete, and I would greatly appreciate assistance in enhancing them.

Researchers are encouraged to leverage any of the datasets provided in the datamicroarray package for their work. However, it’s essential to ensure proper data processing before conducting analysis and incorporating the results into research endeavors.

Bye Bye, Jekyll. Hello, Hugo
https://ramhiser.com/post/2017-12-28-bye-bye-jekyll-hello-hugo/

I’ve decided to bid farewell to Jekyll after encountering one too many Liquid errors and dealing with the Ruby dependency headache. In its place, I’m embracing Hugo—a swift transition that eliminates any excuses hindering my posting schedule.

Here are some key highlights about Hugo:

  • Developed using Go.
  • User-friendly interface.
  • Lightning-fast performance, with local site rebuilds completing in a mere ~100 milliseconds.
  • Immediate feedback during the writing process due to its speed.

Transitioning to Hugo from Jekyll was a breeze, and most of the process went smoothly without any major hiccups. Initially, this post served as a way to conduct a smoke test and organize my thoughts on the migration.

Creating a basic Hugo site is straightforward. With Homebrew installed on my Mac, installing Hugo is as simple as:

bash
brew install hugo

Next, let’s set up a minimalist blog and draft our inaugural post:

bash
hugo new site blog-ramhiser
cd blog-ramhiser
hugo new posts/my-first-post.md
echo 'MY FIRST POST' >> content/posts/my-first-post.md

I’m particularly fond of Hugo’s cactus theme for its sleek appearance. Installing Hugo themes via git submodules is hassle-free. After adding the cactus theme, I copied the example config.toml configuration:

bash
git init
git submodule add git@github.com:digitalcraftsman/hugo-cactus-theme.git themes/hugo-cactus-theme
cp themes/hugo-cactus-theme/exampleSite/config.toml .

To preview the blog, simply run:

bash
hugo server

Now, you’ll have a stylish and minimalist site up and running.

Importing from Jekyll to Hugo is seamless and doesn’t require any third-party tools or plugins. It’s a matter of executing a command:

bash
hugo import jekyll ~/jekyll_blog/ ~/hugo_blog/

Moving on from GitHub Pages to Netlify was a decision spurred by recent issues with GitHub Pages, likely stemming from Jekyll. The process of migrating to Netlify was swift and painless, thanks to the helpful documentation:

  1. Create a new GitHub repository for the Hugo blog.
  2. Add a netlify.toml config file specifying Hugo version 0.31.1.
  3. Create a Netlify account.
  4. Set up a new Netlify site linked to the GitHub repo.
  5. Update DNS entries on Namecheap to point to Netlify’s servers.
  6. Request a Let’s Encrypt TLS certificate via Netlify (which only took a few minutes).
  7. Implement HTTP to HTTPS redirection for enhanced security.

With these steps completed, I’m ready to embark on a new blogging journey with Hugo and Netlify at my side.

Installing TensorFlow on an AWS EC2 Instance with GPU Support
https://ramhiser.com/2016/01/05/installing-tensorflow-on-an-aws-ec2-instance-with-gpu-support/

Here’s a guide on installing TensorFlow 0.6 on an Amazon EC2 Instance with GPU Support. Additionally, a Public AMI (ami-e191b38b) is provided with the configured setup for convenience.

Note: Updated on Jan 28, 2016, to reflect the requirement of Bazel 0.1.4 and to export environment variables in ~/.bashrc.

The installation includes:

  • Essentials
  • Cuda Toolkit 7.0
  • cuDNN Toolkit 6.5
  • Bazel 0.1.4 (requires Java 8)
  • TensorFlow 0.6

To begin, it’s recommended to request a spot instance to save costs. Launch a g2.2xlarge instance using the Ubuntu Server 14.04 LTS AMI.

After instance launch, install essentials:

bash
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install -y build-essential git python-pip libfreetype6-dev libxft-dev libncurses-dev libopenblas-dev gfortran python-matplotlib libblas-dev liblapack-dev libatlas-base-dev python-dev python-pydot linux-headers-generic linux-image-extra-virtual unzip python-numpy swig python-pandas python-sklearn unzip wget pkg-config zip g++ zlib1g-dev
sudo pip install -U pip

Next, install CUDA Toolkit 7.0:

bash
wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1410/x86_64/cuda-repo-ubuntu1410_7.0-28_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1410_7.0-28_amd64.deb
rm cuda-repo-ubuntu1410_7.0-28_amd64.deb
sudo apt-get update
sudo apt-get install -y cuda

Download and install cuDNN Toolkit 6.5:

bash
# cudnn-6.5-linux-x64-v2.tgz must be downloaded manually from NVIDIA's developer
# site (registration required) and copied to the instance (e.g., via scp) first
tar -zxf cudnn-6.5-linux-x64-v2.tgz && rm cudnn-6.5-linux-x64-v2.tgz
sudo cp -R cudnn-6.5-linux-x64-v2/lib* /usr/local/cuda/lib64/
sudo cp cudnn-6.5-linux-x64-v2/cudnn.h /usr/local/cuda/include/

Reboot the instance:

bash
sudo reboot

Set environment variables:

bash
# Single quotes defer variable expansion until ~/.bashrc is sourced;
# CUDA_ROOT is not yet defined in the current shell
echo 'export CUDA_HOME=/usr/local/cuda' >> ~/.bashrc
echo 'export CUDA_ROOT=/usr/local/cuda' >> ~/.bashrc
echo 'export PATH=$PATH:$CUDA_ROOT/bin' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_ROOT/lib64' >> ~/.bashrc
source ~/.bashrc

Install Java 8 and Bazel 0.1.4:

bash
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get update
sudo apt-get install -y oracle-java8-installer
sudo apt-get install pkg-config zip g++ zlib1g-dev
wget https://github.com/bazelbuild/bazel/releases/download/0.1.4/bazel-0.1.4-installer-linux-x86_64.sh
chmod +x bazel-0.1.4-installer-linux-x86_64.sh
./bazel-0.1.4-installer-linux-x86_64.sh --user
rm bazel-0.1.4-installer-linux-x86_64.sh

Clone TensorFlow repo:

bash
git clone --recurse-submodules https://github.com/tensorflow/tensorflow
cd tensorflow

Build TensorFlow with GPU support:

bash
TF_UNOFFICIAL_SETTING=1 ./configure

During configuration, specify a CUDA compute capability of 3.0; the GRID K520 GPU on g2.2xlarge instances sits below TensorFlow's officially supported minimum of 3.5, which is why the unofficial-settings flag is needed. Then, build TensorFlow:

bash
bazel build -c opt --config=cuda //tensorflow/cc:tutorials_example_trainer
bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
sudo pip install --upgrade /tmp/tensorflow_pkg/tensorflow-0.6.0-cp27-none-linux_x86_64.whl

Congratulations! TensorFlow is now installed with GPU support. Test the installation by running Python code that utilizes TensorFlow. You should see GPU-related messages indicating successful setup.
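
As a quick smoke test, something like the following works with the TensorFlow 0.x API; enabling device placement logging makes the GPU messages visible, though the exact log lines will vary:

python
import tensorflow as tf

# Log device placement so ops assigned to the GPU show up in the session output
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

a = tf.constant([1.0, 2.0, 3.0], name='a')
b = tf.constant([4.0, 5.0, 6.0], name='b')
print(sess.run(a + b))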

This guide is a compilation of instructions from various sources, with credits and thanks to the original contributors. For more information and options, refer to TensorFlow’s official installation instructions.
