Coding Archives - Ramhise: Blog on statistics and machine learning

Configuring IPython Notebook Support for PySpark
https://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/ (Fri, 13 Oct 2023)

The post Configuring IPython Notebook Support for PySpark appeared first on Ramhise.

Apache Spark is a powerful tool for handling large-scale data processing, and PySpark offers a convenient way to interact with Spark using Python. Recently, a colleague and I wondered if it was possible to run PySpark from within an IPython Notebook. As it turns out, setting up PySpark support in an IPython profile is quite straightforward.

Here’s a quick summary of the process:

  1. Install Spark: Download the latest Spark release source, unzip it to your desired location (e.g., ~/spark-1.2.0/), install the Scala build tool (sbt) using Homebrew, and build Spark using sbt assembly.
  2. Create PySpark Profile for IPython: Start by creating a new IPython profile for PySpark using the command ipython profile create pyspark. To avoid port conflicts, update the default port to 42424 within ~/.ipython/profile_pyspark/ipython_notebook_config.py. Set the necessary environment variables in .bashrc or .bash_profile, including SPARK_HOME and PYSPARK_SUBMIT_ARGS. Create a file named ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py to configure the Spark environment.
  3. Launch IPython Notebook with PySpark Profile: Use the command ipython notebook --profile=pyspark to launch a notebook with the PySpark profile.
  4. Word Count Example: Ensure that the PySpark profile has created a SparkContext by typing sc within the notebook. Then, load a text file into a Spark RDD (e.g., the Spark README file). Execute a word count script to split each line into words, map each word to a tuple with an initial count of 1, and sum up the counts for each word. Finally, execute the pipeline to count the words.
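Concretely, steps 1-3 look roughly like the following. The paths, Spark version, and the `--master local[2]` argument are examples, not prescriptions; adjust them to your own install:

```bash
# 1. Build Spark from source (example version and location)
brew install sbt
cd ~/spark-1.2.0 && sbt assembly

# 2. Create the PySpark profile
ipython profile create pyspark

# In ~/.bashrc or ~/.bash_profile:
export SPARK_HOME=~/spark-1.2.0
export PYSPARK_SUBMIT_ARGS='--master local[2]'

# In ~/.ipython/profile_pyspark/ipython_notebook_config.py, change the port:
#   c.NotebookApp.port = 42424

# 3. Launch the notebook with the PySpark profile
ipython notebook --profile=pyspark
```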

With these steps, you can configure IPython Notebook support for PySpark and perform tasks like word count analysis seamlessly within the notebook environment.
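The word-count pipeline in step 4 boils down to sc.textFile(...).flatMap(...).map(...).reduceByKey(...). The plain-Python sketch below mirrors each RDD stage to show the logic; the input lines are made-up stand-ins for the Spark README:

```python
from collections import defaultdict

# Stand-in for sc.textFile("README.md") -- a few invented lines
lines = ["Apache Spark is fast", "Spark runs everywhere", "fast and general"]

# flatMap: split each line into words
words = [w for line in lines for w in line.split()]

# map: pair each word with an initial count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum up the counts for each word
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(counts["Spark"])  # 2
print(counts["fast"])   # 2
```

In PySpark the final collect() is what triggers execution; until then the pipeline is just a lazily-defined chain of transformations.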

A Brief Look at Mixture Discriminant Analysis
https://ramhiser.com/2013/07/02/a-brief-look-at-mixture-discriminant-analysis/ (Tue, 26 Sep 2023)

The post A Brief Look at Mixture Discriminant Analysis appeared first on Ramhise.

In my recent postdoctoral work focusing on data-driven automated gating, I’ve been extensively exploring finite mixture models. Despite only having a cursory introduction to mixture models in academic settings, I’ve found myself increasingly adept at working with them. This prompted me to delve into their application in classification tasks, particularly in scenarios where a single class comprises multiple non-adjacent subclasses.

To my knowledge, there are two primary approaches—albeit with numerous variants—to applying finite mixture models for classification:

  1. The Fraley and Raftery approach, implemented in the mclust R package.
  2. The Hastie and Tibshirani approach, implemented in the mda R package.

While both methods share similarities, I opted to delve into the latter approach. Here’s the gist: we consider K≥2 classes, each assumed to be a Gaussian mixture of subclasses. This generative model formulation leverages the posterior probability of class membership for classification of unlabeled observations. Each subclass is assumed to possess its own mean vector, with all subclasses sharing a common covariance matrix to maintain model parsimony. The model parameters are estimated via the Expectation-Maximization (EM) algorithm.
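As a sketch of that generative formulation: each class density is a Gaussian mixture over its subclass means, and an unlabeled point is assigned to the class with the largest posterior. All numbers below are invented toy values, with equal subclass weights and a shared identity covariance for simplicity:

```python
import numpy as np

# Two classes, each a mixture of two subclasses: own means, shared covariance
means = {
    0: np.array([[0.0, 0.0], [3.0, 3.0]]),  # class 0 subclass means
    1: np.array([[0.0, 3.0], [3.0, 0.0]]),  # class 1 subclass means
}
priors = {0: 0.5, 1: 0.5}

def class_density(x, mus):
    """Gaussian-mixture density: equal subclass weights, identity covariance."""
    d = x - mus                            # (subclasses, dims)
    sq = np.sum(d * d, axis=1)
    return np.mean(np.exp(-0.5 * sq)) / (2.0 * np.pi)

def posterior(x):
    """Posterior probability of class membership given observation x."""
    joint = {k: priors[k] * class_density(x, means[k]) for k in means}
    total = sum(joint.values())
    return {k: v / total for k, v in joint.items()}

p = posterior(np.array([0.1, -0.2]))
print(max(p, key=p.get))  # 0 -- nearest subclass mean belongs to class 0
```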

While diving into the intricacies of likelihood in the associated literature, I encountered some confusion regarding how to formulate the likelihood to determine each observation’s contribution to estimating the common covariance matrix in the EM algorithm’s M-step. If each subclass had its own covariance matrix, the likelihood would be straightforward—a simple product of individual class likelihoods. However, my confusion stemmed from crafting the complete data likelihood when classes share parameters.
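To make the sharing concrete, here is one way the pooled M-step update can be written in the simplest setting (a single class with m subclasses; the data, means, and responsibilities below are random stand-ins, not estimates from a real E-step). The key point is that every observation contributes to every subclass term, weighted by its responsibility:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 8, 2, 3                      # observations, dimensions, subclasses
X = rng.normal(size=(n, p))            # stand-in data
mu = rng.normal(size=(m, p))           # current subclass mean vectors
Z = rng.random((n, m))
Z /= Z.sum(axis=1, keepdims=True)      # E-step responsibilities; rows sum to 1

# M-step for the shared covariance: pool responsibility-weighted outer
# products of (x_i - mu_k) over all observations and all subclasses
Sigma = np.zeros((p, p))
for i in range(n):
    for k in range(m):
        d = (X[i] - mu[k])[:, None]    # column vector
        Sigma += Z[i, k] * (d @ d.T)
Sigma /= n
```

Because the responsibilities in each row sum to 1, the n weighted outer products average to a single pooled covariance rather than one matrix per subclass.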

To address this, I documented the likelihood explicitly and elucidated the details of the EM algorithm utilized for estimating model parameters. This document is readily available, alongside LaTeX and R code, via the provided link. Should you choose to peruse the document, I welcome any feedback regarding confusing or poorly defined notations. Please note that I’ve omitted additional topics on reduced-rank discrimination and shrinkage.

To evaluate the efficacy of the mixture discriminant analysis (MDA) model, I devised a simple toy example featuring three bivariate classes, each comprising three subclasses. These subclasses were strategically positioned to ensure non-adjacency within each class, resulting in non-Gaussian class distributions. My aim was to assess the MDA classifier’s ability to identify subclasses and compare its decision boundaries with those of linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA), implemented using the MASS package.

From the scatterplots and decision boundaries depicted below, the LDA and QDA classifiers produced the poor decision boundaries one would expect given the non-Gaussian classes, whereas the MDA classifier effectively identified the subclasses. It’s worth noting that in this example, all subclasses share the same covariance matrix, aligning with the MDA classifier’s assumption. Exploring the classifier’s sensitivity to deviations from this assumption and its performance as feature dimensionality surpasses sample size would be intriguing avenues for future investigation.

Installing Python Data Science Stack on Yosemite
https://ramhiser.com/2015/01/24/installing-python-data-science-stack-on-yosemite/ (Tue, 22 Aug 2023)

The post Installing Python Data Science Stack on Yosemite appeared first on Ramhise.

While setting up a Python data-science stack within a fresh virtual environment on my Mac running OS X 10.10.1 (Yosemite), I encountered several frustrating errors. Below, I outline the steps I took that eventually led to a successful installation.

Initially, my main objective was to install version 0.15.2 of scikit-learn using pip install -U scikit-learn. However, I encountered errors during the scipy installation step: while numpy 1.9.1 installed successfully, the scipy 0.15.1 installation failed. I attempted to install scipy 0.15.1 individually but encountered the following error:

[Error message]

After conducting a few Google searches and some trial and error, I attempted a fix based on instructions found in a StackOverflow post. Despite following these instructions, I still encountered errors. Here are the initial steps I took:

  1. Download and install XCode Command Line Tools from Apple.
  2. Installing scipy continued to fail at this point.
  3. Executed sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer/.
  4. Ran brew update followed by brew doctor.
  5. Attempted pip install -U scipy, which failed with the same error message.
  6. Found suggestions in a random README file and executed the following:

```bash
export CC=clang
export CXX=clang++
export LDFLAGS='-L/opt/X11/lib'
export CFLAGS='-I/opt/X11/include -I/opt/X11/include/freetype2'
```

However, this resulted in a new error message:

[New error message]

Based on a couple more posts, I decided to unset LDFLAGS and CFLAGS. After doing so, I attempted to install again but encountered the same error message as before.

At this juncture, feeling quite frustrated, I closed my terminal (iTerm2) and reopened it. Omitting the LDFLAGS and CFLAGS options, I set only:

```bash
export CC=clang
export CXX=clang++
```
This time, I successfully installed both scipy and scikit-learn without encountering any errors.
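For reference, here is the full sequence that finally worked for me, consolidated into one place (versions reflect early 2015; yours will differ):

```bash
# After installing the Xcode Command Line Tools and running:
#   sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer/
unset LDFLAGS
unset CFLAGS
export CC=clang
export CXX=clang++
pip install -U scipy
pip install -U scikit-learn
```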
