John Ramey Statistics and Machine Learning

Configuring IPython Notebook Support for PySpark

Apache Spark is a great way for performing large-scale data processing. Lately, I have begun working with PySpark, a way of interfacing with Spark through Python. After a discussion with a coworker, we were curious whether PySpark could run from within an IPython Notebook. It turns out that this is fairly straightforward by setting up an IPython profile.

Installing Python Data Science Stack on Yosemite

I was attempting to install the Python data-science stack within a fresh virtual environment on my Mac with OS X 10.10.1 (Yosemite) but encountered various frustrating errors. I logged my steps below that eventually yielded a successful installation.

MLB Rankings Using the Bradley-Terry Model

Today, I take my first shots at ranking Major League Baseball (MLB) teams. I see my efforts at prediction and ranking an ongoing process so that my models improve, the data I incorporate are more meaningful, and ultimately my predictions are largely accurate. For the first attempt, let’s rank MLB teams using the Bradley-Terry (BT) model.

A Brief Look at Mixture Discriminant Analysis

Lately, I have been working with finite mixture models for my postdoctoral work on data-driven automated gating. Given that I had barely scratched the surface with mixture models in the classroom, I am becoming increasingly comfortable with them. With this in mind, I wanted to explore their application to classification because there are times when a single class is clearly made up of multiple subclasses that are not necessarily adjacent.

High-Dimensional Microarray Data Sets in R for Machine Learning

Much of my research in machine learning is aimed at small-sample, high-dimensional bioinformatics data sets. For instance, here is a paper of mine on the topic.