High-Dimensional Microarray Data Sets in R for Machine Learning

Much of my research in machine learning is aimed at small-sample, high-dimensional bioinformatics data sets. For instance, here is a paper of mine on the topic.

A large number of papers proposing new machine-learning methods for high-dimensional data use the same two data sets and consider few others: 1) the Alon colon cancer data set, and 2) the Golub leukemia data set. Both of the corresponding papers were published in 1999, which indicates that the methods are not keeping up with data-collection technology. Furthermore, the Golub data set is not useful as a benchmark because its classes are so well separated that most methods achieve nearly perfect classification.

My goal has been to find several alternative data sets and provide them in a convenient location so that I could load and analyze them easily and then incorporate the results into my papers. Initially, I aimed to identify a few more data sets, but after I got going on this effort, I found a lot more. What started as a small project turned into something that has saved me a lot of time. I have created the datamicroarray package available from my GitHub account. For each data set included in the package, I have provided a script to download, clean, and save the data set as a named list. See the README file for more details about how the data are stored.
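For example, loading and inspecting one of the data sets in R looks roughly like this (a sketch: the element names `x` and `y` reflect the named-list convention described in the README, so check it for the exact structure):

```r
library(datamicroarray)

# Load the Alon colon cancer data set; it arrives as a named list.
data('alon', package = 'datamicroarray')

# Assumed convention: x is the matrix of gene-expression measurements
# (samples by genes) and y is the vector of class labels.
dim(alon$x)
table(alon$y)
```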

Currently, the package consists of 20 small-sample, high-dimensional data sets to assess machine learning algorithms and models. I have also included a wiki on the package’s GitHub repository that describes each data set and provides additional information, including a link to the original papers.

The biggest drawback at the moment is the file size of the R package because I store an RData file for each data set. I am investigating alternative approaches to download the data dynamically and am open to suggestions. Also note that the data descriptions are incomplete, so assistance is appreciated.

Feel free to use any of the data sets. As a disclaimer, you should ensure that the data are processed correctly before analyzing and incorporating the results into your own work.

How to Download Kaggle Data With Python and requests.py

Recently I started playing with Kaggle. I quickly became frustrated that in order to download their data I had to use their website. I prefer instead the option to download the data programmatically. After some Googling, the best recommendation I found was to use lynx. My friend Anthony recommended that alternatively I should write a Python script.

Although Python is not my primary language, I was intrigued by how simple it was to write the script using requests.py. In this example, I download the training data set from Kaggle’s Digit Recognizer competition.

The idea is simple:

1. Attempt to download a file from Kaggle, and get blocked because you are not logged in.
2. Log in with requests.py.
3. Retry the download with the now-authenticated session.

Here’s the code:
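This is a sketch rather than a definitive script: the login URL and the form-field names are assumptions based on Kaggle's web login form, so verify them against the actual page before running.

```python
import requests

# NOTE: These URLs and form-field names are assumptions; check them
# against Kaggle's actual login form and competition page.
LOGIN_URL = "https://www.kaggle.com/account/login"
TRAIN_URL = "https://www.kaggle.com/c/digit-recognizer/download/train.csv"


def save_chunks(chunks, out_file):
    """Write an iterable of byte chunks to disk."""
    with open(out_file, "wb") as f:
        for chunk in chunks:
            f.write(chunk)


def download_kaggle_file(username, password, url, out_file,
                         chunk_size=512 * 1024):
    """Log in to Kaggle, then stream a competition file to disk."""
    with requests.Session() as session:
        # The initial request is blocked (redirected to the login page)
        # because we are not logged in.
        session.get(url)
        # Log in so the session cookie authorizes the download.
        session.post(LOGIN_URL,
                     data={"UserName": username, "Password": password})
        # Retry the download with the authenticated session.
        r = session.get(url, stream=True)
        r.raise_for_status()
        save_chunks(r.iter_content(chunk_size=chunk_size), out_file)


# Example usage (fill in your credentials):
# download_kaggle_file("my_username", "my_password", TRAIN_URL, "train.csv")
```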

Simply change my_username and my_password to your Kaggle login info. Feel free to optimize the chunk size to your liking.

Setting Up the Development Version of R

My coworkers at Fred Hutchinson regularly use the development version of R (i.e., R-devel) and have urged me to do the same. This post details how I have set up the development version of R on our Linux server, which I use remotely because it is much faster than my Mac.

First, I downloaded the R-devel source via Subversion into ~/local/, which is short for /home/jramey/local/, then configured my installation and compiled the source. I recommend these Subversion tips if you are building from source. Here are the commands to install R-devel.
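A sketch of the commands (the SVN URL is the official R repository; the paths and --prefix match my ~/local/ setup, so adjust them to taste):

```bash
# Check out the R-devel source from the R project's Subversion repository.
svn checkout https://svn.r-project.org/R/trunk ~/local/R-devel
cd ~/local/R-devel

# Fetch the recommended packages, which are not included in the SVN repo.
./tools/rsync-recommended

# Configure and compile, installing into ~/local.
./configure --prefix=$HOME/local
make && make install
```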

The third command downloads the recommended R packages and is crucial because the source for the recommended R packages is not included in the SVN repository. For more about this, go here.

We have the release version (currently, it is 2.15.1) installed in /usr/local/bin. But the goal here is to give priority to R-devel. So, I add the following to my ~/.bashrc file:
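A sketch, assuming R-devel was installed with `--prefix=$HOME/local` as above:

```bash
# Prepend ~/local/bin so R-devel is found before the release version
# in /usr/local/bin.
export PATH=$HOME/local/bin:$PATH

# Load R-devel quietly without saving or restoring the workspace.
alias R='R --quiet --no-save --no-restore'
```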

Notice that the last line that I add to my ~/.bashrc file is to load R-devel quietly without saving or restoring.

Next, I install the R packages that I use the most.
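As a sketch (the package list here is purely illustrative; substitute your own favorites):

```r
# Illustrative list only; swap in the packages you use most.
install.packages(c("devtools", "ggplot2", "plyr", "reshape2", "knitr",
                   "Rcpp", "mvtnorm", "caret"),
                 dependencies = TRUE)
```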

Then, I update my .Rprofile file, which I keep in a GitHub gist.

Finally, my coworkers focus on flow cytometry data, and our group maintains several Bioconductor packages related to this type of data. To install the majority of them, we simply install the flowWorkspace package in R:
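Using Bioconductor's installer (biocLite, at the time of writing), this amounts to:

```r
# Install flowWorkspace, pulling in the rest of our flow cytometry
# packages as dependencies.
source("http://bioconductor.org/biocLite.R")
biocLite("flowWorkspace")
```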

Chapter 2 Solutions - Statistical Methods in Bioinformatics

As I have mentioned previously, I have begun reading Statistical Methods in Bioinformatics by Ewens and Grant and working selected problems for each chapter. In this post, I will give my solutions to a handful of problems. The first problem is pretty straightforward.

Problem 2.20

Suppose that a parent of genetic type Mm has three children. Then the parent transmits the M gene to each child with probability 1/2, and the genes that are transmitted to each of the three children are independent. Let $I_1 = 1$ if children 1 and 2 had the same gene transmitted, and $I_1 = 0$ otherwise. Similarly, let $I_2 = 1$ if children 1 and 3 had the same gene transmitted, $I_2 = 0$ otherwise, and let $I_3 = 1$ if children 2 and 3 had the same gene transmitted, $I_3 = 0$ otherwise.

The question first asks us to show that the three random variables are pairwise independent but not independent. The pairwise independence follows directly from the independence of the genes transmitted to the three children. Now, to show that the three random variables are not independent, denote by $p_j$ the probability that $I_j = 1$, $j = 1, 2, 3$. If we had independence, then the following statement would be true:

$$P(I_1 = 1, I_2 = 1, I_3 = 0) = p_1 p_2 (1 - p_3) = \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{8}.$$

However, notice that the event on the left-hand side can never happen because if $I_1 = 1$ and $I_2 = 1$, then $I_3$ must be 1. Hence, the left-hand side must equal 0, while the right-hand side equals 1/8. Therefore, the three random variables are not independent.

The question also asks us to discuss why the variance of $I_1 + I_2 + I_3$ is equal to the sum of the individual variances. Often, this is the case only if the random variables are independent. But because the random variables here are pairwise independent, the pairwise covariances are all 0. Thus, the equality must hold.

Problems 2.23 - 2.27

While I worked the above problem because of its emphasis on genetics, the following set of problems is much more fun in terms of the mathematics because of its usage of approximations.

For $i = 1, \ldots, n$, let $X_i$ be the $i$th lifetime of certain cellular proteins until degradation. We assume that $X_1, \ldots, X_n$ are iid random variables, each of which is exponentially distributed with rate parameter $\lambda > 0$. Furthermore, let $n = 2m + 1$ be an odd integer.

This set of questions is concerned with the mean and variance of the sample median, $X_{(m + 1)}$, where $X_{(i)}$ denotes the $i$th order statistic. First, note that the mean and variance of the minimum value $X_{(1)}$ are $1/(n\lambda)$ and $1/(n\lambda)^2$, respectively. From the memoryless property of the exponential distribution, the time until the next protein degrades is independent of the time already elapsed. However, there are now $n - 1$ proteins remaining. Thus, the mean and variance of $X_{(2)}$ are $1/(n\lambda) + 1/((n-1)\lambda)$ and $1/(n\lambda)^2 + 1/((n-1)\lambda)^2$, respectively. Continuing in this manner, we have

$$E[X_{(m+1)}] = \frac{1}{\lambda} \sum_{i=0}^{m} \frac{1}{n - i} = \frac{1}{\lambda} \sum_{j=m+1}^{n} \frac{1}{j}$$

and

$$Var[X_{(m+1)}] = \frac{1}{\lambda^2} \sum_{j=m+1}^{n} \frac{1}{j^2}.$$

Approximation of $E[X_{(m + 1)}]$

Now, we wish to approximate the mean with a much simpler formula. First, from (B.7) in Appendix B, we have

$$\sum_{i=1}^{n} \frac{1}{i} \approx \log n + \gamma,$$

where $\gamma$ is Euler’s constant. Then, we can write the expected sample median as

$$E[X_{(m+1)}] = \frac{1}{\lambda} \left( \sum_{i=1}^{n} \frac{1}{i} - \sum_{i=1}^{m} \frac{1}{i} \right) \approx \frac{1}{\lambda} \left( \log n - \log m \right) = \frac{1}{\lambda} \log \frac{2m+1}{m}.$$

Hence, as $n \rightarrow \infty$, this approximation goes to $\frac{\log 2}{\lambda}$, which is the median of an exponentially distributed random variable. Specifically, the median is the solution to $F_X(x) = 1/2$, where $F_X$ denotes the cumulative distribution function of the random variable $X$.

Improved Approximation of $E[X_{(m + 1)}]$

It turns out that we can improve this approximation with the following two results:

Following the derivation of our above approximation, we have that
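As a sketch, assuming the two results in question are the refined harmonic-sum approximation and the first-order expansion of the logarithm,

$$\sum_{i=1}^{n} \frac{1}{i} \approx \log n + \gamma + \frac{1}{2n}, \qquad \log(1 + x) \approx x \quad \text{for small } x,$$

the improved approximation works out to

$$E[X_{(m+1)}] \approx \frac{1}{\lambda} \left[ \log \frac{2m+1}{m} + \frac{1}{2n} - \frac{1}{2m} \right] \approx \frac{1}{\lambda} \left[ \log 2 + \frac{1}{2m} + \frac{1}{2n} - \frac{1}{2m} \right] = \frac{\log 2}{\lambda} + \frac{1}{2n\lambda}.$$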

Approximation of $Var[X_{(m + 1)}]$

We can also approximate $Var[X_{(m + 1)}]$ using the approximation

With $a = m+1$ and $b = 2m + 1$, we have
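As a sketch, assuming the approximation referred to is the standard midpoint bound

$$\sum_{i=a}^{b} \frac{1}{i^2} \approx \frac{1}{a - \frac{1}{2}} - \frac{1}{b + \frac{1}{2}},$$

the variance of the sample median is approximately

$$Var[X_{(m+1)}] \approx \frac{1}{\lambda^2} \left( \frac{1}{m + \frac{1}{2}} - \frac{1}{2m + \frac{3}{2}} \right).$$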

Textbook - Statistical Methods in Bioinformatics

As part of my effort to acquaint myself more with biology, bioinformatics, and statistical genetics, I am trying to find as many resources as I can that provide a solid foundation. For instance, I am wading through Molecular Biology of the Cell at a pace of about 10-15 pages per day – this takes nearly an hour every day.

I am also going through Statistical Methods in Bioinformatics by Ewens and Grant and working selected problems for each chapter. My intention is to post my solutions to these chapter exercises. Thus far, I have made it through the first three chapters, and I will begin posting my solutions soon. I am interested particularly in problems regarding statistical topics with which I have little-to-no experience and also topics where I lack intuition regarding the biological applications.

Here is a thumbnail of the book:

Now That We Live in Seattle

It has been just a few weeks since my wife, my son, and I moved to Seattle so that I could begin my postdoc at The Hutch. Now that we have been here a short time and are settled, we intend to start exploring Seattle, doing typical touristy things as well as non-touristy activities that only Seattleites would do. My wife purchased a detailed guide to Seattle that lists numerous activities, restaurants, scenery, etc. that would take months (years?) to complete. I prefer word of mouth though, so I asked some coworkers for recommendations. Here’s what they gave me:

My coworkers recommended parking at The Hutch and riding The Slut to Westlake Center and then walking to Pike Place, which avoids traffic. Also, they recommended that we combine Discovery Park and Ballard into one day. They said the view of Seattle from the ferry when returning from Bainbridge Island is breathtaking.

My wife and I like to hike, so I also asked my coworkers for recommendations. Overwhelmingly, they said these two places:

Any other recommendations? In particular, are there any restaurants that you’d recommend?

And Now I Blog Again

One of my goals for 2012 has been to blog more. Much more. When I first set this goal, I had great aspirations of posting frequently. However, I had a Ph.D. to complete, and quite frankly, it demanded much higher priority. Now that I have submitted my dissertation and completed my Ph.D. requirements, I have several half-finished posts that will appear soon. Also, since I have made the switch to Octopress, I will be relocating selected posts from my previous Wordpress blog.

Goals for 2012

I have never been one to set New Year’s resolutions. Personally, they instill a dangerous personal freedom that often yields naive, subconscious mentalities, such as: I can do anything I want until December 31, and I will change abruptly the next day. However, my Ph.D. adviser has shown me the importance of setting goals in all things that I wish to accomplish, as well as envisioning the finale of an arduous journey like a small child (read “John Ramey”) who pictures the waning warmth of fresh chocolate chip cookies smeared on his face. My adviser has always encouraged short-term and long-term goals but never required them. As I recently found out, my employer does. In addition, these goals are reviewed at the end of the fiscal year so that employees are realistic and held accountable.

Now that I must list these formally at work, I have decided to post a number of career and personal goals here in order to hold myself accountable at the end of 2012 with the implicit assumption that the world does not end. So, one year from now, if we are still here (chuckle), I will review my goal-completion success.

• Read to my son (almost) nightly.
• Hear my son laugh (almost) nightly.
• Take my wife out for a date night each week.
• Treat my wife to a significant outing each month.
• Finish dissertation.
• Successfully defend dissertation.
• Submit 4-6 articles for publication.
• Submit at least 3 R packages to CRAN.
• Attend at least three conferences.
• Make at least two conference presentations.
• Find/maintain employment.
• Construct detailed plan and outline for my textbook.
• Take a real vacation.
• Run a half marathon. (I will consider this a success if I have signed up for an early 2013 race.)
• Transition my personal website from Wordpress to Octopress.
• Blog more.
• Check Tweets and email twice daily at a scheduled time.
• Spend time with my extended family.
• Read the literature at a scheduled time.
• Finish reading Izenman’s Multivariate text.
• Read Lehmann’s Reminiscences of a Statistician: The Company I Kept.
• Read Ewens and Grant’s bioinformatics text.
• Read Bishop’s PRML text.
• Read Gaussian Processes for Machine Learning.
• Read Barber’s Bayesian Reasoning and Machine Learning.
• Read a significant portion of Devroye et al.’s Probabilistic Theory of Pattern Recognition text.
• Reread Robert’s The Bayesian Choice.
• Read Jaynes’ Probability Theory text.
• Read Berger’s Decision Theory text.
• Watch Boyd’s Convex Optimization lectures.
• Read the Boyd Convex Optimization text.
• Finish reading Reamde.
• Read Rothfuss’ The Name of the Wind.
• Become more proficient at debugging in R.

As I have assuredly not remembered the goals that I have verbally set, this list may change over the next few days.

When I Was 29…

Today was my 29th birthday, and I kept things simple: I ate with my wife and my newborn son at a local eatery. Later, my wife cooked steaks for dinner. For the most part, I took the day off in that I did not work on my dissertation. But I did spend much of the day tinkering with my Octopress installation – more about that later.

The best part of the day was hearing my son laugh again and again. Yes, there were some messy diapers, but those are easily forgotten when my son gets the giggles.

Steve Jobs’ 2005 Stanford Commencement Address

Given that there are almost 13 million views of Steve Jobs’ commencement address, I am certain that I missed this video when it went viral. I am glad that I did not see it until now because I may not have appreciated his words of wisdom. And although there are numerous quotes that I could list, I think my favorites are closely related to a few words of wisdom that my grandmother told me when I was younger.

Don’t settle.

Stay hungry. Stay foolish.

You’ve got to find what you love, and that is as true for work as it is for your lovers. Your work is gonna fill a large part of your life, and the only way to truly be satisfied is to do what you believe is great work, and the only way to do great work is to love what you do.

It’s good to be reminded of things like this once in a while.