John Ramey

Statistics, Machine Learning, and R.

MLB Rankings Using the Bradley-Terry Model

Today, I take my first shot at ranking Major League Baseball (MLB) teams. I see my efforts at prediction and ranking as an ongoing process in which my models improve, the data I incorporate become more meaningful, and ultimately my predictions are largely accurate. For this first attempt, let’s rank MLB teams using the Bradley-Terry (BT) model.

Before we discuss the rankings, we need some data. Let’s scrape ESPN’s MLB Standings Grid for the win-loss record between any two MLB teams for the current season. Perhaps to simplify the tables and to reduce the sparsity resulting from interleague play, ESPN provides only the matchup records within a single league – American or National. Accompanying the matchups, the data include a team’s overall record versus the other league, but we will ignore this for now. The implication is that we can rank teams only within the same league.

Scraping ESPN with a Python Script

In the following Python script, the BeautifulSoup library is used to scrape ESPN’s site for a given year. The script identifies each team in the American League table, their opponents, and their records against each opponent. The results are written to a CSV file to analyze in R. The code is for the American League only, but it is straightforward to modify it to gather the National League data. Below, I use only the data for 2013 and ignore the previous seasons. In a future post, though, I will incorporate those data.

Here’s the Python code. Feel free to fork it.

Bradley-Terry Model

The BT model is a simple approach to modeling pairwise competitions, such as sporting events, that do not result in ties. It is well suited to the ESPN data above, where we know only the win-loss records between any two teams. (If curious, ties can be handled with modifications.)

Suppose that teams $i$ and $j$ play each other, and we wish to know the probability $\pi_{ij}$ that team $i$ will beat team $j$. Then, with the BT model we define

\[ \text{logit}(\pi_{ij}) = \lambda_i - \lambda_j, \]

where $\lambda_i$ and $\lambda_j$ denote the abilities of teams $i$ and $j$, respectively. Besides calculating the probability of one team beating another, the team abilities provide a natural mechanism for ranking teams. That is, if $\lambda_i > \lambda_j$, we say that team $i$ is ranked superior to team $j$, providing an ordering on the teams within a league.

Perhaps naively, we assume that all games are independent. This assumption makes it straightforward to write the likelihood, which is essentially the product of Bernoulli likelihoods representing each team matchup. To estimate the team abilities, we use the BradleyTerry2 R package. The package vignette provides an excellent overview of the Bradley-Terry model as well as various approaches to incorporating covariates (e.g., home-field advantage) and random effects, some of which I will consider in the future. One thing to note is that the ability of the first team appearing in the results data frame is used as a reference and is set to 0.
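To make the likelihood concrete (the notation here is mine, not necessarily the vignette’s): if team $i$ beats team $j$ in $w_{ij}$ of their $n_{ij}$ games, then independence across games gives

\[ L(\boldsymbol{\lambda}) = \prod_{i < j} \binom{n_{ij}}{w_{ij}} \, \pi_{ij}^{w_{ij}} \, (1 - \pi_{ij})^{n_{ij} - w_{ij}}, \qquad \pi_{ij} = \frac{\exp(\lambda_i - \lambda_j)}{1 + \exp(\lambda_i - \lambda_j)}, \]

where the product runs over all pairs of teams in a league and the likelihood is maximized over the abilities, with the reference team's ability fixed at 0.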

I have placed all of the R code used for the analysis below in bradley-terry.r in this GitHub repository. Note that I use the ProjectTemplate package to organize the analysis and to minimize boilerplate code.

After scraping the matchup records from ESPN, the following R code prettifies the data and then fits the BT model to both data sets.

# Cleans the American League (AL) and National League (NL) data scraped from
# ESPN's MLB Grid
AL_cleaned <- clean_ESPN_grid_data(AL.standings, league = "AL")
NL_cleaned <- clean_ESPN_grid_data(NL.standings, league = "NL")

# Fits the Bradley-Terry models for both leagues
set.seed(42)
AL_model <- BTm(cbind(Wins, Losses), Team, Opponent, ~team_, id = "team_", data = AL_cleaned$standings)
NL_model <- BTm(cbind(Wins, Losses), Team, Opponent, ~team_, id = "team_", data = NL_cleaned$standings)

# Extracts team abilities for each league
AL_abilities <- data.frame(BTabilities(AL_model))$ability
names(AL_abilities) <- AL_cleaned$teams

NL_abilities <- data.frame(BTabilities(NL_model))$ability
names(NL_abilities) <- NL_cleaned$teams

Next, we create a heatmap of winning probabilities for each matchup by first creating a grid of the probabilities. Given that the inverse logit of 0 is 0.5, the probability that a team beats itself is estimated as 0.5. To avoid this confusing situation, we set these probabilities to 0. The point is that these events can never happen unless you play for Houston or have A-Rod on your team.
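The prob_BT helper used below lives in the repository’s code; a minimal sketch of what it computes, based on the BT model above, is:

# Sketch of the prob_BT helper (the actual definition is in the GitHub
# repository): the BT probability that a team with ability 'ability1' beats
# a team with ability 'ability2' is the inverse logit of their difference.
prob_BT <- function(ability1, ability2) {
  plogis(ability1 - ability2)
}

Because plogis() is vectorized, outer() can apply it to every pair of team abilities at once.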

AL_probs <- outer(AL_abilities, AL_abilities, prob_BT)
diag(AL_probs) <- 0
AL_probs <- melt(AL_probs)

NL_probs <- outer(NL_abilities, NL_abilities, prob_BT)
diag(NL_probs) <- 0
NL_probs <- melt(NL_probs)

colnames(AL_probs) <- colnames(NL_probs) <- c("Team", "Opponent", "Probability")

Now that the rankings and matchup probabilities have been computed, let’s take a look at the results for each league.

American League Results

The BT model provides a natural way of ranking teams based on the team-ability estimates. Let’s first look at the estimates.

[Plot: AL team ability estimates]

[Table of AL team ability estimates and standard errors]

(Please excuse the crude tabular output. I’m not a fan of how Octopress renders tables. Suggestions?)

The plot and the table give two representations of the same information. In both cases we can see that the team abilities are standardized so that Baltimore has an ability of 0. We also see that Tampa Bay is considered the top AL team with Boston being a close second. Notice though that the standard errors here are large enough that we might question the rankings by team ability. For now, we will ignore the standard errors, but this uncertainty should be taken into account for predicting future games.

The Astros stand out as the worst team in the AL. Although the graph seems to indicate that Houston is far worse than any other AL team, the abilities are not straightforward to interpret directly. Rather, using the inverse logit function, we can compare any two teams more directly by calculating the probability that one team will beat the other.

A quick way to compare any two teams is with a heatmap. Notice how Houston’s probability of beating another AL team is less than 50%. The best team, Tampa Bay, has more than a 50% chance of beating any other AL team.
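The heatmap itself can be produced directly from the melted AL_probs data frame. Here is a minimal ggplot2 sketch (my own, not necessarily the code used to produce the figure below):

# Minimal sketch of the AL matchup heatmap built from the melted probabilities
library(ggplot2)
ggplot(AL_probs, aes(x = Opponent, y = Team, fill = Probability)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "steelblue") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))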

[Plot: heatmap of AL matchup probabilities]

While the heatmap is useful for comparing any two teams at a glance, bar graphs provide a more precise representation of who will win. Here are the probabilities that the best and worst teams in the AL will beat any other AL team. A horizontal red threshold is drawn at 50%.
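The top-team bar graph can be built by subsetting the melted probabilities, much like the Atlanta example later in the post. Here is a sketch, assuming ESPN’s "TB" abbreviation for Tampa Bay (not necessarily the original plotting code):

# Sketch of the bar graph for the top AL team; "TB" is assumed to be the
# abbreviation used in the scraped data
library(ggplot2)
TB_probs <- subset(AL_probs, Team == "TB" & Opponent != "TB")
ggplot(TB_probs, aes(x = Opponent, y = Probability)) +
  geom_bar(stat = "identity") +
  geom_hline(yintercept = 0.5, color = "red")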

[Plot: win probabilities for the top AL team]

[Plot: win probabilities for the bottom AL team]

An important thing to notice here is that Tampa Bay is not unbeatable: according to the BT model, the Astros have a shot at winning against any other AL team.

[Plot: win probabilities for a middle-ranked AL team]

I have also found that a useful gauge is the probability that an average team will beat any other team. For instance, Cleveland is ranked in the middle according to the BT model. Notice that half of the teams have a greater than 50% chance of beating the Indians, while the Indians have a better than 50% chance of beating the remaining teams. The Indians also have a very good chance of beating the Astros.

National League Results

Here, we repeat the same analysis for the National League.

[Plot: NL team ability estimates]

## |     | ability | s.e.  |
## |-----+---------+-------|
## | ARI | 0.000   | 0.000 |
## | ATL | 0.461   | 0.267 |
## | CHC | -0.419  | 0.264 |
## | CIN | 0.267   | 0.261 |
## | COL | 0.015   | 0.250 |
## | LAD | 0.324   | 0.255 |
## | MIA | -0.495  | 0.265 |
## | MIL | -0.126  | 0.260 |
## | NYM | -0.236  | 0.262 |
## | PHI | -0.089  | 0.261 |
## | PIT | 0.268   | 0.262 |
## | SD  | -0.176  | 0.251 |
## | SF  | -0.100  | 0.251 |
## | STL | 0.389   | 0.262 |
## | WSH | -0.013  | 0.265 |

For the National League, Arizona is the reference team with an ability of 0. The Braves are ranked as the top team, and the Marlins are the worst. At first glance, the differences in team abilities between two consecutively ranked NL teams are less extreme than in the American League. However, it is unwise to interpret the abilities in this way. As with the American League, we largely ignore the standard errors, although it is interesting to note that the top and bottom NL team abilities have more separation between them when the standard error is taken into account.

As before, let’s look at the matchup probabilities.

[Plot: heatmap of NL matchup probabilities]

From the heatmap we can see that the Braves have at least a 72% chance of beating the Marlins, according to the BT model. All other winning probabilities are less than 72%, giving teams like the Marlins, Cubs, and Mets a shot at winning.
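As a quick arithmetic check of that 72% figure, plug the ATL and MIA ability estimates from the table above into the inverse logit:

# Braves vs. Marlins, using the NL ability estimates (ATL = 0.461, MIA = -0.495)
plogis(0.461 - (-0.495))
## approximately 0.722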

Again, we plot the probabilities for the best and the worst teams along with an average team.

[Plot: win probabilities for the top NL team]

ATL_probs <- subset(NL_probs, Team == "ATL" & Opponent != "ATL")
prob_ATL_SF <- subset(ATL_probs, Opponent == "SF")$Probability
series_probs <- data.frame(Wins = 0:3, Probability = dbinom(0:3, 3, prob_ATL_SF))
print(ascii(series_probs, include.rownames = FALSE, digits = 3), type = "org")
## | Wins  | Probability |
## |-------+-------------|
## | 0.000 | 0.048       |
## | 1.000 | 0.252       |
## | 2.000 | 0.442       |
## | 3.000 | 0.258       |

I find it very interesting that the probability Atlanta beats any other NL team is usually around 2/3. This makes sense in a lot of ways. For instance, if Atlanta has a three-game series with the Giants, odds are good that Atlanta will win 2 of the 3 games. Moreover, as we can see in the table above, there is less than a 5% chance that the Giants will sweep Atlanta.

[Plot: win probabilities for the bottom NL team]

The BT model indicates that the Miami Marlins are the worst team in the National League. Despite their poor performance this season, the Marlins have a legitimate chance of beating any NL team other than the Braves and the Cardinals. This is especially the case against the other bottom NL teams, such as the Cubs and the Mets.

[Plot: win probabilities for a middle-ranked NL team]

What’s Next?

The above post ranked the teams within the American and National leagues separately for the current season, but similar data are also available on ESPN going back to 2002. With this in mind, obvious extensions are:

  • Rank the leagues together after scraping the interleague play matchups.

  • Examine how ranks change over time.

  • Include previous matchup records as prior information for later seasons.

  • Predict future games. Standard errors should not be ignored here.

  • Add covariates (e.g., home-field advantage) to the BT model.

A Brief Look at Mixture Discriminant Analysis

Lately, I have been working with finite mixture models for my postdoctoral work on data-driven automated gating. Given that I had barely scratched the surface with mixture models in the classroom, I am becoming increasingly comfortable with them. With this in mind, I wanted to explore their application to classification because there are times when a single class is clearly made up of multiple subclasses that are not necessarily adjacent.

As far as I am aware, there are two main approaches (there are lots and lots of variants!) to applying finite mixture models to classification:

  1. The Fraley and Raftery approach via the mclust R package

  2. The Hastie and Tibshirani approach via the mda R package

Although the methods are similar, I opted for exploring the latter method. Here is the general idea. There are $K$ classes, and each class is assumed to be a Gaussian mixture of subclasses. Hence, the model formulation is generative, and the posterior probability of class membership is used to classify an unlabeled observation. Each subclass is assumed to have its own mean vector, but all subclasses share the same covariance matrix for model parsimony. The model parameters are estimated via the EM algorithm.
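In symbols (my notation here, which may differ slightly from the paper’s), the density of class $k$ is a mixture of $R_k$ Gaussian subclasses that share a common covariance matrix $\Sigma$:

\[ f_k(x) = \sum_{r=1}^{R_k} \pi_{kr} \, \phi(x; \mu_{kr}, \Sigma), \qquad \sum_{r=1}^{R_k} \pi_{kr} = 1, \]

and an unlabeled observation $x$ is assigned to the class that maximizes the posterior probability $P(Y = k \mid x) \propto p_k f_k(x)$, where $p_k$ is the prior probability of class $k$.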

Because the details of the likelihood in the paper are brief, I was a bit confused about how to write the likelihood in order to determine how much each observation contributes to estimating the common covariance matrix in the M-step of the EM algorithm. Had each subclass had its own covariance matrix, the likelihood would simply have been the product of the individual class likelihoods and would have been straightforward. The source of my confusion was how to write the complete-data likelihood when the classes share parameters.

I decided to write up a document that explicitly defined the likelihood and provided the details of the EM algorithm used to estimate the model parameters. The document is available here along with the LaTeX and R code. If you are inclined to read the document, please let me know if any notation is confusing or poorly defined. Note that I did not include the additional topics on reduced-rank discrimination and shrinkage.

To see how well the mixture discriminant analysis (MDA) model worked, I constructed a simple toy example consisting of 3 bivariate classes each having 3 subclasses. The subclasses were placed so that within a class, no subclass is adjacent. The result is that no class is Gaussian. I was interested in seeing if the MDA classifier could identify the subclasses and also comparing its decision boundaries with those of linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). I used the implementation of the LDA and QDA classifiers in the MASS package. From the scatterplots and decision boundaries given below, the LDA and QDA classifiers yielded puzzling decision boundaries as expected. Contrarily, we can see that the MDA classifier does a good job of identifying the subclasses. It is important to note that all subclasses in this example have the same covariance matrix, which caters to the assumption employed in the MDA classifier. It would be interesting to see how sensitive the classifier is to deviations from this assumption. Moreover, perhaps a more important investigation would be to determine how well the MDA classifier performs as the feature dimension increases relative to the sample size.

LDA Decision Boundaries

QDA Decision Boundaries

MDA Decision Boundaries

Comparison of LDA, QDA, and MDA
library(MASS)
library(mvtnorm)
library(mda)
library(ggplot2)

set.seed(42)
n <- 500

# Randomly sample data
x11 <- rmvnorm(n = n, mean = c(-4, -4))
x12 <- rmvnorm(n = n, mean = c(0, 4))
x13 <- rmvnorm(n = n, mean = c(4, -4))

x21 <- rmvnorm(n = n, mean = c(-4, 4))
x22 <- rmvnorm(n = n, mean = c(4, 4))
x23 <- rmvnorm(n = n, mean = c(0, 0))

x31 <- rmvnorm(n = n, mean = c(-4, 0))
x32 <- rmvnorm(n = n, mean = c(0, -4))
x33 <- rmvnorm(n = n, mean = c(4, 0))

x <- rbind(x11, x12, x13, x21, x22, x23, x31, x32, x33)
train_data <- data.frame(x, y = gl(3, 3 * n))

# Trains classifiers
lda_out <- lda(y ~ ., data = train_data)
qda_out <- qda(y ~ ., data = train_data)
mda_out <- mda(y ~ ., data = train_data)

# Generates test data that will be used to generate the decision boundaries via
# contours
contour_data <- expand.grid(X1 = seq(-8, 8, length = 300),
                            X2 = seq(-8, 8, length = 300))

# Classifies the test data
lda_predict <- data.frame(contour_data,
                          y = as.numeric(predict(lda_out, contour_data)$class))
qda_predict <- data.frame(contour_data,
                          y = as.numeric(predict(qda_out, contour_data)$class))
mda_predict <- data.frame(contour_data,
                          y = as.numeric(predict(mda_out, contour_data)))

# Generates plots
p <- ggplot(train_data, aes(x = X1, y = X2, color = y)) + geom_point()
p + stat_contour(aes(x = X1, y = X2, z = y), data = lda_predict) +
  ggtitle("LDA Decision Boundaries")
p + stat_contour(aes(x = X1, y = X2, z = y), data = qda_predict) +
  ggtitle("QDA Decision Boundaries")
p + stat_contour(aes(x = X1, y = X2, z = y), data = mda_predict) +
  ggtitle("MDA Decision Boundaries")

High-Dimensional Microarray Data Sets in R for Machine Learning

Much of my research in machine learning is aimed at small-sample, high-dimensional bioinformatics data sets. For instance, here is a paper of mine on the topic.

A large number of papers proposing new machine-learning methods that target high-dimensional data use the same two data sets and consider few others. These data sets are 1) the Alon colon cancer data set and 2) the Golub leukemia data set. Both of the corresponding papers were published in 1999, which indicates that the methods are not keeping up with data-collection technology. Furthermore, the Golub data set is not useful as a benchmark because it is so well separated that most methods achieve nearly perfect classification on it.

My goal has been to find several alternative data sets and provide them in a convenient location so that I could load and analyze them easily and then incorporate the results into my papers. Initially, I aimed to identify a few more data sets, but after I got going on this effort, I found a lot more. What started as a small project turned into something that has saved me a lot of time. I have created the datamicroarray package available from my GitHub account. For each data set included in the package, I have provided a script to download, clean, and save the data set as a named list. See the README file for more details about how the data are stored.
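As a rough usage sketch (the data set name and the exact list components here are assumptions on my part; the README and wiki document the definitive interface):

# Hypothetical example: load one of the included data sets by name via R's
# standard data() mechanism and inspect the named list it provides
library(datamicroarray)
data('alon', package = 'datamicroarray')
str(alon)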

Currently, the package consists of 20 small-sample, high-dimensional data sets to assess machine learning algorithms and models. I have also included a wiki on the package’s GitHub repository that describes each data set and provides additional information, including a link to the original papers.

The biggest drawback at the moment is the file size of the R package because I store an RData file for each data set. I am investigating alternative approaches to download the data dynamically and am open to suggestions. Also note that the data descriptions are incomplete, so assistance is appreciated.

Feel free to use any of the data sets. As a disclaimer, you should ensure that the data are processed correctly before analyzing and incorporating the results into your own work.

How to Download Kaggle Data With Python and requests.py

Recently I started playing with Kaggle. I quickly became frustrated that in order to download their data I had to use their website. I prefer instead the option to download the data programmatically. After some Googling, the best recommendation I found was to use lynx. My friend Anthony recommended that alternatively I should write a Python script.

Although Python is not my primary language, I was intrigued by how simple it was to write the script using requests.py. In this example, I download the training data set from Kaggle’s Digit Recognizer competition.

The idea is simple:

  1. Attempt to download a file from Kaggle but get blocked because you are not logged in.
  2. Login with requests.py.
  3. Download the data.

Here’s the code:

Simply change my_username and my_password to your Kaggle login info. Feel free to optimize the chunk size to your liking.

Setting Up the Development Version of R

My coworkers at Fred Hutchinson regularly use the development version of R (i.e., R-devel) and have urged me to do the same. This post details how I have set up the development version of R on our Linux server, which I use remotely because it is much faster than my Mac.

First, I downloaded the R-devel source via Subversion into ~/local/, which is short for /home/jramey/local/, then configured my installation and compiled the source. I recommend these Subversion tips if you are building from source. Here are the commands to install R-devel.

svn co https://svn.r-project.org/R/trunk ~/local/R-devel
cd ~/local/R-devel
./tools/rsync-recommended
./configure --prefix=/home/jramey/local/
make
make install

The third command downloads the recommended R packages and is crucial because the source for the recommended R packages is not included in the SVN repository. For more about this, go here.

We have the release version (currently, it is 2.15.1) installed in /usr/local/bin. But the goal here is to give priority to R-devel. So, I add the following to my ~/.bashrc file:

PATH=~/local/bin:$PATH
export PATH

# Never save or restore when running R
alias R='R --no-save --no-restore-data --quiet'

Notice that the last line added to my ~/.bashrc file aliases R so that R-devel loads quietly without saving or restoring the workspace.

Next, I install the R packages that I use the most.

install.packages(c('devtools', 'ProjectTemplate', 'knitr', 'ggplot2', 'reshape2',
                   'plyr', 'Rcpp', 'mvtnorm', 'caret'), dep = TRUE)

Then, I update my .Rprofile file, which I keep in a Github gist.

Finally, my coworkers focus on flow cytometry data, and our group maintains several Bioconductor packages related to this type of data. To install the majority of them, we simply install the flowWorkspace package in R:

source("http://bioconductor.org/biocLite.R")
biocLite("flowWorkspace")

Chapter 2 Solutions - Statistical Methods in Bioinformatics

As I have mentioned previously, I have begun reading Statistical Methods in Bioinformatics by Ewens and Grant and working selected problems for each chapter. In this post, I will give my solution to two problems. The first problem is pretty straightforward.

Problem 2.20

Suppose that a parent of genetic type Mm has three children. Then the parent transmits the M gene to each child with probability 1/2, and the genes that are transmitted to each of the three children are independent. Let $X_1 = 1$ if children 1 and 2 had the same gene transmitted and $X_1 = 0$ otherwise. Similarly, let $X_2 = 1$ if children 1 and 3 had the same gene transmitted and $X_2 = 0$ otherwise, and let $X_3 = 1$ if children 2 and 3 had the same gene transmitted and $X_3 = 0$ otherwise.

The question first asks us to show that the three random variables are pairwise independent but not independent. The pairwise independence comes directly from the bolded phrase in the problem statement. Now, to show that the three random variables are not independent, denote by $p_i$ the probability that $X_i = 1$, for $i = 1, 2, 3$; each $p_i = 1/2$. If we had independence, then the following statement would be true:

\[ P(X_1 = 1, X_2 = 1, X_3 = 0) = P(X_1 = 1) \, P(X_2 = 1) \, P(X_3 = 0) = \frac{1}{8}. \]

However, notice that the event on the lefthand side can never happen: if $X_1 = 1$ and $X_2 = 1$, then $X_3$ must be 1. Hence, the lefthand side must equal 0, while the righthand side equals 1/8. Therefore, the three random variables are not independent.

The question also asks us to discuss why the variance of $X_1 + X_2 + X_3$ is equal to the sum of the individual variances. Often, this is the case only when the random variables are independent. But because the random variables here are pairwise independent, the covariances must all be 0, so the equality holds.
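Explicitly, for three random variables,

\[ \text{var}(X_1 + X_2 + X_3) = \sum_{i=1}^{3} \text{var}(X_i) + 2 \sum_{i < j} \text{cov}(X_i, X_j), \]

and pairwise independence forces each covariance term to be 0, leaving only the sum of the individual variances.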

Problems 2.23 - 2.27

While I worked the above problem because of its emphasis on genetics, the following set of problems is much more fun in terms of the mathematics because of its usage of approximations.

For $i = 1, \ldots, n$, let $X_i$ be the $i$th lifetime of certain cellular proteins until degradation. We assume that $X_1, \ldots, X_n$ are iid random variables, each of which is exponentially distributed with rate parameter $\lambda$. Furthermore, let $n$ be an odd integer.

This set of questions is concerned with the mean and variance of the sample median, $X_{((n+1)/2)}$, where $X_{(k)}$ denotes the $k$th order statistic. First, note that the mean and variance of the minimum value, $X_{(1)}$, are $1/(n\lambda)$ and $1/(n\lambda)^2$, respectively, because the minimum of $n$ iid exponentials is itself exponentially distributed with rate $n\lambda$. From the memoryless property of the exponential distribution, the time until the next protein degrades is independent of the previous one. However, there are now $n - 1$ proteins remaining. Thus, the mean and variance of $X_{(2)} - X_{(1)}$ are $1/[(n-1)\lambda]$ and $1/[(n-1)\lambda]^2$, respectively. Continuing in this manner, we have

\[ E\left[X_{((n+1)/2)}\right] = \frac{1}{\lambda} \sum_{j=(n+1)/2}^{n} \frac{1}{j} \]

and

\[ \text{var}\left[X_{((n+1)/2)}\right] = \frac{1}{\lambda^2} \sum_{j=(n+1)/2}^{n} \frac{1}{j^2}. \]

Approximation of $E[X_{((n+1)/2)}]$

Now, we wish to approximate the mean with a much simpler formula. First, from (B.7) in Appendix B, we have

\[ \sum_{j=1}^{n} \frac{1}{j} \approx \log n + \gamma, \]

where $\gamma$ is Euler’s constant. Then, we can write the expected sample median as

\[ E\left[X_{((n+1)/2)}\right] = \frac{1}{\lambda}\left(\sum_{j=1}^{n} \frac{1}{j} - \sum_{j=1}^{(n-1)/2} \frac{1}{j}\right) \approx \frac{1}{\lambda}\left[\log n - \log\left(\frac{n-1}{2}\right)\right] = \frac{1}{\lambda}\log\left(\frac{2n}{n-1}\right). \]

Hence, as $n \to \infty$, this approximation goes to $(\log 2)/\lambda$, which is the median of an exponentially distributed random variable with rate $\lambda$. Specifically, the median $m$ is the solution to $F_X(m) = 1/2$, where $F_X$ denotes the cumulative distribution function of the random variable $X$.
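As a quick numerical sanity check of this approximation (my own check, using the formulas above with $\lambda = 1$ and $n = 11$):

# Exact mean of the sample median vs. the log(2n / (n - 1)) approximation,
# with lambda = 1 and n = 11 (odd)
n <- 11
exact_mean <- sum(1 / (((n + 1) / 2):n))  # 1/6 + 1/7 + ... + 1/11
approx_mean <- log(2 * n / (n - 1))
c(exact = exact_mean, approx = approx_mean)
## exact is about 0.737; approx is about 0.789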

Improved Approximation of $E[X_{((n+1)/2)}]$

It turns out that we can improve this approximation with the following two results:

\[ \sum_{j=1}^{n} \frac{1}{j} \approx \log n + \gamma + \frac{1}{2n} \qquad \text{and} \qquad \sum_{j=1}^{(n-1)/2} \frac{1}{j} \approx \log\left(\frac{n-1}{2}\right) + \gamma + \frac{1}{n-1}. \]

Following the derivation of our above approximation, we have that

\[ E\left[X_{((n+1)/2)}\right] \approx \frac{1}{\lambda}\left[\log\left(\frac{2n}{n-1}\right) + \frac{1}{2n} - \frac{1}{n-1}\right]. \]

Approximation of $\text{var}[X_{((n+1)/2)}]$

We can also approximate $\text{var}[X_{((n+1)/2)}]$ using the approximation

\[ \sum_{j=1}^{n} \frac{1}{j^2} \approx \frac{\pi^2}{6} - \frac{1}{n}. \]

With this approximation applied at $n$ and at $(n-1)/2$, we have

\[ \text{var}\left[X_{((n+1)/2)}\right] = \frac{1}{\lambda^2}\left(\sum_{j=1}^{n} \frac{1}{j^2} - \sum_{j=1}^{(n-1)/2} \frac{1}{j^2}\right) \approx \frac{1}{\lambda^2}\left(\frac{2}{n-1} - \frac{1}{n}\right). \]

Textbook - Statistical Methods in Bioinformatics

As part of my effort to acquaint myself more with biology, bioinformatics, and statistical genetics, I am trying to find as many resources as I can that provide a solid foundation. For instance, I am wading through Molecular Biology of the Cell at a pace of about 10-15 pages per day – this takes nearly an hour every day.

I am also going through Statistical Methods in Bioinformatics by Ewens and Grant and working selected problems for each chapter. My intention is to post my solutions to these chapter exercises. Thus far, I have made it through the first three chapters, and I will begin posting my solutions soon. I am interested particularly in problems regarding statistical topics with which I have little-to-no experience and also topics where I lack intuition regarding the biological applications.

Here is a thumbnail of the book:

Statistical Methods in Bioinformatics Textbook

Now That We Live in Seattle

It has been just a few weeks since my wife, my son, and I moved to Seattle so that I could begin my postdoc at The Hutch. Now that we have been here a short time and are settled, we intend to start exploring Seattle, doing typical touristy things as well as non-touristy activities that only Seattlites would do. My wife purchased a detailed guide to Seattle that lists numerous activities, restaurants, scenery, etc. that would take months (years?) to complete. I prefer word of mouth though, so I asked some coworkers for recommendations. Here’s what they gave me:

My coworkers recommended parking at The Hutch and riding The Slut to Westlake Center and then walking to Pike Place – this avoids traffic. Also, they recommended that we combine Discovery Park and Ballard into one day. They said the view of Seattle from the ferry when returning from Bainbridge Island is breathtaking.

My wife and I like to hike, so I also asked my coworkers for recommendations. Overwhelmingly, they said these two places:

Any other recommendations? In particular, are there any restaurants that you’d recommend?

And Now I Blog Again

One of my goals for 2012 has been to blog more. Much more. When I first set this goal, I had great aspirations of posting frequently. However, I had a Ph.D. to complete, and quite frankly, it demanded much higher priority. Now that I have submitted my dissertation and completed my Ph.D. requirements, I have several half-finished posts that will appear soon. Also, since I have made the switch to Octopress, I will be relocating selected posts from my previous Wordpress blog.

Goals for 2012

I have never been one to set New Year’s resolutions. Personally, they instill a dangerous personal freedom that often yields naive, subconscious mentalities, such as “I can do anything I want until December 31, and I will change abruptly the next day.” However, my Ph.D. adviser has shown me the importance of setting goals in all things that I wish to accomplish, as well as envisioning the finale to an arduous journey like a small child (read: “John Ramey”) who pictures the waning warmth of fresh chocolate chip cookies smeared on his face. My adviser has always encouraged short-term and long-term goals but never required them. As I recently found out, my employer does. In addition, these goals are reviewed at the end of the fiscal year so that employees are realistic and held accountable.

Now that I must list these formally at work, I have decided to post a number of career and personal goals here in order to hold myself accountable at the end of 2012 with the implicit assumption that the world does not end. So, one year from now, if we are still here (chuckle), I will review my goal-completion success.

  • Read to my son (almost) nightly.
  • Hear my son laugh (almost) nightly.
  • Take my wife out for a date night each week.
  • Treat my wife to a significant outing each month.
  • Finish dissertation.
  • Successfully defend dissertation.
  • Submit 4-6 articles for publication.
  • Submit at least 3 R packages to CRAN.
  • Attend at least three conferences.
  • Make at least two conference presentations.
  • Find/maintain employment.
  • Construct detailed plan and outline for my textbook.
  • Take a real vacation.
  • Run a half marathon. (I will consider this a success if I have signed up for an early 2013 race.)
  • Transition my personal website from Wordpress to Octopress.
  • Blog more.
  • Check Tweets and email twice daily at a scheduled time.
  • Spend time with my extended family.
  • Read the literature at a scheduled time.
  • Finish reading Izenman’s Multivariate text.
  • Read Lehmann’s Reminiscences of a Statistician: The Company I Kept.
  • Read Ewens and Grant’s bioinformatics text.
  • Read Bishop’s PRML text.
  • Read Gaussian Processes for Machine Learning.
  • Read Barber’s Bayesian Reasoning and Machine Learning.
  • Read a significant portion of Devroye et al.’s Probabilistic Theory of Pattern Recognition text.
  • Reread Robert’s The Bayesian Choice.
  • Read Jaynes’ Probability Theory text.
  • Read Berger’s Decision Theory text.
  • Watch Boyd’s Convex Optimization lectures.
  • Read the Boyd Convex Optimization text.
  • Finish reading Reamde.
  • Read Rothfuss’ The Name of the Wind.
  • Become more proficient at debugging in R.

As I have assuredly not remembered the goals that I have verbally set, this list may change over the next few days.