ramhiser.com

John Ramey, Ph.D. Candidate in Statistics at Baylor University. High-Dimensional Statistics, Statistical Pattern Recognition and Machine Learning.

Autocorrelation Matrix in R

View Comments

I have been simulating a lot of data lately  with various covariance (correlation) structures, and one that I have been using is the autocorrelation (or autoregressive) structure, where there is a “lag” between variables. The matrix is a v-dimension matrix of the form

$$\begin{bmatrix} 1 & \rho & \rho^2 & \dots & \rho^{v-1}\\ \rho & 1& \ddots & \dots & \rho^{v-2}\\ \vdots & \ddots & \ddots & \ddots & \vdots\\ \rho^{v-2} & \dots & \ddots & \ddots & \rho\\ \rho^{v-1} & \rho^{v-2} & \dots & \rho & 1 \end{bmatrix}$$,

where \(\rho \in [-1, 1]\) is the lag. Notice that the lag decays to 0 as v increases.

My goal was to make the construction of such a matrix simple and easy in R.  The method that I used explored a function I have not used yet in R called “lower.tri” for the lower triangular part of the matrix.  The upper triangular part is referenced with “upper.tri.”

My code is as follows:

autocorr.mat <- function(p = 100, rho = 0.9) {
  mat <- diag(p)
  autocorr <- sapply(seq_len(p), function(i) rho^i)
  mat[lower.tri(mat)] <- autocorr[unlist(sapply(seq.int(p-1, 1), function(i) seq_len(i)))]
  mat[upper.tri(mat)] <- autocorr[unlist(sapply(seq_len(p - 1), function(i) seq.int(i, 1)))]
  return(mat)
}

I really liked it because I feel that it is simple, but then I found Professor Peter Dalgaard’s method, which I have slightly modified. It is far better than mine, easy to understand, and slick. Oh so slick. Here it is:

autocorr.mat <- function(p = 100, rho = 0.9) {
  mat <- diag(p)
  return(rho^abs(row(mat)-col(mat)))
}

Professor Dalgaard’s method puts mine to shame. It is quite obvious how to do it once it is seen, but I certainly wasn’t thinking along those lines.

Principal Component Analysis vs Linear Discriminant Analysis for Dimension Reduction

View Comments

Lately I have been reviewing much of the electrical engineering literature on pattern recognition and machine learning and found this article in IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) that compares Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) in facial recognition.  Published in 2001, it is a bit dated.  However, there are few papers (to my knowledge) with such a specific focus.

Before we discuss the paper further, let’s take a look at a summary of LDA and PCA.

The goal of LDA is to find a linear projection from the feature space (with dimension \(p\)) to a subspace of dimension \(C – 1\), where \(C\) is the number of classes, that maximizes the separability of the classes. It must be noted that LDA is often advertised as a Gaussian parametric model, but Fisher only assumed homoscedastic populations; that is, he assumed that the covariance matrices of each class are equal. We refer to the common covariance matrix as \(\mathbf{\Sigma}\).  However, under the homoscedastic Gaussian assumption, LDA can be found to be the maximum likelihood method. In practice this covariance matrix must be estimated with data because it is unknown; the estimated covariance matrix is often called the pooled sample covariance matrix, \(\mathbf{S}_p\). Of course, when the sample size \(N\) is large relative to the dimension of the feature space (the number of variables) \(p\), this estimation is excellent.  However, when \(p > N\), \(\mathbf{S}_p\) is singular, which causes a problem for the method.  Often the inverse of this estimate is replaced with the Moore-Penrose pseudoinverse or is regularized.   In the modern, high-dimensional case where \(p >> N\), this estimation is terrible.

A good overview of LDA is given here.

Wikipedia defines PCA nicely:

PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

PCA essentially rotates the data (via a linear transformation) so that most of the variability in the data is contained in as few dimensions as possible.  For dimension reduction purposes, the usual practice is to drop the remaining dimensions containing little variability (the dimensions that correspond to the smallest eigenvalues) because they are highly correlated with the remaining dimensions. To borrow from Wikipedia once again,

PCA has the distinction of being the optimal linear transformation for keeping the subspace that has largest variance.

A good overview of PCA can be found here.

The problem that I have with PCA for dimension reduction in the classification context, which the PAMI paper considers, is that it ignores the response, and thus the eigenvectors (and corresponding eigenvalues) are found after considering the features as one data set. In other words, the training data is treated as if it all comes from the same population, which can be especially problematic in the multiclass classification setting. The paper acknowledges this issue:

Of late, there has been a tendency to prefer LDA over PCA because, as intuition would suggest, the former deals directly with discrimination between classes, whereas the latter deals with the data in its entirety for the principal components analysis without paying any particular attention to the underlying class structure.

The paper then makes the claim that

we will show that the switch from PCA to LDA may not always be warranted and may sometimes lead to faulty system design, especially if the size of the learning database is small.

I have no qualms about their claim and their subsequent results.  However, there is no acknowledgement about the poor estimation of \(\mathbf{S}_p\), which leads to poor performance of LDA in the \(p >> n\) case.  There have been many suggestions on how to improve this estimation, and often shrinkage methods significantly improve the estimation of \(\mathbf{\Sigma}\). LDA is not always the best choice either because of the need to pool covariance matrices: if the covariance matrix for each class describe very different shapes, then pooling essentially is a weighted average of the shapes, which may lead to a new shape not representative of any class. (This is similar to the classic independent two-sample t-test, where a pooled sample variance is used.)

It would be interesting to see a follow-up study done with the appropriate regularizations performed with LDA and PCA in the \(p >> N\) case.

As a side note, I find it humorous that these methods are often paired against each other.  Two bitter enemies, R. A. Fisher and Karl Pearson, developed LDA and PCA, respectively.  My favorite quote, which can be found in Agresti’s Categorical Data Analysis (p. 622), within the rivalry is Pearson’s response to a Fisher criticism:

I hold that such a view [Fisher's] is entirely erroneous, and that the writer has done no service to the science of statistics by giving it broad-cast circulation in the pages of the Journal of the Royal Statistical Society. … I trust my critic will pardon me for comparing him with Don Quixote tilting at the windmill; he must either destroy himself, or the whole theory of probable errors…

classify – v. 0.01

View Comments

I have spent a lot of my time lately working on an R package that I am calling “classify.” (To see the latest build on GitHub, check out my GitHub page for classify) Its purpose is to be a mini-framework that acts as a supervised learning classification suite where the performance of classifiers can be evaluated in many simulation and real-world settings.  Because there are so many settings and so many metrics by which to compare classifiers, it makes sense to automate these tasks a little. For example, Professor Jerome Friedman wrote a paper about Regularized Discriminant Analysis (RDA) and discussed six multivariate normal parameter configurations from which he sampled and then compared the classifier performance of RDA with the classic methods Linear Discriminant Analysis and Quadratic Discriminant Analysis. I am creating an R package that only requires the end-user to supply a small plugin, which focuses only on how RDA works and not all of the details surrounding the simulation study.  I can then add more simulation settings and more real data sets in a plug-in fashion and simply tell instruct the RDA plug-in to run against all of those as well.

For a basic study, this process is not difficult, and even if one performs a few of these studies, the act of copying and pasting code from previous studies seems to be fine.  But as this is done more and more, simply copying and pasting is error-prone and leads to loads of bloated, unreadable, unmanageable code. So, when a new classifier is proposed, would it not be incredibly nice to compare that classifier to ALL previous classifiers in ALL previous settings? This tool will be even nicer when various error rate estimators, such as .632, .632+, and crossvalidation, can be computed with no addition of code. Essentially, this is a plugin system that will at minimum increase my own productivity as well as my R experience.

Ideally, I’d like to have an entire database filled with performances of various classifiers in many situations, so that I can recommend to future clients, “Hey, we shouldn’t be using a support vector machine in this situation, even though it may be the new thing.” Yes, one can look in the literature for these results, but with the limitation of journal space, the studies often are limited in scope. An individual journal article may compare 5-ish classifiers in 6 simulation settings as well as against 2-3 real data sets, but it is often difficult to distinguish a clear winner without many further studies.

There are packages in MATLAB that do this, such as PRTools, as well as external software applications, but I prefer R at this point.  Also, I note that I’m not seeking to recreate PRTools, though I do wish to implement a very light version of it. The ideal version of “classify” on which I have written copious notes is dandy, but in the words of Kill Bill, “First, wiggle your big toe.” Back to the grind.

Initial Post

View Comments

This blog serves as a platform to post my current endeavors, most of which will be related to my profession — statistics.  My goal is to increase my experience analyzing data and to improve my knowledge and familiarity about methods that I did not learn in school.  Some methods might have been glossed over in class for sake of time, or I simply was preoccupied.

I do not expect the analyses to be perfect. In fact I am not looking for perfection; I will not sacrifice quality for quantity of analyses necessarily, but I prefer not to waste time on minutiae.  If you see that I have made an error or are willing to suggest an improvement in my analysis or code, please let me know.  I appreciate feedback and constructive criticism, especially because I am new to some of these methods.