Much of my recent postdoctoral work on data-driven automated gating has involved finite mixture models. I had only a cursory introduction to mixture models as a student, but after working with them extensively I have become fairly comfortable with them. That got me interested in how they are applied to classification, particularly when a single class is made up of multiple non-adjacent subclasses.
As far as I know, there are two main approaches (each with a number of variants) to applying finite mixture models for classification; a minimal call to each is sketched after the list:
- The Fraley and Raftery approach, implemented in the mclust R package.
- The Hastie and Tibshirani approach, implemented in the mda R package.
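For concreteness, here is a minimal sketch of each package's entry point. The data frame `train` and its columns `x1`, `x2`, and `y` are hypothetical placeholders, and the number of subclasses is just an example.

```r
# Minimal calls to the two implementations, assuming a data frame `train`
# with numeric features x1, x2 and a factor of class labels y (hypothetical).
library(mclust)  # Fraley and Raftery approach: MclustDA()
library(mda)     # Hastie and Tibshirani approach: mda()

fit_mclustda <- MclustDA(data = train[, c("x1", "x2")], class = train$y)
fit_mda      <- mda(y ~ x1 + x2, data = train, subclasses = 3)
```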
The two methods are similar, but I opted to explore the latter. The gist is this: there are K ≥ 2 classes, and each class is assumed to be a Gaussian mixture of subclasses. Because the model is generative, an unlabeled observation is classified via its posterior probability of class membership. Each subclass has its own mean vector, but all subclasses share a common covariance matrix to keep the model parsimonious. The model parameters are estimated with the Expectation-Maximization (EM) algorithm.
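In symbols, and in notation of my own choosing (the subscripts below are mine, not necessarily those used in the literature or in my write-up), class $k$ has $R_k$ subclasses with mixing proportions $\pi_{kr}$, subclass means $\mu_{kr}$, and a covariance matrix $\Sigma$ common to all subclasses, so the class-conditional density is

$$f_k(x) = \sum_{r=1}^{R_k} \pi_{kr}\, \phi(x;\, \mu_{kr},\, \Sigma), \qquad \sum_{r=1}^{R_k} \pi_{kr} = 1,$$

where $\phi(\cdot;\, \mu, \Sigma)$ denotes the multivariate normal density. An unlabeled observation $x$ is then assigned to the class maximizing the posterior probability

$$\widehat{k}(x) = \arg\max_{k}\; \frac{p_k\, f_k(x)}{\sum_{j=1}^{K} p_j\, f_j(x)},$$

with $p_k$ the prior probability of class $k$.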
While reading through the likelihood in the associated literature, I got confused about how to write it down so that each observation's contribution to the estimate of the common covariance matrix in the EM algorithm's M-step is explicit. If each subclass had its own covariance matrix, the likelihood would be straightforward: a simple product of the individual class likelihoods. My confusion came from formulating the complete-data likelihood when the classes share parameters.
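Roughly, and again in notation of my own choosing rather than a quote from my write-up, introducing a latent indicator $z_{ikr}$ that observation $i$ from class $k$ belongs to subclass $r$ gives the complete-data log-likelihood

$$\ell_c = \sum_{k=1}^{K} \sum_{i:\, y_i = k} \sum_{r=1}^{R_k} z_{ikr}\left[\log \pi_{kr} + \log \phi(x_i;\, \mu_{kr},\, \Sigma)\right].$$

The E-step replaces $z_{ikr}$ with its expected value $\hat{z}_{ikr} \propto \pi_{kr}\,\phi(x_i;\, \mu_{kr},\, \Sigma)$, normalized over the subclasses of class $k$, and the M-step update for the shared covariance matrix pools over all classes and subclasses,

$$\widehat{\Sigma} = \frac{1}{N} \sum_{k=1}^{K} \sum_{i:\, y_i = k} \sum_{r=1}^{R_k} \hat{z}_{ikr}\, \big(x_i - \widehat{\mu}_{kr}\big)\big(x_i - \widehat{\mu}_{kr}\big)^{\top},$$

where $N$ is the total number of observations. This pooled sum is precisely the "contribution of each observation" that had tripped me up.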
To sort this out, I wrote the likelihood out explicitly and worked through the details of the EM algorithm used to estimate the model parameters. The document, along with its LaTeX and R code, is available via the provided link. If you read it, I welcome any feedback on notation that is confusing or poorly defined. Note that I have omitted the additional topics of reduced-rank discrimination and shrinkage.
To see how well the mixture discriminant analysis (MDA) classifier performs, I constructed a simple toy example with three bivariate classes, each consisting of three subclasses. The subclasses were placed so that those within each class are not adjacent to one another, making each class distribution distinctly non-Gaussian. My goal was to see whether the MDA classifier could identify the subclasses and to compare its decision boundaries with those of linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) from the MASS package. From the scatterplots and decision boundaries below, the LDA and QDA classifiers yield the puzzling decision boundaries one would expect, while the MDA classifier identifies the subclasses well. Note that in this example all of the subclasses share the same covariance matrix, so the MDA classifier's assumption holds. It would be interesting to examine how sensitive the classifier is to violations of this assumption, and how it behaves when the feature dimension exceeds the sample size.
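For reference, here is a minimal sketch of the kind of simulation and comparison described above. The subclass means, sample sizes, and prediction grid are illustrative choices of mine rather than the exact configuration behind the plots; the relevant entry points are `mda()`, `lda()`, and `qda()`.

```r
library(MASS)  # lda(), qda(), mvrnorm()
library(mda)   # mda()

set.seed(42)

# Hypothetical layout: 3 classes, each a mixture of 3 well-separated bivariate
# Gaussian subclasses, all sharing a common covariance matrix.
sigma <- 0.5 * diag(2)
subclass_means <- list(
  class1 = list(c(0, 0),  c(5, 5),  c(10, 0)),
  class2 = list(c(5, 0),  c(0, 5),  c(10, 5)),
  class3 = list(c(0, 10), c(5, -5), c(10, 10))
)

# Simulate n_per observations from each subclass of a given class.
sim_class <- function(means, label, n_per = 50) {
  x <- do.call(rbind, lapply(means, function(mu) mvrnorm(n_per, mu, sigma)))
  data.frame(x1 = x[, 1], x2 = x[, 2], y = label)
}
train <- do.call(rbind, Map(sim_class, subclass_means, names(subclass_means)))
train$y <- factor(train$y)

# Fit the three classifiers; mda() is told to look for 3 subclasses per class.
fit_lda <- lda(y ~ x1 + x2, data = train)
fit_qda <- qda(y ~ x1 + x2, data = train)
fit_mda <- mda(y ~ x1 + x2, data = train, subclasses = 3)

# Predicted classes over a grid, which can be used to draw decision boundaries.
grid <- expand.grid(x1 = seq(-5, 15, length.out = 200),
                    x2 = seq(-10, 15, length.out = 200))
boundaries <- data.frame(
  grid,
  lda = predict(fit_lda, grid)$class,
  qda = predict(fit_qda, grid)$class,
  mda = predict(fit_mda, grid)
)
```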