High-Dimensional Microarray Data Sets in R for Machine Learning

My machine learning research focuses on small-sample, high-dimensional bioinformatics data sets. One of my published papers gives an example of this work.

Many papers that propose new machine learning methods for high-dimensional data rely on the same two well-known data sets: the Alon colon cancer data set and the Golub leukemia data set. Both were introduced in papers published in 1999, so their continued use means that benchmarks have not kept pace with advances in data collection technology. The Golub data set is also a poor benchmark because its classes are so well separated that most methods classify it almost perfectly.

To address this gap, I set out to make several alternative data sets easy to load, analyze, and include in research papers. I initially planned to add only a few, but I found far more suitable data sets than I expected, and the modest project grew into a resource that saves significant time and effort. The result is the datamicroarray package, available from my GitHub account. For each data set in the package, a script downloads the data, cleans it, and stores it as a named list. See the README file for details on how the data are organized.
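As a rough illustration of the named-list layout, the sketch below builds a simulated stand-in instead of loading the package. The element names `x` and `y`, the dimensions, and the class labels are assumptions made for this example; the README remains the authoritative description of how each data set is organized.

```r
# Simulated stand-in for one data set stored as a named list.
# Assumed layout (not confirmed by this post): `x` holds the expression
# matrix with one row per sample, and `y` holds the class labels.
set.seed(42)
n <- 30    # few samples
p <- 500   # many features (genes)

toy <- list(
  x = matrix(rnorm(n * p), nrow = n, ncol = p,
             dimnames = list(NULL, paste0("gene", seq_len(p)))),
  y = factor(rep(c("tumor", "normal"), times = c(18, 12)))
)

# Basic checks worth running on any data set of this shape.
dim(toy$x)      # samples by features
table(toy$y)    # class sizes
stopifnot(nrow(toy$x) == length(toy$y), !anyNA(toy$x))
```

Keeping the features and labels together in one list means a single object can be passed to an analysis function, which is convenient when looping over many data sets.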

The package currently contains 20 small-sample, high-dimensional data sets suited to evaluating machine learning algorithms and models. A wiki in the package’s GitHub repository describes each data set and links to the original papers.
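To give a sense of the kind of evaluation these data sets are meant for, here is a minimal sketch of leave-one-out cross-validation with a nearest-centroid classifier in base R. It runs on simulated data, not on a data set from the package, and the classifier, the sample sizes, and the function name `nearest_centroid` are illustrative choices only.

```r
# Leave-one-out cross-validation of a nearest-centroid classifier on
# simulated small-sample, high-dimensional data (not a package data set).
set.seed(1)
n <- 40
p <- 1000
y <- factor(rep(c("a", "b"), each = n / 2))
x <- matrix(rnorm(n * p), nrow = n)
# Shift the first 20 features for class "b" so the classes differ.
x[y == "b", 1:20] <- x[y == "b", 1:20] + 1.5

# Predict the class whose training centroid is closest in Euclidean distance.
nearest_centroid <- function(x_train, y_train, x_new) {
  # p x K matrix with one column of feature means per class
  centroids <- sapply(levels(y_train), function(cl) {
    colMeans(x_train[y_train == cl, , drop = FALSE])
  })
  dists <- colSums((centroids - x_new)^2)
  names(which.min(dists))
}

# Hold out each sample once and train on the remaining n - 1 samples.
pred <- vapply(seq_len(n), function(i) {
  nearest_centroid(x[-i, , drop = FALSE], y[-i], x[i, ])
}, character(1))

loo_error <- mean(pred != as.character(y))
loo_error
```

With so few samples, leave-one-out is a common way to use nearly all of the data for training in each fold, though the resulting error estimate can be quite variable.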

One limitation is the size of the R package, which stores an RData file for each data set. I’m exploring ways to download the data dynamically instead and welcome suggestions. The data descriptions may also be incomplete, and any help improving them would be greatly appreciated.
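One possible direction for dynamic downloading, sketched under assumptions: fetch an RData file on first use and keep a local copy for later calls. The function name `load_cached`, the cache location, and any URL passed to it are hypothetical; the package does not currently host its files this way.

```r
# Download an RData file once, cache it locally, and load it on later calls.
# `url` and `cache_dir` are hypothetical placeholders for this sketch.
load_cached <- function(url,
                        cache_dir = file.path(tempdir(), "microarray-cache")) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(cache_dir, basename(url))

  if (!file.exists(dest)) {
    # mode = "wb" keeps the binary RData file intact on Windows.
    download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  }

  # Load into a private environment so nothing in the workspace is overwritten.
  env <- new.env()
  loaded <- load(dest, envir = env)
  mget(loaded, envir = env)   # named list of the objects stored in the file
}
```

A cache under `tempdir()` lasts only for the current R session; a persistent directory would avoid repeated downloads at the cost of managing stale files.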

Feel free to use any of the data sets provided. Make sure, however, that the data are processed properly before you analyze them and incorporate the results into your own research.