My machine learning research often involves small-sample, high-dimensional bioinformatics datasets, and much of my work explores new methodology tailored to them. For example, I’ve published a paper on this very topic.
Many machine learning studies rely heavily on two prominent datasets: the Alon colon cancer dataset and the Golub leukemia dataset. Despite their popularity, both were introduced in papers published in 1999, which suggests a potential mismatch between existing methodologies and the advances in data collection technology since then. Moreover, the Golub dataset is a poor benchmark: its classes are so well separated that most methods classify it nearly perfectly.
To address this gap, I set out to find alternative datasets that researchers like me could use. What started as a small project quickly grew into something more substantial. I’ve curated a collection of datasets and packaged them for easy access and analysis in the datamicroarray package, which is now available on my GitHub account.
Each dataset included in the package comes with a script for downloading, cleaning, and storing the data as a named list. For detailed instructions on data storage and usage, refer to the README file provided with the package. Currently, the datamicroarray package comprises 20 datasets specifically tailored for assessing machine learning algorithms and models in the context of small-sample, high-dimensional data.
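To give a feel for working with data stored this way, here is a minimal sketch in R. It builds a toy stand-in rather than loading a real dataset, and it assumes the named list holds a feature matrix `x` and a vector of class labels `y`; treat those element names as illustrative and check the README for the exact layout. The commented-out lines at the end assume the package is installed and that a dataset named `alon` ships with it.

```r
# Toy stand-in for a small-sample, high-dimensional dataset stored as a named list.
# The element names `x` and `y` are assumed for illustration; see the README.
set.seed(42)
n <- 10    # number of observations (samples)
p <- 200   # number of features (e.g., genes), far larger than n

toy <- list(
  x = matrix(rnorm(n * p), nrow = n, ncol = p),        # n-by-p feature matrix
  y = factor(rep(c("tumor", "normal"), each = n / 2))  # one class label per row of x
)

dim(toy$x)    # number of samples and number of features
table(toy$y)  # number of samples in each class

# With the package installed, a real dataset could be inspected the same way
# (assumes a dataset named `alon` is included):
# library(datamicroarray)
# data("alon", package = "datamicroarray")
# dim(alon$x)
# table(alon$y)
```

Keeping the features and labels together in one list means a single object can be passed straight to a cross-validation or classification routine.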
I’ve also added a wiki to the GitHub repository. It describes each dataset and provides additional information, including links to the original papers for reference.
One challenge I’ve encountered is the large file size of the R package, primarily due to storing an RData file for each dataset. To mitigate this issue, I’m actively exploring alternative approaches for dynamically downloading data. I welcome any suggestions or contributions from the community in this regard. Additionally, I must acknowledge that some data descriptions within the package are incomplete, and I would greatly appreciate assistance in enhancing them.
Researchers are encouraged to leverage any of the datasets provided in the datamicroarray package for their work. However, it’s essential to ensure proper data processing before conducting analysis and incorporating the results into research endeavors.