Data Development Archives - Ramhise
Blog on statistics and machine learning
Mon, 15 Apr 2024 07:24:40 +0000

High-Dimensional Microarray Data Sets in R for Machine Learning
https://ramhiser.com/2012/12/29/high-dimensional-microarray-data-sets-in-r-for-machine-learning/
Sun, 20 Aug 2023 07:22:00 +0000

My machine learning research focuses on small-sample, high-dimensional bioinformatics data sets. One of my published papers gives an example of this work.

Many papers that propose new machine learning methods for high-dimensional data rely on the same two well-known data sets: the Alon colon cancer data set and the Golub leukemia data set. Both were introduced in papers published in 1999, so their continued use suggests that benchmarks have not kept up with advances in data collection technology. The Golub data set is also a poor benchmark because its classes are so well separated that most methods classify it almost perfectly.
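To illustrate why a highly separated data set makes a weak benchmark, here is a toy simulation in Python. It uses synthetic Gaussian data, not the Golub data itself, and the function name, classifier choice (nearest centroid), and parameter values are all assumptions made for illustration.

```python
import random


def simulate_accuracy(shift, n_train=20, n_test=200, n_features=200, seed=0):
    """Nearest-centroid accuracy on two synthetic Gaussian classes.

    Class 0 is centered at 0 and class 1 at `shift` in every feature;
    all features have unit variance. The setup is hypothetical.
    """
    rng = random.Random(seed)

    def sample(label):
        center = shift * label
        return [rng.gauss(center, 1.0) for _ in range(n_features)]

    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Estimate one centroid per class from a small training sample.
    centroids = [centroid([sample(label) for _ in range(n_train)])
                 for label in (0, 1)]

    # Classify fresh test points by whichever centroid is closer.
    correct = 0
    for i in range(n_test):
        label = i % 2
        x = sample(label)
        closer_to_0 = sq_dist(x, centroids[0]) <= sq_dist(x, centroids[1])
        predicted = 0 if closer_to_0 else 1
        correct += predicted == label
    return correct / n_test


well_separated = simulate_accuracy(shift=1.0)
weakly_separated = simulate_accuracy(shift=0.05)
print(f"well separated:   {well_separated:.2f}")
print(f"weakly separated: {weakly_separated:.2f}")
```

When the classes are far apart, even this very simple classifier leaves almost no room for a better method to show an improvement, which is the concern with using such a data set as a benchmark.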

To address this gap, I set out to find alternative data sets and make them easy to load, analyze, and include in papers. I initially meant to compile only a few, but as I dug deeper I found many suitable candidates, and what began as a modest project has become a resource that saves significant time and effort. The result is the datamicroarray package, available from my GitHub account. For each data set in the package, I wrote a script that downloads the data, cleans it, and saves it as a named list. See the README file for details on how the data are organized.

The package currently includes 20 small-sample, high-dimensional data sets for evaluating machine learning algorithms and models. The wiki in the package’s GitHub repository describes each data set and provides additional information and links to the original papers.

One limitation is the size of the R package, since it stores an RData file for each data set. I’m exploring ways to download the data dynamically instead and welcome suggestions. The data descriptions may also be incomplete, and any help improving them would be greatly appreciated.

Feel free to use any of the data sets. Just make sure the data are processed appropriately before you analyze them and report the results in your own research.

How to Download Kaggle Data with Python and requests.py
https://ramhiser.com/2012/11/23/how-to-download-kaggle-data-with-python-and-requests-dot-py/
Tue, 11 Jul 2023 07:20:00 +0000

Lately, I’ve been delving into Kaggle and found myself frustrated with the manual data download process via their website. Preferring a more programmatic approach, I sought a solution and discovered the recommendation to use lynx. However, my friend Anthony suggested an alternative: writing a Python script.

Despite Python not being my primary language, I was intrigued by how straightforward it was to craft the script using requests.py. In this instance, I aimed to download the training data set from Kaggle’s Digit Recognizer competition.

The approach is simple:

  1. Request the file from Kaggle; the request is rejected because we are not logged in.
  2. Log in using requests.py.
  3. Proceed to download the data.

Below is the code snippet:

python
import requests

# The direct link to the Kaggle data set
data_url = 'http://www.kaggle.com/c/digit-recognizer/download/train.csv'

# The local path where the data set is saved.
local_filename = "train.csv"

# Kaggle Username and Password
kaggle_info = {'UserName': "my_username", 'Password': "my_password"}

# Attempts to download the CSV file. Gets rejected because we are not logged in.
r = requests.get(data_url)

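# Log in to Kaggle and retrieve the data. stream=True defers downloading the
# body so it can be read in chunks (requests before 1.0 used prefetch=False).
r = requests.post(r.url, data=kaggle_info, stream=True)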

# Writes the data to a local file one chunk at a time.
with open(local_filename, 'wb') as f:
    for chunk in r.iter_content(chunk_size=512 * 1024):  # Reads 512KB at a time into memory
        if chunk:  # filter out keep-alive new chunks
            f.write(chunk)

Simply replace “my_username” and “my_password” with your Kaggle login credentials. Feel free to adjust the chunk size according to your preferences.
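As a small sketch, the chunked write loop can be factored into a helper that accepts any iterable of byte chunks, such as the one returned by r.iter_content. The helper name write_chunks and the sample data below are hypothetical and use only the standard library, so the loop can be tried without a Kaggle login.

```python
import os
import tempfile


def write_chunks(chunks, path):
    """Write an iterable of byte chunks to `path` and return the bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)
                written += len(chunk)
    return written


# Stand-in for r.iter_content(chunk_size=512 * 1024): any iterable of bytes works.
fake_chunks = [b"label,pixel0\n", b"", b"1,0\n"]
demo_path = os.path.join(tempfile.mkdtemp(), "train.csv")
total = write_chunks(fake_chunks, demo_path)
print(total, "bytes written to", demo_path)
```

With a real response, the call would be write_chunks(r.iter_content(chunk_size=512 * 1024), local_filename). Only one chunk is held in memory at a time, so a larger chunk size trades memory for fewer write calls.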

Setting Up the Development Version of R
https://ramhiser.com/2012/08/28/setting-up-the-development-version-of-r/
Mon, 19 Jun 2023 07:18:00 +0000

At Fred Hutchinson, my colleagues often utilize the development version of R, known as R-devel, and have encouraged me to follow suit. In this post, I’ll outline how I’ve configured the development version of R on our Linux server, which I access remotely due to its superior performance compared to my Mac.

To begin, I checked out the R-devel source code with Subversion into ~/local/R-devel (~ is /home/jramey), then configured and compiled it with ~/local/ as the install prefix. If you’re building from source, I recommend checking out these Subversion tips. Below are the commands I used to install R-devel:

bash
svn co https://svn.r-project.org/R/trunk ~/local/R-devel
cd ~/local/R-devel
./tools/rsync-recommended
./configure --prefix=/home/jramey/local/
make
make install

The third command is crucial as it downloads the recommended R packages, which are not included in the SVN repository. For further details, refer to this resource.

The release version (currently 2.15.1) is installed in /usr/local/bin, but I want R-devel to take precedence. To achieve this, I added the following lines to my ~/.bashrc file, which put ~/local/bin first on the PATH:

bash
PATH=~/local/bin:$PATH
export PATH

# Never save or restore when running R
alias R='R --no-save --no-restore-data --quiet'

The alias on the final line starts R quietly and never saves or restores the workspace.
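The reason prepending ~/local/bin works is that the shell uses the first match it finds when it walks the PATH from left to right. Here is a small Python sketch of that lookup rule, assuming a POSIX system. The temporary directories and stub files are hypothetical stand-ins for ~/local/bin and /usr/local/bin, not the real installations.

```python
import os
import shutil
import tempfile


def make_stub_r(directory):
    """Create a stub executable named R inside `directory` and return its path."""
    path = os.path.join(directory, "R")
    with open(path, "w") as f:
        f.write("#!/bin/sh\n")
    os.chmod(path, 0o755)
    return path


# Hypothetical stand-ins for ~/local/bin (R-devel) and /usr/local/bin (release).
devel_bin = tempfile.mkdtemp(prefix="devel-bin-")
release_bin = tempfile.mkdtemp(prefix="release-bin-")
devel_r = make_stub_r(devel_bin)
release_r = make_stub_r(release_bin)

# Equivalent of PATH=~/local/bin:$PATH, with the devel directory listed first.
search_path = os.pathsep.join([devel_bin, release_bin])
found = shutil.which("R", path=search_path)
print("resolved to R-devel stub:", found == devel_r)
```

Reversing the order of the two directories in search_path would resolve to the release stub instead, which is why the new entry has to go in front of $PATH rather than after it.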

Subsequently, I proceeded to install the R packages I frequently use:

R
install.packages(c('devtools', 'ProjectTemplate', 'knitr', 'ggplot2', 'reshape2',
                   'plyr', 'Rcpp', 'mvtnorm', 'caret'), dependencies = TRUE)

Following this, I updated my .Rprofile file, which I maintain in a GitHub gist.

Lastly, our group works with flow cytometry data and maintains several Bioconductor packages for it, so I install those as well. We typically install the flowWorkspace package in R using the following commands:

R
source("http://bioconductor.org/biocLite.R")
biocLite("flowWorkspace")

Steve Jobs’ 2005 Stanford Commencement Address
https://ramhiser.com/2011/12/04/steve-jobs-2005-stanford-commencement-address/
Tue, 04 Apr 2023 07:16:00 +0000

Having just stumbled upon Steve Jobs’ 2005 Stanford Commencement Address, a speech that has garnered nearly 13 million views, I find myself grateful for the timing. Had I encountered it earlier, I might not have fully appreciated the depth of his insights. Among the plethora of memorable quotes, a few resonate deeply with me, echoing the wise counsel of my grandmother from years past.

“Don’t settle.”

“Stay hungry. Stay foolish.”

These words strike a chord, serving as a poignant reminder of a fundamental truth: the pursuit of passion and purpose should never wane. Jobs’ admonition to seek out what one loves resonates powerfully, whether in matters of career or matters of the heart. Indeed, our work occupies a significant portion of our lives, and true satisfaction can only be attained by dedicating ourselves to endeavors we deem truly meaningful.

It’s refreshing to revisit such timeless wisdom, a gentle nudge to reassess our priorities and reignite our pursuit of greatness.

As an aside, since the migration to Jekyll 2.0 on GitHub Pages, some features, including the YouTube plugin I previously used, are disabled. I had hoped to embed the video directly; perhaps that will be possible again in the future.
