## November 23, 2012

Recently I started playing with Kaggle. I quickly became frustrated that the only way to download their data was through the website; I would rather download it programmatically. After some Googling, the best recommendation I found was to use lynx. My friend Anthony suggested writing a Python script instead.

Although Python is not my primary language, I was struck by how simple it was to write the script with the requests library. In this example, I download the training data set from Kaggle’s Digit Recognizer competition.

The idea is simple:

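1. Attempt to download a file from Kaggle but get blocked because you are not logged in.
2. Log in by posting your Kaggle credentials to the login page you were redirected to.
3. Save the data that comes back to a local file.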

Here’s the code:

import requests

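# The direct link to the Kaggle data set.
# Placeholder: paste the download URL from the competition's data page.
data_url = "<direct link to train.csv>"

# The local path where the data set is saved.
local_filename = "train.csv"

# Kaggle username and password. The form field names used here are an assumption.
kaggle_info = {"UserName": "my_username", "Password": "my_password"}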

# Attempts to download the CSV file. Gets rejected because we are not logged in.
r = requests.get(data_url)

# Log in to Kaggle and retrieve the data.
# stream=True replaces the prefetch=False argument that requests dropped in version 1.0.
r = requests.post(r.url, data=kaggle_info, stream=True)

# Writes the data to a local file one chunk at a time.
# Binary mode ('wb') because iter_content yields bytes.
with open(local_filename, 'wb') as f:
    for chunk in r.iter_content(chunk_size=512 * 1024):  # Reads 512KB at a time into memory
        if chunk:  # filter out keep-alive new chunks
            f.write(chunk)

Simply change my_username and my_password to your Kaggle login info. Feel free to optimize the chunk size to your liking.
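
If you want to experiment with the chunk size, one option is to wrap the same steps in a function that takes it as a parameter. What follows is only a rough sketch, not the script I ran: the names `save_chunks` and `download_with_login` are made up for illustration, and `session` is assumed to be something that behaves like `requests.Session` (it needs `get` and `post`). Using a session instead of the module-level `requests.get` and `requests.post` is a swapped-in choice so that cookies carry over between the two requests.

```python
def save_chunks(chunks, local_filename):
    """Write an iterable of byte chunks to local_filename; return bytes written."""
    written = 0
    with open(local_filename, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)
                written += len(chunk)
    return written


def download_with_login(session, data_url, credentials, local_filename,
                        chunk_size=512 * 1024):
    """Hypothetical helper: same get-then-post flow as the script above.

    `session` is assumed to behave like requests.Session.
    """
    # The first request ends up on the login page because we are not logged in.
    r = session.get(data_url)
    # Post the credentials to the page we landed on and stream the response body.
    r = session.post(r.url, data=credentials, stream=True)
    r.raise_for_status()
    return save_chunks(r.iter_content(chunk_size=chunk_size), local_filename)
```

With requests installed, a call might look like `download_with_login(requests.Session(), data_url, kaggle_info, "train.csv", chunk_size=1024 * 1024)`. The return value is the number of bytes written, which makes it easy to compare a few chunk sizes on the same file.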