Lately, I’ve been delving into Kaggle and found myself frustrated with the manual data download process via their website. Preferring a more programmatic approach, I sought a solution and discovered the recommendation to use lynx. However, my friend Anthony suggested an alternative: writing a Python script.

Although Python is not my primary language, I was intrigued by how straightforward it was to write the script with the Requests library. In this instance, I wanted to download the training data set from Kaggle’s Digit Recognizer competition.

The approach is simple:

  1. Try to download a file from Kaggle. The request is redirected to the login page because we are not logged in.
  2. Log in by posting the credentials to that page with Requests.
  3. Download the data from the response to the login request.

Below is the code snippet:

```python
import requests

# The direct link to the Kaggle data set
data_url = 'http://www.kaggle.com/c/digit-recognizer/download/train.csv'

# The local path where the data set is saved
local_filename = "train.csv"

# Kaggle username and password
kaggle_info = {'UserName': "my_username", 'Password': "my_password"}

# Attempt to download the CSV file. This is rejected because we are not
# logged in, so r.url ends up pointing at the page we were redirected to.
r = requests.get(data_url)

# Log in to Kaggle and retrieve the data.
# stream=True defers downloading the body until it is iterated
# (releases of Requests before 1.0 spelled this prefetch=False).
r = requests.post(r.url, data=kaggle_info, stream=True)

# Write the data to a local file one chunk at a time.
with open(local_filename, 'wb') as f:
    for chunk in r.iter_content(chunk_size=512 * 1024):  # Reads 512KB at a time into memory
        if chunk:  # filter out keep-alive new chunks
            f.write(chunk)
```

Simply replace “my_username” and “my_password” with your Kaggle login credentials. Feel free to adjust the chunk size according to your preferences.
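If you would rather not keep your password in the script itself, one option is to read the credentials from environment variables and to move the chunked write into a small helper. The following is a minimal sketch, not part of the original script: the variable names `KAGGLE_USERNAME` and `KAGGLE_PASSWORD` and the helper names `load_kaggle_credentials` and `save_chunks` are made up for illustration, while the form fields `UserName` and `Password` are the ones used in the script above.

```python
import os


def load_kaggle_credentials(env=os.environ):
    """Build the login form payload from environment variables.

    KAGGLE_USERNAME and KAGGLE_PASSWORD are hypothetical names chosen
    for this sketch; a KeyError is raised if either one is missing.
    """
    return {
        "UserName": env["KAGGLE_USERNAME"],
        "Password": env["KAGGLE_PASSWORD"],
    }


def save_chunks(chunks, local_filename):
    """Write an iterable of byte chunks to disk; return the bytes written."""
    total = 0
    with open(local_filename, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                total += len(chunk)
    return total
```

In the script, this would replace the hard-coded dictionary with `kaggle_info = load_kaggle_credentials()` and the final loop with `save_chunks(r.iter_content(chunk_size=1024 * 1024), local_filename)`, where the chunk size (1MB here) is whatever suits your memory budget.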