Lately, I’ve been delving into Kaggle and found myself frustrated with having to download data manually through their website. I wanted a more programmatic approach and came across a recommendation to use lynx. However, my friend Anthony suggested an alternative: writing a Python script.
Although Python is not my primary language, I was intrigued by how straightforward it was to write the script with the requests library. In this instance, I wanted to download the training data set from Kaggle’s Digit Recognizer competition.
The approach is simple:
- Request the file from Kaggle. Because we are not logged in, the request is rejected and we end up at the login page instead.
- Log in by posting our credentials to that page using requests.
- Proceed to download the data.
Below is the code snippet:
import requests
# The direct link to the Kaggle data set
data_url = 'http://www.kaggle.com/c/digit-recognizer/download/train.csv'
# The local path where the data set is saved.
local_filename = "train.csv"
# Kaggle Username and Password
kaggle_info = {'UserName': "my_username", 'Password': "my_password"}
# Attempts to download the CSV file. Gets rejected because we are not logged in.
r = requests.get(data_url)
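# Log in to Kaggle by posting the credentials to the URL we ended up at, then retrieve the data.
# stream=True (called prefetch=False in requests before 1.0) defers the body download so it can be read in chunks.
r = requests.post(r.url, data=kaggle_info, stream=True)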
# Writes the data to a local file one chunk at a time.
with open(local_filename, 'wb') as f:
    for chunk in r.iter_content(chunk_size=512 * 1024):  # Reads 512KB at a time into memory
        if chunk:  # filter out keep-alive new chunks
            f.write(chunk)
Simply replace “my_username” and “my_password” with your Kaggle login credentials. Feel free to adjust the chunk size according to your preferences.
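
One possible refinement is sketched here, not tested against Kaggle itself: wrap the same three steps in a function that uses a `requests.Session`, so any cookies set during login persist across requests, and that fails loudly if the login did not work. The names `write_chunks` and `download_with_login` are hypothetical helpers introduced for illustration, and the `Content-Type` check rests on the assumption that a failed login returns an HTML page rather than the CSV.

```python
import requests


def write_chunks(chunks, path):
    """Write an iterable of byte chunks to path; return the number of bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)
                written += len(chunk)
    return written


def download_with_login(data_url, credentials, path,
                        chunk_size=512 * 1024, session=None):
    """Hypothetical helper: request the file, log in where we land, save the body."""
    if session is None:
        session = requests.Session()  # keeps cookies between the two requests
    # First request: without a login we expect to end up at the login page.
    r = session.get(data_url)
    # Post the credentials to wherever we ended up, streaming the response body.
    r = session.post(r.url, data=credentials, stream=True)
    r.raise_for_status()  # raise on 4xx/5xx responses
    # Assumption: a successful download is not served as an HTML page.
    if "text/html" in r.headers.get("Content-Type", ""):
        raise RuntimeError("Got an HTML page instead of the data; login may have failed")
    return write_chunks(r.iter_content(chunk_size=chunk_size), path)
```

Calling `download_with_login(data_url, kaggle_info, local_filename)` with the variables from the script above would then replace the last few lines, and the return value gives the number of bytes written, which is a quick sanity check that something larger than an error page arrived.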