Configuring IPython Notebook Support for PySpark

01 Feb 2015
Apache Spark is a great tool for performing large-scale data processing. Lately, I have begun working with PySpark, a way of interfacing with Spark through Python. After a discussion with a coworker, we were curious whether PySpark could run from within an IPython Notebook. It turns out this is fairly straightforward: just set up an IPython profile.
- Install Spark
- Create PySpark profile for IPython
- Some config
- Simple word count example
The steps below were successfully executed using Mac OS X 10.10.2 and Homebrew. The majority of the steps should be similar for non-Windows environments. For demonstration purposes, Spark will run in local mode, but the configuration can be updated to submit code to a cluster.
Many thanks to my coworker Steve Wampler who did much of the work.
Install Spark

- Download the source for the latest Spark release
- Unzip the source to ~/spark-1.2.0/ (or wherever you wish to install Spark)
- From the CLI, change into the Spark source directory (e.g., cd ~/spark-1.2.0/)
- Install the Scala build tool: brew install sbt
- Build Spark: sbt assembly (this takes a while)
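Pulling the bullets together, the build can be sketched as a single shell session; the archive URL is an assumption based on the 1.2.0 path above, and the download location for the tarball is illustrative:

```shell
# Download and unpack the Spark 1.2.0 source
# (URL assumes the Apache release archive)
curl -LO https://archive.apache.org/dist/spark/spark-1.2.0/spark-1.2.0.tgz
tar -xzf spark-1.2.0.tgz -C ~/

# Build from inside the source directory
cd ~/spark-1.2.0/
brew install sbt   # the Scala build tool
sbt assembly       # takes a while
```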
Create PySpark Profile for IPython
After Spark is installed, let’s start by creating a new IPython profile for PySpark.
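A sketch of the profile creation, using IPython's built-in profile command:

```shell
# Creates ~/.ipython/profile_pyspark/ with default config and startup directories
ipython profile create pyspark
```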
To avoid port conflicts with other IPython profiles, I updated the default notebook port in the new profile's configuration.
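For example, in ~/.ipython/profile_pyspark/ipython_notebook_config.py (the port number 42424 is an arbitrary illustrative choice, not the value from my setup):

```python
# ~/.ipython/profile_pyspark/ipython_notebook_config.py
c = get_config()

# Run this profile's notebook server on a non-default port so it does not
# clash with other IPython profiles (42424 is an arbitrary example)
c.NotebookApp.port = 42424
```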
Set the following environment variables in your shell startup file (e.g., ~/.bash_profile):
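A sketch of the variables PySpark needs; the install path and the master setting are assumptions that match the local-mode setup described above:

```shell
# Point at the Spark build from the previous section (adjust to your path)
export SPARK_HOME="$HOME/spark-1.2.0"

# Run Spark in local mode with two worker threads
# (change this to submit code to a cluster instead)
export PYSPARK_SUBMIT_ARGS="--master local[2]"
```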
Create a file named ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py containing the following:
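A sketch of the startup file for Spark 1.2.x under Python 2; the py4j version in the zip name is an assumption and should be checked against what actually ships in $SPARK_HOME/python/lib:

```python
# ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py
import os
import sys

# Locate Spark via the environment variable set earlier
spark_home = os.environ.get('SPARK_HOME')
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')

# Put PySpark and its bundled py4j bridge on the Python path
# (verify the py4j version against $SPARK_HOME/python/lib)
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python', 'lib', 'py4j-0.8.2.1-src.zip'))

# Run PySpark's shell bootstrap, which creates the SparkContext `sc`
# (execfile is Python 2, which Spark 1.2 targets)
execfile(os.path.join(spark_home, 'python', 'pyspark', 'shell.py'))
```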
Now we are ready to launch a notebook using the PySpark profile.
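With the profile in place, the launch command for the IPython of this era is:

```shell
ipython notebook --profile=pyspark
```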
Word Count Example
Make sure the ipython pyspark profile created a SparkContext by typing sc within the notebook. You should see output similar to <pyspark.context.SparkContext at 0x1097e8e90>.
Next, load a text file into a Spark RDD. For example, load the Spark README file:
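For example, assuming SPARK_HOME is set as above (the variable name text_file is illustrative):

```python
import os

# Load Spark's bundled README into an RDD
# (`sc` is the SparkContext created by the startup script)
readme_path = os.path.join(os.environ['SPARK_HOME'], 'README.md')
text_file = sc.textFile(readme_path)
```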
The word count script below is quite simple. It takes the following steps:
- Split each line from the file into words
- Map each word to a tuple containing the word and an initial count of 1
- Sum up the count for each word
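A sketch of such a script, following the three steps above and assuming a text_file RDD like the one loaded from the README (the variable names are illustrative):

```python
word_counts = (text_file
               .flatMap(lambda line: line.split())    # 1. split lines into words
               .map(lambda word: (word, 1))           # 2. word -> (word, 1)
               .reduceByKey(lambda a, b: a + b))      # 3. sum the counts per word
```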
At this point, the word count has not been executed (lazy evaluation). To actually count the words, execute the pipeline:
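For example, collect() forces evaluation and brings the results back to the driver (assuming the pipeline is bound to a variable such as word_counts):

```python
# Trigger the computation; returns a list of (word, count) tuples
word_counts.collect()
```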
Here’s a portion of the output: