Apache Spark is a powerful tool for handling large-scale data processing, and PySpark offers a convenient way to interact with Spark using Python. Recently, a colleague and I wondered if it was possible to run PySpark from within an IPython Notebook. As it turns out, setting up PySpark support in an IPython profile is quite straightforward.

Here’s a quick summary of the process:

  1. Install Spark: Download the latest Spark source release, unpack it to your desired location (e.g., ~/spark-1.2.0/), install the Scala build tool (sbt) with Homebrew, and build Spark by running sbt assembly.
  2. Create a PySpark Profile for IPython: Start by creating a new IPython profile for PySpark with the command ipython profile create pyspark. To avoid clashing with any existing notebook server, change the default port to 42424 in ~/.ipython/profile_pyspark/ipython_notebook_config.py. Set the necessary environment variables in .bashrc or .bash_profile, including SPARK_HOME and PYSPARK_SUBMIT_ARGS. Finally, create a startup file named ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py to configure the Spark environment (a sketch of this file appears after the list).
  3. Launch IPython Notebook with the PySpark Profile: Use the command ipython notebook --profile=pyspark to launch a notebook server with the PySpark profile.
  4. Word Count Example: Confirm that the PySpark profile has created a SparkContext by typing sc in the notebook. Then load a text file into a Spark RDD (e.g., the Spark README file), split each line into words, map each word to a tuple with an initial count of 1, and sum up the counts for each word. Finally, execute the pipeline with an action to materialize the counts (see the word count sketch after the list).
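
For step 2, here is a minimal sketch of what the startup file ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py might look like. It assumes SPARK_HOME has already been exported in your shell profile and that the Py4J archive bundled with Spark 1.2.0 is named py4j-0.8.2.1-src.zip; that filename is an assumption, so check your own $SPARK_HOME/python/lib/ directory and adjust accordingly.

```python
# ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py
import os
import sys

# SPARK_HOME should already be exported in .bashrc or .bash_profile.
spark_home = os.environ.get('SPARK_HOME')
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')

# Put PySpark and its bundled Py4J on the Python path.
# The Py4J zip name varies by Spark release; check $SPARK_HOME/python/lib/.
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python', 'lib', 'py4j-0.8.2.1-src.zip'))

# Run PySpark's interactive shell setup, which creates the SparkContext `sc`.
# (execfile is Python 2, matching the Spark 1.2 / IPython era described here.)
execfile(os.path.join(spark_home, 'python', 'pyspark', 'shell.py'))
```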
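
For the word count in step 4, a sketch of the pipeline is shown below. It assumes the startup file above has already created the SparkContext sc and uses the README.md that ships in the Spark source tree; any text file will do.

```python
import os

# `sc` is the SparkContext created by the PySpark profile; typing `sc` in a
# cell should display something like <pyspark.context.SparkContext ...>.

# Load a text file into an RDD (here, the Spark README as an example).
readme_path = os.path.join(os.environ['SPARK_HOME'], 'README.md')
lines = sc.textFile(readme_path)

# Split each line into words, pair each word with an initial count of 1,
# and sum up the counts for each word.
word_counts = (lines.flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))

# Transformations are lazy; take(10) is an action that triggers the
# computation and returns the first ten (word, count) pairs.
word_counts.take(10)
```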

With these steps in place, you can run PySpark from IPython Notebook and perform tasks like word count analysis seamlessly within the notebook environment.