Hacking PySpark inside Jupyter Notebook

Python is a wonderful programming language for data analytics. Normally, I prefer to write Python code inside Jupyter Notebook (previously known as IPython), because it lets us create and share documents that contain live code, equations, visualizations, and explanatory text. Apache Spark is a fast and general engine for large-scale data processing, and PySpark is its Python API. So if you are interested in data science, writing PySpark code inside Jupyter is a good starting point:

IPYTHON_OPTS="notebook" pyspark --master spark://localhost:7077 --executor-memory 7g

(Screenshot: PySpark running inside Jupyter Notebook)

Install Jupyter

If you are a Python user, I highly recommend installing Anaconda. Anaconda conveniently installs Python, the Jupyter Notebook, and other commonly used packages for scientific computing and data science.
Go to https://www.continuum.io/downloads and follow the instructions for downloading and installing Anaconda (Jupyter is included):

$ wget https://{somewhere}/Anaconda2-2.4.1-MacOSX-x86_64.sh
$ bash Anaconda2-2.4.1-MacOSX-x86_64.sh
$ python
Python 2.7.11 |Anaconda 2.4.1 (x86_64)| (default, Dec 6 2015, 18:57:58)
[GCC 4.2.1 (Apple Inc. build 5577)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>>
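
A quick way to confirm that the scientific stack bundled with Anaconda is in place is to import a couple of the packages it ships with from that same interpreter (a minimal sanity check, assuming the default Anaconda install):

>>> import numpy as np    # bundled with Anaconda
>>> import pandas as pd   # bundled with Anaconda
>>> print(np.__version__, pd.__version__)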

You can then easily run the Jupyter Notebook:

$ jupyter notebook # Go to http://localhost:8888

(Screenshot: the Jupyter Notebook running at http://localhost:8888)

Install Spark

If you are not familiar with Spark, you can first read the official Spark documentation (https://spark.apache.org/docs/latest/).

Here are simple instructions for installing Spark:

# MacOS
$ brew install apache-spark
# Linux
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.6.tgz
$ tar zxvf spark-1.6.0-bin-hadoop2.6.tgz
$ vim .bashrc
export PATH=/{your_path}/spark-1.6.0-bin-hadoop2.6/sbin:$PATH
export PATH=/{your_path}/spark-1.6.0-bin-hadoop2.6/bin:$PATH
$ source .bashrc
# Run PySpark shell
$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/
Using Python version 2.7.11 (default, Dec 6 2015 18:08:32)
SparkContext available as sc, HiveContext available as sqlContext.
>>>
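
Once the shell is up, you can verify that the SparkContext sc actually works with a tiny job (a minimal sketch; the numbers are arbitrary):

>>> # count the even numbers in 0..999 using the pre-created SparkContext
>>> sc.parallelize(range(1000)).filter(lambda x: x % 2 == 0).count()
500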

Launch PySpark inside IPython (Jupyter)

Launch the PySpark shell in IPython:

$ PYSPARK_DRIVER_PYTHON=ipython pyspark
# or
$ IPYTHON=1 pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/
Using Python version 2.7.11 (default, Dec 6 2015 18:08:32)
SparkContext available as sc, HiveContext available as sqlContext.
In [1]:
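
Besides sc, the shell also creates sqlContext (a HiveContext in Spark 1.6), so you can experiment with DataFrames straight from the IPython prompt. A minimal sketch with made-up rows:

In [1]: df = sqlContext.createDataFrame([(1, "spark"), (2, "jupyter")], ["id", "name"])

In [2]: df.filter(df.id > 1).show()   # prints the single row (2, jupyter)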

Launch the PySpark shell in the IPython (Jupyter) Notebook at http://localhost:8888:

$ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
# or
$ IPYTHON_OPTS="notebook" pyspark
# You can also specify the executor memory
$ IPYTHON_OPTS="notebook" pyspark --executor-memory 7g
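
With the notebook open at http://localhost:8888, sc is available in every cell. A minimal word-count sketch (README.md is just a placeholder for any local text file):

# In a notebook cell: count word frequencies with the pre-created SparkContext
words = sc.textFile("README.md") \
          .flatMap(lambda line: line.split()) \
          .map(lambda w: (w, 1)) \
          .reduceByKey(lambda a, b: a + b)
words.takeOrdered(5, key=lambda kv: -kv[1])   # top 5 most frequent words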

Run PySpark on a cluster inside IPython (Jupyter)

It is assumed that you have deployed a Spark cluster in standalone mode and that the master IP is localhost.

IPYTHON_OPTS="notebook" pyspark --master spark://localhost:7077 --executor-memory 7g
# You can also ship extra Python modules to the executors with --py-files
IPYTHON_OPTS="notebook" pyspark \
--master spark://localhost:7077 \
--executor-memory 7g \
--py-files tensorflow-py2.7.egg
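
Modules shipped with --py-files are added to the Python path on the executors, so you can import them inside the functions that run on the cluster. A hedged sketch, assuming tensorflow-py2.7.egg actually provides the tensorflow package:

# In a notebook cell: sc is connected to spark://localhost:7077
def tf_version(_):
    import tensorflow as tf   # resolved from tensorflow-py2.7.egg on each executor
    return tf.__version__

sc.parallelize(range(4), 2).map(tf_version).distinct().collect()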

(Screenshot: PySpark running on the cluster inside Jupyter)
