Jupyter Notebook with Python and Spark for the Toulouse Data Science workshop

What it Gives You

Jupyter Notebook 3.2
Conda Python 2.7.x
pyspark, pandas, matplotlib, scipy, seaborn, scikit-learn pre-installed
Spark 1.6.0 for use in local mode
Unprivileged user jovyan (uid=1000, configurable, see options) in group users (gid=100) with ownership over /home/jovyan and /opt/conda

The following command starts a container with the Notebook server listening for HTTP connections on port 8888, with no authentication configured. It also publishes port 4040, the default port of the Spark web UI.

docker run -d -p 8888:8888 -p 4040:4040 noleto/pyspark-jupyter
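
To confirm that the two published ports are reachable after starting the container, a small check like the following may help. It is only a sketch: the helper name port_is_open is made up for illustration, and it assumes the container runs on the same machine with the port mapping from the command above. A successful TCP connection shows only that the port is published and reachable, not that the Notebook server or the Spark UI behind it is answering.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        conn = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        # Connection refused, host unreachable, or timed out.
        return False
    conn.close()
    return True


if __name__ == "__main__":
    # Assumption: the container runs locally with -p 8888:8888 -p 4040:4040.
    for name, port in [("Notebook server", 8888), ("Spark UI", 4040)]:
        state = "reachable" if port_is_open("localhost", port) else "not reachable"
        print("%s (port %d): %s" % (name, port, state))
```

The script uses only the standard library and runs under both Python 2.7 and Python 3, so it can be run on the host rather than inside the container.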

This configuration is well suited to running Spark in local mode on small, local data.

For example, the first few cells in a Python 2 notebook might read:

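import pyspark
# reuse the SparkContext if one already exists, otherwise start one (local mode by default)
sc = pyspark.SparkContext.getOrCreate()
# do something to prove it works: draw 5 random elements without replacement
rdd = sc.parallelize(xrange(1000))
rdd.takeSample(False, 5)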