This tutorial contains two notebooks: `Presentation.ipynb` and `NycTaxi.ipynb`.
In the first notebook we generate a 20M dummy dataset and show how `dask.delayed` and `dask.dataframe` work.
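As a rough illustration of what the first notebook demonstrates, the snippet below (a hypothetical sketch, not code from `Presentation.ipynb`) builds a small dummy dataset with `dask.delayed` and wraps the same lazy chunks in a `dask.dataframe`; the notebook's actual data generation may differ.

```python
import dask
import dask.dataframe as dd
import numpy as np
import pandas as pd

# dask.delayed turns plain Python functions into lazy tasks in a graph
@dask.delayed
def make_chunk(seed):
    rng = np.random.default_rng(seed)
    return pd.DataFrame({"x": rng.normal(size=1_000),
                         "y": rng.integers(0, 10, size=1_000)})

@dask.delayed
def chunk_sum(df):
    return df["x"].sum()

# Nothing runs until .compute() is called on the final node of the graph
partial = [chunk_sum(make_chunk(seed)) for seed in range(4)]
total = dask.delayed(sum)(partial).compute()

# dask.dataframe stitches the same lazy chunks into a pandas-like DataFrame
ddf = dd.from_delayed([make_chunk(seed) for seed in range(4)])
print(total, ddf.groupby("y")["x"].mean().compute())
```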
In order to run the second notebook we need to download the NYC taxi trip data: each file is a CSV of roughly 10 GB and 112M rows, and they all share the same schema (column order and dtypes). The goal is to read this data and do some analytics on an old laptop with 16 GB of RAM and a slow CPU (Intel(R) Core(TM) i3-4000M @ 2.40GHz). With this machine it is not even possible to load a single file into memory, so the focus is on the workflow listed below.
Dask workflow
- Split the data into smaller files
- Change dtypes and convert to parquet
- Clean the data and (again) save to parquet
- Take advantage of the fact that parquet is a columnar storage format
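A minimal sketch of those four steps, assuming `dask.dataframe` with the `pyarrow` engine; the paths, column names and dtypes below are placeholders, not necessarily the ones used in `NycTaxi.ipynb`:

```python
import dask.dataframe as dd

# 1. Split: read the huge CSVs in ~64 MB blocks instead of loading them whole
ddf = dd.read_csv("data/raw/*.csv", blocksize="64MB",
                  dtype={"passenger_count": "float64"})

# 2. Change dtypes and convert to parquet (smaller and faster to read than CSV)
ddf = ddf.astype({"passenger_count": "int8"})
ddf["tpep_pickup_datetime"] = dd.to_datetime(ddf["tpep_pickup_datetime"])
ddf.to_parquet("data/parquet/", engine="pyarrow")

# 3. Clean the data and save to parquet again
ddf = dd.read_parquet("data/parquet/", engine="pyarrow")
cleaned = ddf[(ddf["trip_distance"] > 0) & (ddf["fare_amount"] >= 0)]
cleaned.to_parquet("data/clean/", engine="pyarrow")

# 4. Columnar storage: read only the columns a given analysis needs
fares = dd.read_parquet("data/clean/", columns=["fare_amount"], engine="pyarrow")
print(fares["fare_amount"].mean().compute())
```

Because parquet is columnar, the last read only touches the `fare_amount` column on disk instead of the whole table, which is what keeps the analytics within the laptop's 16 GB of RAM.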
You are supposed to have Anaconda or Miniconda installed. Then install the following packages:
conda install -c conda-forge jupyterlab
conda install -c conda-forge nodejs
jupyter labextension install @jupyterlab/toc # Optional
jupyter labextension install dask-labextension
Create the environment with `conda env create -f environment.yml`
and add the kernel to Jupyter via
conda activate pydata_stg && \
python -m ipykernel install --user --name pydata_stg && \
conda deactivate
The main packages in the environment are:
- dask
- pyarrow
- python-graphviz (optional)
- holidays (because we need them!)