This tutorial contains two notebooks: `Presentation.ipynb` and `NycTaxi.ipynb`.
In the first notebook we generate a 20M dummy dataset and show how `dask.delayed` and `dask.dataframe` work.
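As a rough illustration of what the first notebook demonstrates, the snippet below (a hypothetical sketch, not code from `Presentation.ipynb`) builds a small dummy dataset with `dask.delayed` and wraps the same lazy chunks in a `dask.dataframe`; the notebook's actual data generation may differ.

```python
import dask
import dask.dataframe as dd
import numpy as np
import pandas as pd

# dask.delayed turns plain Python functions into lazy tasks in a graph
@dask.delayed
def make_chunk(seed):
    rng = np.random.default_rng(seed)
    return pd.DataFrame({"x": rng.normal(size=1_000),
                         "y": rng.integers(0, 10, size=1_000)})

@dask.delayed
def chunk_sum(df):
    return df["x"].sum()

# Nothing runs until .compute() is called on the final node of the graph
partial = [chunk_sum(make_chunk(seed)) for seed in range(4)]
total = dask.delayed(sum)(partial).compute()

# dask.dataframe stitches the same lazy chunks into a pandas-like DataFrame
ddf = dd.from_delayed([make_chunk(seed) for seed in range(4)])
print(total, ddf.groupby("y")["x"].mean().compute())
```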
In order to run the second notebook we need to download the NYC taxi trip data: each file is a CSV of roughly 10 GB and 112M rows, and they all share the same schema (column order and dtypes). The goal is to read this data and do some analytics on an old laptop with 16 GB of RAM and a slow CPU (Intel(R) Core(TM) i3-4000M @ 2.40GHz). With this machine it is not even possible to load a single file into memory, so the focus is on the workflow listed below.
Dask workflow
- Split the data into smaller files
- Change dtypes and convert to parquet
- Clean the data and (again) save to parquet
- Take advantage of the fact that parquet is a columnar storage format
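A minimal sketch of those four steps, assuming `dask.dataframe` with the `pyarrow` engine; the paths, column names and dtypes below are placeholders, not necessarily the ones used in `NycTaxi.ipynb`:

```python
import dask.dataframe as dd

# 1. Split: read the huge CSVs in ~64 MB blocks instead of loading them whole
ddf = dd.read_csv("data/raw/*.csv", blocksize="64MB",
                  dtype={"passenger_count": "float64"})

# 2. Change dtypes and convert to parquet (smaller and faster to read than CSV)
ddf = ddf.astype({"passenger_count": "int8"})
ddf["tpep_pickup_datetime"] = dd.to_datetime(ddf["tpep_pickup_datetime"])
ddf.to_parquet("data/parquet/", engine="pyarrow")

# 3. Clean the data and save to parquet again
ddf = dd.read_parquet("data/parquet/", engine="pyarrow")
cleaned = ddf[(ddf["trip_distance"] > 0) & (ddf["fare_amount"] >= 0)]
cleaned.to_parquet("data/clean/", engine="pyarrow")

# 4. Columnar storage: read only the columns a given analysis needs
fares = dd.read_parquet("data/clean/", columns=["fare_amount"], engine="pyarrow")
print(fares["fare_amount"].mean().compute())
```

Because parquet is columnar, the last read only touches the `fare_amount` column on disk instead of the whole table, which is what keeps the analytics within the laptop's 16 GB of RAM.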
You are supposed to have Anaconda or Miniconda installed. Then install the following packages:
conda install -c conda-forge jupyterlab
conda install -c conda-forge nodejs
jupyter labextension install @jupyterlab/toc # Optional
jupyter labextension install dask-labextension
Create the environment with `conda env create -f environment.yml`
and add the kernel to Jupyter via
conda activate pydata_stg && \
python -m ipykernel install --user --name pydata_stg && \
conda deactivate
The main packages in the environment are:
- dask
- pyarrow
- python-graphviz (optional)
- holidays (because we need them!)