This is a repository for the workshop and lecture on Unsupervised Learning as taught at NGSchool2022: Machine Learning in Computational Biology on 16-17.09 in Jablonna, Poland.
Authors: Kasia Kędzierska and Kaspar Märtens
This tutorial, together with the preceding lecture [slides can be found here], was jointly prepared and taught by Kaspar Märtens and Kasia Kędzierska. Kasia is a final-year PhD student in Genomic Medicine and Statistics at the Wellcome Centre for Human Genetics at the University of Oxford. Kaspar recently finished his PhD in Statistical Machine Learning at the University of Oxford. Since then he has been a postdoctoral Research Fellow at the Alan Turing Institute and has worked at Apple Health AI. Currently, he is based in the Big Data Institute at the University of Oxford.
We split the two 90-minute sessions into a lecture and a workshop. In that space of time it is quite difficult to cover an area as vast as Unsupervised Learning. Our goal was to talk about the methods and explain their applications and some of the intuitions behind them. To understand them fully, we recommend exploring each method in more detail in the materials linked below.
What do we cover/explore?
- Dimensionality reduction:
  - Linear: PCA
  - Non-linear: tSNE, UMAP
- Clustering:
  - K-means
  - Hierarchical clustering
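To give a feel for how the linear and clustering methods listed above look in practice, here is a minimal sketch in base R. It uses the built-in iris data purely as a stand-in (an assumption for illustration; the tutorial itself works with TCGA BRCA data), and the choice of three clusters and Ward linkage is likewise only illustrative. tSNE and UMAP are left out because they need additional packages.

```r
# Minimal sketch on the built-in iris data (an assumption for illustration;
# the tutorial itself uses TCGA BRCA data).
x <- scale(iris[, 1:4])            # centre and scale the four numeric columns

# Linear dimensionality reduction: PCA
pca <- prcomp(x)                   # x is already centred and scaled
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
pcs <- pca$x[, 1:2]                # coordinates on the first two components

# K-means clustering (k = 3 chosen only because iris has three species)
set.seed(42)                       # k-means starts from random centres
km <- kmeans(x, centers = 3, nstart = 25)

# Hierarchical clustering on Euclidean distances, cut into 3 groups
hc <- hclust(dist(x), method = "ward.D2")
hc_clusters <- cutree(hc, k = 3)

# Compare both clusterings with the known species labels
print(round(var_explained, 2))
print(table(kmeans = km$cluster, species = iris$Species))
print(table(hclust = hc_clusters, species = iris$Species))
```

The two contingency tables show how well each clustering recovers the species labels, which is only possible here because iris happens to come with labels; in a truly unsupervised setting no such reference exists.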
The tutorial is self-contained and you should be able to run it at home as well. It includes some questions and exercises, as well as points to ponder. We will discuss them all at the workshop.
In order to be able to run this tutorial you need:
- RStudio v1.0.136 or later
- R >= 4.0
- and a few R packages.
The packages you need are all listed in the R_packages_list.txt file in the scripts directory.
We also prepared the prep_help.R script, which checks whether you have all the necessary packages and, if not, tries to install them. You can either open the script in RStudio and click Run, or use the command line:
Rscript --vanilla scripts/prep_help.R
It might also be good to save the output of the script for potential debugging. You can use tee, for example, to copy the command-line output to a file (in Bash, |& is shorthand for 2>&1 |, so error messages are captured as well):
Rscript --vanilla scripts/prep_help.R |& tee prep_help.log
You are all done if the last message you saw was:
SUCCESS: Fantastic! All packages installed and ready.
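If you would rather check by hand which packages are missing, the idea can be sketched in a few lines of R. This is not the code of prep_help.R; the helper name missing_packages is hypothetical, and the sketch assumes that R_packages_list.txt holds one package name per line and that you run it from the repository root.

```r
# Hypothetical helper, not taken from prep_help.R: returns the names of
# packages that cannot be loaded from the current R library.
missing_packages <- function(pkgs) {
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_installed]
}

# Assumption: the list file has one package name per line.
list_file <- "scripts/R_packages_list.txt"
if (file.exists(list_file)) {
  pkgs <- trimws(readLines(list_file))
  pkgs <- pkgs[nzchar(pkgs)]       # drop empty lines
  print(missing_packages(pkgs))    # character(0) means nothing is missing
}
```

The sketch only reports what is missing; installing the packages is left to prep_help.R.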
You don’t have to worry about your setup - just follow the NGSchool2022 IT instructions and pull the appropriate docker image. All the packages are already installed there.
The repository contains a few files:
- notebooks/00_data_preparation - the file with the code used to download, prepare and normalise the data for the tutorial. We used the TCGAbiolinks package, which can access GDC data and download and read it into your R session directly.
- notebooks/01_unsupervised_learning_in_R - both the slides and the Quarto file with the code that generated them, covering unsupervised learning in R with the Palmer Penguins data set.
- notebooks/tutorial/tutorial.Rmd - the self-contained tutorial, where we look at the TCGA BRCA data set & annotation from this paper.
- How to explain PCA to your grandmother?
- Nice blog post on PCA
- Comparing UMAP and tSNE for single cell
- Interactive PCA visualisations on setosa.io
- Understanding PCA using Shiny and Stack Overflow data
Generative art
- Lior Pachter: “Tl;dr: definitely time to stop making t-SNE & UMAP plots.” and some responses: Dmitry Kobak: “I still think this claim is absurd”
Here are some pointers to the literature on topics that Kaspar briefly mentioned:
- Modern non-linear extensions of PCA: Variational Autoencoders
- Bayesian statistics (review paper)
- Bayesian non-parametrics (MLSS tutorial)