xcms-preprocessing-ioslides.Rmd

---
title: "Metabolomics data pre-processing using xcms"
author: "Johannes Rainer<br><strong>Eurac Research</strong>, Bolzano, Italy<br>johannes.rainer@eurac.edu - github: jorainer twitter: jo_rainer"
date: "24 June 2018; Updated 20 June 2019"
output: 
  ioslides_presentation:
    widescreen: true
    fig_width: 7
    fig_height: 5
    fig_retina: 2
    fig_caption: false
    transition: faster
    css: jostyle.css
---

<style type="text/css">

slides > slide:not(.nobackground):after {
  content: '';
}

slides > slide {
    -webkit-transition:none !important;transition:none !important;
}

.build > * {
  -webkit-transition: opacity 0.1s ease-in-out;
  -webkit-transition-delay: 0.1s;
  -moz-transition: opacity 0.1s ease-in-out 0.1s;
  -o-transition: opacity 0.1s ease-in-out 0.1s;
  transition: opacity 0.1s ease-in-out 0.1s;
}

</style>

<!-- 
For options, settings etc see:
https://rmarkdown.rstudio.com/ioslides_presentation_format.html
https://bookdown.org/yihui/rmarkdown/ioslides-presentation.html#adding-a-logo

    css: style.css


f fullscreen
command + increase size (zoom)
w toggle widescreen
o enable overview
h enable code highlight


Export to pdf:
webshot function from webshot package.
Might have to change stuff in the style sheet to make it working.
library(webshot)
install_phantomjs()
fn <- paste0("file://", normalizePath("xcms-preprocessing-ioslides.html"))
webshot(fn, "xcms-preprocessing-ioslides.pdf", delay = 4, vwidth = 1024, vheight = 768)
-->

## Hands-on? {.build}

<div>
- https://github.com/jotsetung/metabolomics2018
- Open *xcms-preprocessing.Rmd* in e.g. [RStudio](https://www.rstudio.com/).
</div>

## Content {.build}

This presentation focuses on updates of `xcms`:

- re-use data structures from Bioconductor's `MSnbase` package
- simplified raw data access

<div>
Content:

- Basic MS data handling ([`MSnbase`](https://github.com/lgatto/MSnbase))
- Simple MS data centroiding (`MSnbase`)
- LC-MS data pre-processing ([`xcms`](https://github.com/sneumann/xcms)):
  - chromatographic peak detection
  - alignment
  - correspondence
</div>


# Basic MS data handling

## Data import and representation {.build}

```{r load-libs, message = FALSE, results = "hide", echo = FALSE}
library(xcms)
library(magrittr)
## Set up parallel processing using 3 cores
register(bpstart(MulticoreParam(3)), default = TRUE) 
```

<div>
- **Data set**:
  - subset from 2 files with pooled human serum samples
  - UHPLC (Agilent 1290) coupled with Q-TOF MS (TripleTOF 5600+ AB Sciex)
  - HILIC-based chromatographic separation
</div>
<div>
- Define file names and sample descriptions.
```{r load-data, message = FALSE }
fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE)

## Define a data.frame with additional information on the files.
pd <- data.frame(file = basename(fls), injection_idx = c(1, 19),
                 sample = c("POOL_1", "POOL_2"), group = "POOL")
```
</div>


## Data import and representation {.build}

<div>
- Read data from mzML/mzXML/CDF files with `readMSData` function.
</div>

<div>
```{r, message = FALSE}
## Read the data
data <- readMSData(fls, pdata = new("NAnnotatedDataFrame", pd), 
                   mode = "onDisk") 
```
</div>

<div>
- `mode = "onDisk"`: reads only spectrum header from files, but no data.
- on-disk mode enables analysis of very large experiments.
</div>

## Basic data access {.build}

- Access sample/phenotype information using `pData` or `$`:

<div>
```{r show-fData, message = FALSE}
## Access phenotype information
pData(data)
```
</div>
<div>
```{r show-fData2, message = FALSE}
## Or individual columns directly using the $ operator
data$injection_idx
```
</div>


## Basic data access {.build}

<div>
- Access general spectrum information: `msLevel`, `centroided`, `rtime`,
  `polarity`.
</div>
<div>
- Access MS data: `spectra`, `mz`, `intensity`: reads data from files.
</div>
<div>
- In most cases we work with subsets: use filter functions to 
  subset the data:
  - `filterFile` subset to individual files/samples.
  - `filterRtime` restrict to specific retention time window.
  - `filterMz` restrict to m/z range.
  - `filterMsLevel` subset to certain MS level(s).
</div>

## {.build}

- Example: extract all spectra measured between 180 and 181
  seconds. Using the `%>%` (pipe) operator to avoid nested function calls.

```{r spectra-filterRt, message = FALSE }
## Get all spectra measured between 180 and 181 seconds
## Use %>% for better readability
sps <- data %>%
    filterRt(rt = c(180, 181)) %>%
    spectra
```
```{r spectra-filterRt-length}
## How many spectra?
length(sps)
```
```{r spectra-filterRt-sapply}
## From which file?
sapply(sps, fromFile) 
```

## {.build}

- Example: plot the data from the last spectrum

```{r spectrum-plot, message = FALSE, fig.width = 5, fig.height = 3.5}
plot(sps[[6]]) 
```

- But how to get chromatographic data?


## {.build}

- `chromatogram`: extract chromatographic data.


- Example: XIC for Serine (m/z of [M+H]+ adduct 106.0455).

```{r serine-xic, message = FALSE, fig.height = 4, fig.width = 7.5, eval = FALSE}
data %>%
    filterRt(rt = c(175, 189)) %>%
    filterMz(mz = c(106.02, 106.07)) %>%
    chromatogram(aggregationFun = "max") %>%
    plot()
```
```{r serine-xic-plot, message = FALSE, fig.height = 3.3, fig.width = 7.5, echo = FALSE}
par(mar = c(4, 4.5, 1, 0.5))
data %>%
    filterRt(rt = c(175, 189)) %>%
    filterMz(mz = c(106.02, 106.07)) %>%
    chromatogram(aggregationFun = "max") %>%
    plot()
```


## Centroiding of profile MS data {.build .faster}

<div>
- *centroiding* is the process in which mass peaks are reduced to a
  single, representative signal, their centroids.
</div>
<div>
- `xcms`, specifically *centWave* was designed for centroided data.
- `MSnbase` provides basic tools to perform MS data smoothing and 
  centroiding: `smooth` and `pickPeaks`.
</div>
<div>
- Example: show the combined m/z, rt and intensity data for 
  Serine.

```{r serine-profile-mode-data, message = FALSE, eval = FALSE}
data %>%
    filterRt(rt = c(175, 189)) %>%
    filterMz(mz = c(106.02, 106.07)) %>%
    plot(type = "XIC") 
```
</div>

## {.flexbox .vcenter}

```{r serine-profile-mode-data2, message = FALSE, echo = FALSE}
## Filter the MS data to the signal from the Serine ion and plot it using
## type = "XIC"
data %>%
    filterRt(rt = c(175, 189)) %>%
    filterMz(mz = c(106.02, 106.07)) %>%
    plot(type = "XIC") 
```

- plot `type = "XIC"` creates a combined chromatographic and *map* 
  visualization of the data.
  
## {.build}

<div>
- Example: smooth data with Savitzky-Golay filter followed by a centroiding that
  simply reports the maximum signal for each mass peak in each spectrum. See
  `?pickPeaks` for more advanced options.

```{r centroiding, message = FALSE, warning = FALSE, eval = FALSE}
## Smooth the signal, then do a simple peak picking.
data_cent <- data %>%
    smooth(method = "SavitzkyGolay", halfWindowSize = 6) %>%
    pickPeaks()
```
</div>
<div>
```{r, eval = FALSE}
## Plot the centroided data for Serine
data_cent %>%
    filterRt(rt = c(175, 189)) %>%
    filterMz(mz = c(106.02, 106.07)) %>%
    plot(type = "XIC") 
```
</div>

----

```{r centroiding2, message = FALSE, warning = FALSE, fig.width = 10, fig.height = 5, echo = FALSE}
## Smooth the signal, then do a simple peak picking.
data_cent <- data %>%
    smooth(method = "SavitzkyGolay", halfWindowSize = 6) %>%
    pickPeaks()

## Plot the centroided data for Serine
data_cent %>%
    filterRt(rt = c(175, 189)) %>%
    filterMz(mz = c(106.02, 106.07)) %>%
    plot(type = "XIC") 
```

## {.build}

<div>
- Note: since data is not available in memory, data smoothing and centroiding is
  applied *on-the-fly* each time m/z or intensity values are accessed.
</div>
<div>
- To make changes persistent: export and re-read the data.
```{r remove-exported, message = FALSE, echo = FALSE, results = "hide"}
lapply(basename(fileNames(data)), function (z) {
    if (file.exists(z))
        file.remove(z)
})
```
</div>

```{r export-centroided, message = FALSE, warning = FALSE }
## Write the centroided data to files with the same names in the current
## directory
fls_new <- basename(fileNames(data))
writeMSData(data_cent, file = fls_new)

## Read the centroided data.
data_cent <- readMSData(fls_new, pdata = new("NAnnotatedDataFrame", pd),
                        mode = "onDisk") 
```


# LC-MS data preprocessing


## Chromatographic peak detection {.build}


- Aim: identify chromatographic peaks in the data.

<div>
- Function: `findChromPeaks`.
</div>
<div>
- Available methods:
  - *matchedFilter* (`MatchedFilterParam`) [Smith *Anal. chem.* 2006].
  - *centWave* (`CentWaveParam`) [Tautenhahn *BMC Bioinformatics* 2008].
  - *massifquant* (`MassifquantParam` [Conley *Bioinformatics* 2014].
</div>

## | centWave

- First step: identify regions of interest.

```{r out.width="600px", echo = FALSE}
knitr::include_graphics("images/centWave-ROI.png")
```

<div style="position:absolute; top:95%; left:60%; font-size: 50%">
[Tautenhahn *BMC Bionformatics* 2008]
</div>

----

- 2nd: peak detection in these regions using continuous wavelet transform.

```{r out.width="600px", echo = FALSE}
knitr::include_graphics("images/centWave-CWT.png")
```

<div style="position:absolute; top:95%; left:60%; font-size: 50%">
[Tautenhahn *BMC Bionformatics* 2008]
</div>

## {.build}

- Crucial centWave parameters: `peakwidth`, `ppm`; list all with `?CentWaveParam`.

- `peakwidth`: minimal and maximal expected peak width.

<div>
- Example: extract chromatographic data for Serine.

```{r centWave-default, message = FALSE, results = "hide", fig.height = 3.5, fig.width = 5, eval = FALSE}
srn_chr <- chromatogram(data_cent, rt = c(165, 200),
                        mz = c(106.03, 106.06),
                        aggregationFun = "max")
plot(srn_chr)
```
</div>
<div>
```{r, echo = FALSE, fig.height = 3.2, fig.width = 5}
srn_chr <- chromatogram(data_cent, rt = c(165, 200),
                        mz = c(106.03, 106.06),
                        aggregationFun = "max")
par(mar = c(4, 4.5, 1.5, 0.5))
plot(srn_chr)
```
</div>

## {.build}

- **New**: peak detection on `Chromatogram` objects.
- Perform peak detection using default centWave parameters in that data.

```{r centWave-default2, message = FALSE, results = "hide"}
cwp <- CentWaveParam()
findChromPeaks(srn_chr, param = cwp)
```

- **What went wrong?** What's the default for `peakwidth`?
  
```{r centWave-default3, message = FALSE}
peakwidth(cwp)
```
  
- Default for `peakwidth` does not match the current data.

  
## {.smaller .build}

- Reduce `peakwidth` and run peak detection again.
- *Note*: `chromatogram` will also extract identified chrom peaks.

```{r centWave-adapted, message = FALSE, fig.height = 3.5, width = 5, eval = FALSE}
peakwidth(cwp) <- c(2, 10)
srn_chr <- findChromPeaks(srn_chr, param = cwp)
## Plot the data and higlight identified peak area
plot(srn_chr)
```

```{r, message = FALSE, fig.height = 3.5, width = 5, echo = FALSE}
cwp <- CentWaveParam(peakwidth = c(2, 10))
srn_chr <- findChromPeaks(srn_chr, param = cwp)
## Plot the data and higlight identified peak area
par(mar = c(4, 4.5, 1.5, 0.5))
plot(srn_chr)
```

- Ideally check settings on more known compounds.

## {.smaller .build}

- `ppm`: maximal allowed scattering of m/z values for one ion.

- Example: evaluate the m/z scattering of the signal for Serine.

```{r Serine-mz-scattering-plot, message = FALSE, fig.height = 3.5, width = 5 }
## Restrict the data to signal from Serine
srn <- data_cent %>%
    filterRt(rt = c(179, 186)) %>%
    filterMz(mz = c(106.04, 106.06))

## Plot the data
plot(srn, type = "XIC") 
```

## {.build}

- Example: calculate the difference of m/z values between consecutive scans.

```{r define-ppm, message = FALSE }
## Extract mz values for Serine from first file
srn_mz <- unlist(mz(filterFile(srn, 1)))
## The difference between m/z values from consecutive scans in ppm
diff(srn_mz) * 1e6 / mean(srn_mz) 
```

- This should be performed ideally on more compounds.
- `ppm`: large enough to capture the full chromatographic peak.


## {.build}

<div>
- Perform chromatographic peak detection with our data set-specific settings.

```{r findPeaks-centWave, message = FALSE }
## Perform peak detection
ppm(cwp) <- 30
data_cent <- findChromPeaks(data_cent, param = cwp) 
```
</div>
<div>
- Result: `XCMSnExp` object extends the `OnDiskMSnExp`, contains preprocessing
  results **and** enables data access as described above.  </div>

## {.build}

- Use `chromPeaks` to access the peak detection results. 

```{r  xcmsnexp, message = FALSE}
head(chromPeaks(data_cent), n = 5) 
```
  

## Alignment - in short {.build}

<div>
- Aim: adjust shifts in retention times between samples.
</div>
<div>
- Function: `adjustRtime`.
</div>
<div>
- Available methods:
  - *obiwarp* (`ObiwarpParam`) [Prince *Anal. chem.* 2006]: warps the (full) 
    data to a reference sample.
</div>
<div>
  - *peakGroups* (`PeakGroupsParam`) [Smith *Anal. chem.* 2006]: 
    - align spectra from different samples based on *hook* peaks.
    - Need to define the hook peaks first: peaks present in most/all samples.
</div>


## {.build}

- Example: perform a peak grouping to define potential hook peaks and align the
  samples based on these.
- *Note:* details on initial peak grouping provided in the next section.

```{r alignment-correspondence, message = FALSE }
## Define the settings for the initial peak grouping
pdp <- PeakDensityParam(sampleGroups = data_cent$group, bw = 1.8,
                        minFraction = 1, binSize = 0.02)
data_cent <- groupChromPeaks(data_cent, pdp)
```

<div>
- Align the samples.
```{r alignment-correspondence-alignment, message = FALSE}
## Define settings for the alignment
pgp <- PeakGroupsParam(minFraction = 1, span = 0.6)
data_cent <- adjustRtime(data_cent, param = pgp)
```
</div>


## {.build}

- Inspect difference between raw and adjusted retention times.

```{r alignment-result, message = FALSE, fig.width = 8, fig.height = 4 , eval = FALSE}
plotAdjustedRtime(data_cent) 
```
```{r alignment-result-plot, message = FALSE, fig.width = 8, fig.height = 3.5 , echo = FALSE}
par(mar = c(4, 4.5, 0.5, 0.5))
plotAdjustedRtime(data_cent)
```

- Difference between raw and adjusted retention times resonable.
- Hook peaks along the full retention time range.

## {.build}

- Plot BPC before and after alignment.
- Plot XIC of known compounds before and after alignment.
```{r serine-xic-adjusted, message = FALSE, fig.width = 8, fig.height = 3.1 }
## Use adjustedRtime parameter to access raw/adjusted retention times
par(mfrow = c(1, 2), mar = c(4, 4.5, 0.9, 0.5))
plot(chromatogram(data_cent, mz = c(106.04, 106.06),
                  rt = c(179, 186), adjustedRtime = FALSE))
plot(chromatogram(data_cent, mz = c(106.04, 106.06),
                  rt = c(179, 186))) 
```


## Correspondence {.build}

<div>
- Aim: group signal (peaks) from the same ion across samples.
</div>

<div>
- Function: `groupChromPeaks`.
- Methods available:
  - *peak density* (`PeakDensityParam`) [Smith *Anal. chem.* 2006].
  - *nearest* (`NearestPeaksParam`) [Katajamaa *Bioinformatics* 2006].
</div>


## | peak density {.build}

<div>
- Iterate through slices along m/z.
- Compare retention times of peaks within each slice and group peaks if they are
  close.
</div>
<div>
- Distribution of peaks along retention time axis is used to define which peaks
  to group.
</div>
<div>
- `plotChromPeakDensity`: plot distribution of identified peaks along rt for a
  given m/z slice; simulates correspondence analysis.
</div>

## {.build}

<div>
- Example: 
  - Plot data for the m/z slice containing the Serine peak.
  - Use `plotChromPeakDensity` to simulate a correspondence analysis in the same
    slice.
</div>
<div>
```{r correspondence-example, message = FALSE, eval = FALSE}
## Extract a BPC for an m/z slice containing serine
bpc_srn <- chromatogram(data_cent, mz = c(106.04, 106.06),
                        aggregationFun = "max")
## Get default parameters for the grouping
pdp <- PeakDensityParam(sampleGroups = data_cent$group)

## Dry-run correspondence and show the results.
plotChromPeakDensity(bpc_srn, param = pdp)
 
```
</div>

## {.smaller}

```{r correspondence-example-plot, message = FALSE, width = 7, height = 5, echo = FALSE}
## Plot the BPC for the m/z slice containing serine
par(mfrow = c(2, 1), mar = c(4, 4.3, 1, 0.5))
bpc_srn <- chromatogram(data_cent, mz = c(106.04, 106.06),
                        aggregationFun = "max")
## Get default parameters for the grouping
pdp <- PeakDensityParam(sampleGroups = data_cent$group)

## Dry-run correspondence and show the results.
plotChromPeakDensity(bpc_srn, param = pdp)
 
```

- Points are peaks per sample;
- black line: peak density distribution;
- grey rectangles: grouped peaks (features).


##

- Parameters:
  - `binSize`: m/z width of the data slice in which peaks are grouped.
  - `bw` defines the smoothness of the density function.
  - `maxFeatures`: maximum number of features to be defined in one bin.
  - `minFraction`: minimum proportion of samples (of one group!) for 
    which a peak has to be present.
  - `minSamples`: minimum number of samples a peak has to be present.


> - Parameters `minFraction` and `minSamples` depend on experimental layout!


> - `binSize` should be small enough to avoid peaks from different ions 
    measured at similar retention times to be grouped together.

> - `bw` is the most important parameter.

## {.build}
  
- Test default settings for a slice containing ions with similar m/z and rt:
  isomers Betaine and Valine ([M+H]+ m/z 118.08625).

```{r correspondence-bw, message = FALSE, eval = FALSE}
## Plot the chromatogram for an m/z slice containing Betaine and Valine
mzr <- 118.08625 + c(-0.01, 0.01)
chr <- chromatogram(data_cent, mz = mzr, aggregationFun = "max")
## Correspondence in that slice using default settings
pdp <- PeakDensityParam(sampleGroups = data_cent$group)
plotChromPeakDensity(chr, param = pdp)

```

##

```{r correspondence-bw-plot, message = FALSE, width = 7, height = 5, echo = FALSE}
mzr <- 118.08625 + c(-0.01, 0.01)
chr <- chromatogram(data_cent, mz = mzr, aggregationFun = "max")
## Correspondence in that slice using default settings
pdp <- PeakDensityParam(sampleGroups = data_cent$group)
plotChromPeakDensity(chr, param = pdp)

```

> - **Correspondence failed**: all peaks grouped into one feature!
> - Default for `bw` (`30`) too large for present data set.

## {.build}

- `plotChromPeakDensity` allows to evaluate and tune settings on data subsets.

- Test smaller `bw` (`1.8`) on the same slice.

```{r correspondence-bw2, message = FALSE, eval = FALSE}
## Reducing the bandwidth
pdp <- PeakDensityParam(sampleGroups = data_cent$group, bw = 1.8)
plotChromPeakDensity(chr, param = pdp) 
```


##

```{r correspondence-bw2-plot, message = FALSE, fig.width = 7, fig.height = 5, echo = FALSE}
pdp <- PeakDensityParam(sampleGroups = data_cent$group, bw = 1.8)
plotChromPeakDensity(chr, param = pdp) 
```

> - Reducing the `bw` enabled grouping of isomers into different
    features.


##

- Perform the correspondence analysis with tuned settings.

```{r  correspondence-analysis, message = FALSE}
pdp <- PeakDensityParam(sampleGroups = data_cent$group, bw = 1.8,
                        minFraction = 0.4, binSize = 0.02)

## Perform the correspondence analysis
data_cent <- groupChromPeaks(data_cent, param = pdp) 
```

> - Evaluate results after correspondence: `plotChromPeakDensity` with `simulate
    = FALSE` shows the actual results from the correspondence.

> - Feature definitions are stored within the `XCMSnExp` object, can be accessed
    with `featureDefinitions`.


## {.build}

- Use `featureValues` to access the features' abundance estimates.

```{r}
## feature intensity matrix
fmat <- featureValues(data_cent, value = "into", method = "maxint")
head(fmat) 
```

- `featureValues` parameters:
  - `value`: name of the column in `chromPeaks` that should be returned.
  - `method`: for features with multiple peaks in one sample: from which
    peak should the value be returned?

## Missing values {.build}

<div>
- Peak detection may have failed in one sample.
- Ion is not present in a sample.
</div>
<div>
- `fillChromPeaks` allows to *fill-in* signal for missing peaks from the
  feature area (defined by the median rt and mz of all peaks assigned to the
  feature).
</div>
<div>
- `fillChromPeaks` Parameters:
  - `expandMz`, `expandRt`: expands the region from which signal is integrated
    in m/z or rt dimension. A value of 0 means no expansion, 1 means the region
    is grown by half of the feature's m/z width on both sides.
  - `ppm`: expand the m/z width by a m/z dependent value.
</div>


## Summary {.build}

<div>
- The new data objects and functions aim to:
  - simplify data access and inspection of results
  - facilitate data set-dependent definition  of algorithm parameters.
</div>
<div>
- More work to come for the analysis of chromatographic data (SRM/MRM)
  and eventually for data normalization.
</div>
<div>
- **Don't blindly use default parameters!**
</div>


## Acknowledgments

> - Jan Stanstrup (University of Copenhagen, Denmark)

> - Laurent Gatto (University of Cambridge, UK); `MSnbase`, `mzR`.

> - Steffen Neumann (IPB Halle, Germany); `xcms`, `mzR`

> - **YOU for your attention!**

<div style="position:absolute; left:30%; top:90%">
https://github.com/jorainer/metabolomics2018
</div>