xcms-preprocessing-bullets.Rmd

---
title: "Metabolomics data pre-processing using xcms"
author: 
- name: "Johannes Rainer"
  affiliation: "Eurac Research, Bolzano, Italy; johannes.rainer@eurac.edu github/twitter: jotsetung"
date: "24 June 2018"
---

# Background

- `xcms` part of Bioconductor since 2006, *standard* toolbox for LC/GC-MS 
  data preprocessing.
- Major changes in `xcms` version > 3:
  - re-use data structures from Bioconductor's `MSnbase` package
  - native MSn support
  - new functions simplifying raw data access
  - internal changes and code cleanup

## Mass spectrometry

- Mass spectrometry (MS): measure ion abundances in a sample.
- Data is represented/measured in a spectrum.

![](images/MS.png)

- Many ions have same/similar mass-to-charge ratio (m/z).
- Additional separation of compounds by other properties
  (hydrophobic/hydrophilic): liquid or gas chromatography.

![](images/LCMS.png)


## Definitions and common naming convention

- chromatographic peak: signal from an ion along retention time
  dimension.
- chromatographic peak detection: process in which chromatographic peaks
  are identified within each file.
- alignment: adjust retention time differences between files.
- correspondence: grouping of chromatographic peaks (presumably from the
  same ion) across files.
- feature: chromatographic peaks grouped across samples.

	
# Workflow: metabolomics data preprocessing using `xcms`

The workflow is focused on the new `xcms` interface and covers:

- Basic MS data handling (`MSnbase`)
- Simple MS data centroiding (`MSnbase`)
- LC-MS data pre-processing (`xcms`):
  - chromatographic peak detection
  - alignment
  - correspondence
- Not covered:
  - data normalization
  - compound identification
  - differential abundance analysis


## Prerequisites

- Rstudio
- R version >= 3.5.0
- Libraries:

```{r eval = FALSE}
source("https://bioconductor.org/biocLite.R")
biocLite(c("xcms", "MSnbase", "doParallel", "msdata", "magrittr",
	   "devtools"))
## Need xcms version > 3.3.1
if (packageVersion("xcms") < "3.3.1")
    devtools::install_github("sneumann/xcms", ref = "master") 
```


## Data import and representation

- Read data from mzML/mzXML/CDF files with the `readMSData` function.
- `mode = "onDisk"` reads only spectrum header from the files, 
  but no data.
- *on-disk* mode enables analysis of very large experiments.
- Interactive code: read the toy data set:
  - subset from 2 files with pooled serum samples
  - UHPLC (Agilent 1290) coupled with Q-TOF MS (TripleTOF 5600+ AB Sciex)
  - HILIC-based chromatographic separation

```{r load-data}
library(MSnbase)
library(xcms)
library(doParallel)
library(magrittr)

## Define the file names.
fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE)

## Define a data.frame with additional information on the files.
pd <- data.frame(file = basename(fls),
                 injection_idx = c(1, 19),
                 sample = c("POOL_1", "POOL_2"),
                 group = "POOL")
data <- readMSData(fls, pdata = new("NAnnotatedDataFrame", pd), 
                   mode = "onDisk") 
```


- Parallel processing setup should be defined at the start.
- Most functions from `xcms` and `MSnbase` are parallelized
  *per-file* and use the registered setup.
- Interactive code: parallel processing setup.

```{r parallel-setup, message = FALSE }
## Set up parallel processing using 3 cores
registerDoParallel(3)
register(bpstart(DoparParam()), default = TRUE) 
```


- `data` is an `OnDiskMSnExp`; access phenotype information 
  using `pData` or `$`, general spectrum information using `fData`.
- Interactive code: get to know the `OnDiskMSnExp` object.

```{r show-fData}
## Access phenotype information
pData(data)

## Or individual columns directly using the $ operator
data$injection_idx

## Access spectrum header information
head(fData(data))
```


## Basic data access and visualization

- MS data in `OnDiskMSnExp`s is organized by spectrum.
- Access general spectrum information with `msLevel`, `centroided`, 
  `rtime`, `polarity`.
- Use `fromFile` to know which values belong to which file/sample.
- `Spectrum` object: container for m/z and intensity values.
- Interactive code: access general spectrum information.

```{r  general-access}
## Get the retention time
head(rtime(data))

## How many spectra are there?
length(rtime(data))

## Get the retention times splitted by file.
rts <- split(rtime(data), fromFile(data))

## The result is a list of length 2. The number of spectra per file can
## then be determined with
lengths(rts) 
```


- `spectra` gets list of all spectra (from all files). Can be slow
  because the full data is read from the files.
- In most cases we work with subsets anyway: use filter functions to 
  subset the data:
  - `filterFile` subset to individual files/samples.
  - `filterRtime` restrict to specific retention time window.
  - `filterMz` restrict to m/z range.
  - `filterMsLevel` subset to certain MS level(s).
- Data access will be fast on indexed mzML, mzXML and CDF files.

- Interactive code: extract all spectra measured between 180 and 181
  seconds. Using the `%>%` (pipe) operator to avoid nested function calls.

```{r spectra-filterRt}
## Get all spectra measured between 180 and 181 seconds
## Use %>% for better readability
sps <- data %>%
    filterRt(rt = c(180, 181)) %>%
    spectra

## How many spectra?
length(sps)

## From which file?
sapply(sps, fromFile) 
```


- Interactive code: plot the data from the last spectrum

```{r spectrum-plot}
plot(sps[[6]]) 
```


- Spectra: intensities along the m/z dimension for discrete 
  retention times.
- `chromatogram`: extract chromatographic data (intensities
  along retention time for a certain m/z range).
- Interactive code: get the total ion chromatogram for each file.

```{r chromatogram}
## Get chromatographic data (TIC) for an m/z slice
chr <- chromatogram(data)
chr

## Plot the tic
plot(chr) 
```


- `chromatogram`: TIC with `aggregationFun = "sum"`, BPC with 
  `aggregationFun = "max"`.
- Interactive code: extract ion chromatogram for Serine ([M+H]+ adduct 
  m/z 106.0455 matches the second largest peak in spectrum above).

```{r serine-xic}
## Plot first the spectrum
par(mfrow = c(1, 2))
plot(mz(sps[[6]]), intensity(sps[[6]]), type = "h", xlab = "m/z",
     ylab = "intensity", main = rtime(sps[[6]]))
## Highlight the m/z range from which we extract the Serine XIC
rect(106.02, 0, 106.07, 70000, border = "#ff000040")

## Extract and plot the XIC for Serine
data %>%
    filterRt(rt = c(175, 189)) %>%
    filterMz(mz = c(106.02, 106.07)) %>%
    chromatogram(aggregationFun = "max") %>%
    plot()
```


- Functionality provides an easy access to raw data.
- `spectra` to get intensities along m/z for discrete retention time.
- `chromatogram` to get intensities along rt for m/z range.
- Use `rtime`, `mz`, `intensity` to access the MS values.


## Centroiding of profile MS data

- *centroiding* is the process in which mass peaks are reduced to a
  single, representative signal, their centroids.
- `xcms`, specifically *centWave* was designed for centroided data.
- Proper centroiding can improve data accuracy.
- `MSnase` provides basic tools to perform MS data smoothing and 
  centroiding: `smooth` and `pickPeaks`.
- Interactive code: show the combined m/z, rt and intensity data for 
  Serine.

```{r serine-profile-mode-data}
## Subset data to Serine ion and plot using type = "XIC"
data %>%
    filterRt(rt = c(175, 189)) %>%
    filterMz(mz = c(106.02, 106.07)) %>%
    plot(type = "XIC") 
```


- plot `type = "XIC"` creates a combined chromatographic and *map* 
  visualization of the data.
- Interactive code: smooth data in m/z dimension with Savitzky-Golay 
  filter followed by a centroiding that simply reports the maximum 
  signal for each mass peak in each spectrum. See `?pickPeaks` for 
  more advanced options.

```{r centroiding}
## Smooth the signal, then do a simple peak picking.
data_cent <- data %>%
    smooth(method = "SavitzkyGolay", halfWindowSize = 6) %>%
    pickPeaks()

## Plot the centroided data for Serine
data_cent %>%
    filterRt(rt = c(175, 189)) %>%
    filterMz(mz = c(106.02, 106.07)) %>%
    plot(type = "XIC") 
```


- Note: data smoothing and centroiding is applied to the data 
  *on-the-fly* each time m/z or intensity values are accessed.
- Interactive code: export the smoothed data to new files and 
  re-read  the data.

```{r export-centroided}
## Write the centroided data to files with the same names in the current
## directory
fls_new <- basename(fileNames(data))
writeMSData(data_cent, file = fls_new)

## Read the centroided data.
data_cent <- readMSData(fls_new, pdata = new("NAnnotatedDataFrame", pd),
			mode = "onDisk") 
```


## LC-MS data preprocessing


### Chromatographic peak detection

- Aim: identify chromatographic peaks in the data.
- Function: `findChromPeaks`.
- Available methods:
  - *matchedFilter* (`MatchedFilterParam`) [Smith Anal. chem. 2006].
  - *centWave* (`CentWaveParam`) [Tautenhahn BMC Bioinformatics 2008].
  - *massifquant* (`MassifquantParam`) [Conley Bioinformatics 2014].

**centWave**:
1) identify regions of interest.

![](images/centWave-ROI.png)

2) perform peak detection in these regions using continuous wavelet
   transform (CWT).

![](images/centWave-CWT.png)


- CentWave parameters:

```{r  centwave-help}
?CentWaveParam 
```


- Crucial parameters: `peakwidth`, `ppm`.
- `peakwidth`: minimal and maximal expected peak width. Depends on the
  LC settings of the experiment.
- Interactive code: extract chromatographic data for Serine and perform
  peak detection using default parameters

```{r centWave-default}
## Get the XIC for serine in the first file
srn_chr <- chromatogram(data_cent, rt = c(165, 200),
			mz = c(106.03, 106.06),
			aggregationFun = "max")[1, 1]
## Plot the data
par(mfrow = c(1, 1), mar = c(4, 4.5, 1, 1))
plot(srn_chr)

## Get default centWave parameters
cwp <- CentWaveParam()

## "dry-run" peak detection on the XIC.
findChromPeaks(srn_chr, param = cwp)

peakwidth(cwp) 
```


- What went wrong? Default for `peakwidth` does not match data.
- Interactive code: change `peakwidth` and run again.

```{r centWave-adapted}
cwp <- CentWaveParam(peakwidth = c(2, 10))

pks <- findChromPeaks(srn_chr, param = cwp)

## Plot the data and higlight identified peak area
plot(srn_chr)
rect(pks[, "rtmin"], 0, pks[, "rtmax"], pks[, "maxo"], border = "#00000040") 
```


- Ideally check settings on more known compounds.
- `ppm`: maximal allowed scattering of m/z values for one ion.
- Interactive code: evaluate the m/z scattering of the signal for Serine.

```{r Serine-mz-scattering-plot}
## Restrict the data to signal from Sering
srn <- data_cent %>%
    filterRt(rt = c(179, 186)) %>%
    filterMz(mz = c(106.04, 106.06))

## Plot the data
plot(srn, type = "XIC") 
```


- Interactive code: calculate the difference in m/z values between 
  consecutive scans.

```{r define-ppm}
## Extract the Serine data for one file as a data.frame
srn_df <- as(filterFile(srn, 1), "data.frame")
head(srn_df)

## The difference between m/z values from consecutive scans in ppm
diff(srn_df$mz) * 1e6 / mean(srn_df$mz)
```


- Ideally this should also be performed on more compounds.
- `ppm` should be large enough to capture the full chromatographic peak.
- Interactive code: perform chromatographic peak detection.

```{r findPeaks-centWave}
## Perform peak detection
cwp <- CentWaveParam(peakwidth = c(2, 10), ppm = 30)
data_cent <- findChromPeaks(data_cent, param = cwp) 
```


- Result: `XCMSnExp` object extends the `OnDiskMSnExp`, contains 
  preprocessing results *and* enables data access as described above.
- Interactive code: access identified chromatographic peaks.

```{r xcmsnexp}
## Access the peak detection results
head(chromPeaks(data_cent))
```


- For quality assessment, we could also do some summary statistics on 
  the identified peaks or plot location of peaks in the m/z - rt plane
  with `plotChromPeaks`.


### Alignment

- Aim: adjusts shifts in retention times between samples.
- Interactive code: plot base peak chromatograms of all files.

```{r alignment-bpc-raw}
## Extract base peak chromatograms
bpc_raw <- chromatogram(data_cent, aggregationFun = "max")
par(mfrow = c(1, 1))
plot(bpc_raw)
```


- Function: `adjustRtime`.
- Available methods:
  - *peakGroups* (`PeakGroupsParam`) [Smith Anal. chem. 2006]: align
    samples based on hook peaks.
  - *obiwarp* (`ObiwarpParam`) [Prince Anal. chem. 2006]: warps the 
    (full) data to a reference sample.

- peakGroups works reasonably well in most cases.
- Need to define the hook peaks first: peaks present in most/all samples.
- Important parameters:
  - `minFraction`: proportion of samples in which a feature has to be 
    present (0.9 for present in 90% of samples).
  - `span`: degree of smoothing for the loess function, 0 likely 
    overfitting, 1 linear regression. Values between 0.4 and 0.6 seem
	reasonable.
- Interactive code: perform a peak grouping to allow definition of hook 
  peaks and align the samples based on these.

```{r alignment-correspondence}
## Define the settings for the initial peak grouping - details for
## choices in the next section.
pdp <- PeakDensityParam(sampleGroups = data_cent$group, bw = 1.8, 
                        minFraction = 1, binSize = 0.02)
data_cent <- groupChromPeaks(data_cent, pdp)

## Define settings for the alignment
pgp <- PeakGroupsParam(minFraction = 1, span = 0.6)
data_cent <- adjustRtime(data_cent, param = pgp) 
```


- Adjusted retention times are stored in the object.
- Interactive code: inspect the difference between raw and adjusted 
  retention times. Helps to determine whether settings were OK.

```{r alignment-result}
## Plot the difference between raw and adjusted retention times
plotAdjustedRtime(data_cent) 
```


- Evaluate alignment results:
  - difference between raw and adjusted retention time reasonable.
  - hook peaks along the full retention time range.
  - comparison of BPC (TIC) before/after alignment.
  - evaluate data for known compounds.
- Interactive code: plot BPC before and after alignment.

```{r bpc-raw-adjusted}
par(mfrow = c(2, 1), mar = c(3, 4.5, 1, 1))
## Plot the raw base peak chromatogram
plot(bpc_raw)
## Plot the BPC after alignment
plot(chromatogram(data_cent, aggregationFun = "max")) 
```

- Interactive code: plot Serine XIC before and after alignment.

```{r serine-xic-adjusted, message = FALSE, fig.cap = "XIC for Serine before (left) and after (right) alignment", fig.width = 10, fig.height = 4 }
## Use adjustedRtime parameter to access raw/adjusted retention times
par(mfrow = c(1, 2), mar = c(4, 4.5, 1, 0.5))
plot(chromatogram(data_cent, mz = c(106.04, 106.06),
		  rt = c(179, 186), adjustedRtime = FALSE))
plot(chromatogram(data_cent, mz = c(106.04, 106.06),
		  rt = c(179, 186))) 
```


- If we need to repeat simply remove alignment results with 
  `dropAdjustedRtime` and re-run.


### Correspondence

- Aim: group signal (peaks) from the same ion across samples.
- Function: `groupChromPeaks`.
- Methods available:
  - *peak density* (`PeakDensityParam`) [Smith Anal. chem. 2006].
  - *nearest* (`NearestPeaksParam`) [Katajamaa Bioinformatics 2006].

- peak density: 
  - iterates through slices of m/z ranges and groups chromatographic 
    ineach if peaks are close in retention time.
  - Whether they are close is estimated on the distribution of peaks 
    along the retention time.
- Interactive code: plot the data for the m/z slice containing the Serine
  peak and dry-run a correspondence analysis.

```{r correspondence-example}
## Plot the BPC for the m/z slice containing serine
par(mfrow = c(2, 1), mar = c(4, 4.3, 1, 0.5))
plot(chromatogram(data_cent, mz = c(106.04, 106.06), aggregationFun = "max"))
highlightChromPeaks(data_cent, mz = c(106.04, 106.06),
		    whichPeaks = "apex_within")

## Get default parameters for the grouping
pdp <- PeakDensityParam(sampleGroups = data_cent$group)

## Dry-run correspondence and show the results.
plotChromPeakDensity(data_cent, mz = c(106.04, 106.06),
		     type = "apex_within", param = pdp)
```


- Black line: peak density estimate
- points: position of peaks along retention time axis per sample
- grey rectangles: grouped peaks (features).
  
- Parameters:
  - `binSize`: m/z width of the data slice in which peaks are grouped.
  - `bw` defines the smoothness of the density function.
  - `maxFeatures`: maximum number of features to be defined in one bin.
  - `minFraction`: minimum proportion of samples (of one group!) for 
    which a peak has to be present.
  - `minSamples`: minimum number of samples a peak has to be present.
  
- `minFraction` and `minSamples` depend on experimental layout!

- `binSize`: small enough to avoid grouping of peaks from different 
  ions measured at same retention time.

- Interactive code: determine acceptable `bw` setting. Plot data for 
  ions with same m/z and similar retention time: isomers Betaine and 
  Valine ([M+H]+ m/z 118.08625).

```{r correspondence-bw}
par(mfrow = c(3, 1), mar = c(3, 4.3, 1, 1))

## Plot the chromatogram for an m/z slice containing Betaine and Valine
mzr <- 118.08625 + c(-0.01, 0.01)
plot(chromatogram(data_cent, mz = mzr, aggregationFun = "max"))
highlightChromPeaks(data_cent, mz = mzr, whichPeaks = "apex_within")

## Correspondence in that slice using default settings
pdp <- PeakDensityParam(sampleGroups = data_cent$group)
plotChromPeakDensity(data_cent, mz = mzr, param = pdp, type = "apex_within")

## Reducing the bandwidth
pdp <- PeakDensityParam(sampleGroups = data_cent$group, bw = 1.8)
plotChromPeakDensity(data_cent, mz = mzr, param = pdp, type = "apex_within") 
```


- Reducing the `bw` enabled grouping of isomers into separate
  features.
- Interactive code: perform the correspondence analysis.

```{r correspondence-analysis}
pdp <- PeakDensityParam(sampleGroups = data_cent$group, bw = 1.8,
                        minFraction = 0.4, binSize = 0.02)

## Perform the correspondence analysis
data_cent <- groupChromPeaks(data_cent, param = pdp) 
```


- Evaluate results after correspondence: check another slice with 
  isomers: Leucine, Isoleucine ([M+H]+ m/z 132.10191). Setting 
  `simulate = FALSE` shows the actual grouping results.

```{r correspondence-evaluate}
par(mfrow = c(2, 1), mar = c(3, 4.3, 1, 1))

## Plot the chromatogram for an m/z slice containing Leucine and Isoleucine
mzr <- 132.10191 + c(-0.01, 0.01)
plot(chromatogram(data_cent, mz = mzr, aggregationFun = "max"))
highlightChromPeaks(data_cent, mz = mzr, whichPeaks = "apex_within")

plotChromPeakDensity(data_cent, mz = mzr, param = pdp, type = "apex_within",
		     simulate = FALSE) 
```


- Interactive code: inspect definition of features and extract feature 
  intensities.

```{r correspondence-feature-values}
## Definition of the features
featureDefinitions(data_cent)

## Per-feature summary.
head(featureSummary(data_cent))

## feature intensity matrix
fmat <- featureValues(data_cent, value = "into", method = "maxint")
head(fmat) 
```


- `featureValues` parameters:
  - `value`: name of the column in `chromPeaks` that should be returned.
  - `method`: for features with multiple peaks in one sample: from which
    peak the should the value be returned?

- About missing values: peak detection may have failed. `fillChromPeaks` 
  allows to *fill-in* signal for missing peaks from the feature area 
  (defined by the median rt and mz of all peaks assigned to the feature).
  Parameters:
  - `expandMz`: expands the region from which signal is integrated in m/z
    dimension. A value of 0 means no expansion, 1 means the region is grown 
    by half of the feature's m/z width on both sides.
  - `expandRt`: expand the retention time window of the feature for 
    integration.
  - `ppm`: expand the m/z width by a m/z dependent value.

- Interactive code: evaluate the number of missing peaks and use 
  `fillChromPeaks` to retrieve a signal for them from the raw files.

```{r fillChromPeaks}
## Number of missing values
sum(is.na(fmat))

## Define the settings for the fill-in of missing peaks
fpp <- FillChromPeaksParam(expandMz = 0.5, expandRt = 0.5, ppm = 20)
data_cent <- fillChromPeaks(data_cent, param = fpp)

## How many missing values after
sum(is.na(featureValues(data_cent)))

fmat_fld <- featureValues(data_cent, value = "into", method = "maxint")
head(fmat_fld) 
```


- Note: `dropFilledChromPeaks` removes filled-in peaks again.

- `XCMSnExp` objects contain also the complete processing history
  including parameter classes.

```{r correspondence-result-object}
## Overview of the performed processings
processHistory(data_cent)

## Access the parameter class for a processing step
processParam(processHistory(data_cent)[[1]]) 
```


# Summary

- Don't blindly use default parameters!
- The new data objects and functions are aimed to:
  - simplify data access and inspection of results
  - facilitate data set-dependent definition  of algorithm parameters.
- More work to come for the analysis of chromatographic data (SRM/MRM)
  and eventually for data normalization.