-
Notifications
You must be signed in to change notification settings - Fork 12
/
xcms-preprocessing.Rmd
1182 lines (962 loc) · 51.4 KB
/
xcms-preprocessing.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Metabolomics data pre-processing using xcms"
author:
- name: "Johannes Rainer"
affiliation: "Eurac Research, Bolzano, Italy; [email protected] github: jorainer twitter: jo_rainer"
graphics: yes
date: "24 June 2018; updated March 25 2021"
output:
BiocStyle::html_document:
number_sections: true
toc_float: true
toc_depth: 2
bibliography: references.bib
---
```{r style, message = FALSE, echo = FALSE, warning = FALSE, results = "asis"}
library("BiocStyle")
library("knitr")
library("rmarkdown")
opts_chunk$set(message = FALSE, error = FALSE, warning = FALSE,
cache = FALSE, fig.width = 7, fig.height = 7)
```
# Abstract
In this document we discuss mass spectrometry (MS) data handling and access
using Bioconductor's `r Biocpkg("MSnbase")` package [@Gatto:2012io] and walk
through the preprocessing of an (untargeted) LC-MS toy data set using the
`r Biocpkg("xcms")` package [@Smith:2006ic]. The preprocessing comprises
chromatographic peak detection, sample alignment and peak correspondence.
Particular emphasis is given on defining data-set dependent values for the most
important settings of popular preprocessing methods.
# Introduction
Preprocessing of untargeted metabolomics data is the first step in the analysis
of GC/LS-MS based untargeted metabolomics experiments. The aim of the
preprocessing is the quantification of signals from ion species measured in a
sample and matching of these entities across samples within an experiment. The
resulting two-dimensional matrix with feature abundances in all samples can then
be further processed, e.g. by normalizing the data to remove sampling
differences, batch effects or injection order-dependent signal drifts. Another
crucial step in untargeted metabolomics analysis is the annotation of the
(m/z-retention time) features to the actual ions and metabolites they represent.
Note that data normalization and annotation are not covered in this document.
People familiar with the concepts of mass spectrometry or LC-MS data analysis
may jump directly to the next section.
## Prerequisites
The analysis in this document requires an R version >= 4.0.0 and recent versions
of the `MSnbase` and `xcms` (version >= 3.11.3 is needed) packages. The packages
can be installed with:
```{r install-required, eval = FALSE}
#' Install the Bioconductor package manager
install.packages("BiocManager")
#' Install the required packages
BiocManager::install(c("xcms",
"MSnbase",
"msdata",
"magrittr",
"png"))
```
Because `xcms` version 3.11.3 is a developmental version we might have to
install it from github if the required version was not installed with the
previous command.
```{r check-xcms, message = FALSE, warning = FALSE}
if (!packageVersion("xcms") >= "3.11.3") {
Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS="true")
devtools::install_github("sneumann/xcms")
}
```
The
[xcms-preprocessing.Rmd](https://github.com/jorainer/metabolomics2018/blob/master/xcms-preprocessing.Rmd)
file with all the code for the analysis can be downloaded from [github]
(https://github.com/jorainer/metabolomics2018) ideally with `git clone
https://github.com/jorainer/metabolomics2018` from the command line.
## Mass spectrometry
Mass spectrometry allows to measure abundances of charged molecules (ions) in a
sample. Abundances are determined as ion counts for a specific mass-to-charge
ratio m/z. The measured signal is represented as a spectrum: intensities along
m/z.
![](images/MS.png)
Many ions have the same or a very similar m/z making it difficult or impossible
to discriminate them. MS is thus frequently coupled with a second technology to
separate analytes based on other properties than their mass (or rather
m/z). Common choices are gas chromatography (GC) or liquid chromatography
(LC). Such an e.g. LC-MS setup performs scans at discrete time points resulting
in a set of spectra for a given sample, with allows to separate compounds (ions)
on m/z and on retention time dimension.
![](images/LCMS.png)
In such GC/LC-MS based untargeted metabolomics experiments the data is analyzed
along the retention time dimension and *chromatographic* peaks (which are
supposed to represent the signal from a ion species) are identified and
quantified.
## Definitions and common naming convention
Naming conventions and terms used in this document are:
- chromatographic peak: peak containing the signal from an ion in retention time
dimension (different from a *mass* peak that represents the signal along the
m/z dimension within a spectrum).
- chromatographic peak detection: process in which chromatographic peaks are
identified within each file.
- alignment: process that adjusts for retention time differences between
measurements/files.
- correspondence: grouping of chromatographic peaks (presumably from the same
ion) across files.
- feature: chromatographic peaks grouped across files.
# Workflow: metabolomics data preprocessing using `xcms`
This workflow describes the basic data handling (I/O) of mass spectrometry data
using the `r Biocpkg("MSnbase")` package, and the LC/GC-MS data preprocessing
using `r Biocpkg("xcms")`. It showcases the new functionality and user interface
functions of `xcms`, that re-use functionality from the `MSnbase` package. The
first part of the workflow is focused on data import, access and visualization
which is followed by the description of a simple data centroiding approach and
finally the `xcms`-based LC-MS data preprocessing that comprises chromatographic
peak detection, alignment and correspondence. The workflow does not cover data
normalization procedures, compound identification and differential abundance
analysis.
## Data import and representation
The example data set of this workflow consists of two files in mzML format with
signals from pooled human serum samples measured with a ultra high performance
liquid chromatography (UHPLC) system (Agilent 1290) coupled with a Q-TOF MS
(TripleTOF 5600+ AB Sciex) instrument. Chromatographic separation was based on
hydrophilic interaction liquid chromatography (HILIC) separating metabolites
depending on their polarity. The setup thus allows to measure small polar
compounds and hence metabolites from the main metabolic pathways. The input
files contain all signals measured by the MS instrument (so called *profile
mode* data). To reduce file sizes, the data set was restricted to an m/z range
from 105 to 134 and retention times from 0 to 260 seconds.
In the code block below we first load all required libraries and define the
location of the mzML files, which are part of the `msdata` package. We also
define a `data.frame` describing the samples/experiment and pass this to the
`readMSData` function which imports the data. The option `mode = "onDisk"` tells
the function to read only general metadata into memory. The m/z and intensity
values are not imported but retrieved from the original files on demand, which
enables also analyses of very large experiments.
```{r load-data}
library(xcms)
library(magrittr)
## Define the file names.
fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE)
## Define a data.frame with additional information on the files.
pd <- data.frame(file = basename(fls),
injection_idx = c(1, 19),
sample = c("POOL_1", "POOL_2"),
group = "POOL")
data <- readMSData(fls, pdata = new("NAnnotatedDataFrame", pd),
mode = "onDisk")
```
Note that for a *real* experiment it is suggested to define such a *phenodata*
table as an e.g. tab delimited text file (or an xls sheet) that contains all of
the raw data file names along with all the relevant sample information for each
file as additional columns. Such a data table could then be imported into R
using e.g. `read.table` and the file names could be specified for `readMSData`
with `files = paste0(MZML_PATH, "/", pd$mzML_file)` where `MZML_PATH` would be
the directory containing the data (mzML) files, `pd` would be the imported
phenodata table that contains in a column named `"mzML_file"` the names of the
individual raw files.
Next we set up parallel processing. This ensures that all required cores are
registered and available from the beginning of the analysis. All data access and
analysis functions of `xcms` and `MSnbase` are parallelized on a per-file basis
and will use this setup by default.
```{r parallel-setup, eval = TRUE}
#' Set up parallel processing using 2 cores
if (.Platform$OS.type == "unix") {
register(bpstart(MulticoreParam(2)))
} else {
register(bpstart(SnowParam(2)))
}
```
The MS experiment data is now represented as an `OnDiskMSnExp` object. Phenotype
information can be retrieved with the `pData` function, single columns in the
phenotype table using `$`. Below we access sample descriptions.
```{r show-pData}
#' Access phenotype information
pData(data)
#' Or individual columns directly using the $ operator
data$injection_idx
```
General information on each spectrum in the experiment can be accessed with the
`fData` function, that returns a `data.frame` each row with metadata information
for one spectrum.
```{r show-fData}
#' Access spectrum header information
head(fData(data), n = 3)
```
## Basic data access and visualization
The MS data in an `OnDiskMSnExp` object is organized by spectrum (similar to
*mzML* files), with `Spectrum` objects used as containers for the m/z and
intensity values. General spectrum information can be retrieved using the
`msLevel`, `centroided`, `rtime` or `polarity` functions that return the
respective value for all spectra from all files. Here, the `fromFile` function
can be helpful which returns for each spectrum the index of the file in which it
was measured. This is shown in the code block below.
```{r general-access}
#' Get the retention time
head(rtime(data))
#' Get the retention times splitted by file.
rts <- split(rtime(data), fromFile(data))
#' The result is a list of length 2. The number of spectra per file can
#' then be determined with
lengths(rts)
```
The `spectra` function can be used to retrieve the list of all spectra (from all
files). This will load the full data from all raw files, which can take,
depending on the size of the files and number of spectra, relatively long time
and requires, depending on the experiment, a considerable amount of memory. In
most cases we will however work with sub-sets of the data, and retrieving that
can, in the case of indexed mzML, mzXML and CDF files, be very fast. Data
objects can be subsetted using the filter functions: `filterFile`,
`filterRtime`, `filterMz` or `filterMsLevel` that filter the data by file,
retention time range, m/z range or MS level. To illustrate this we retrieve
below all spectra measured between 180 and 181 seconds. These contain the signal
from all compounds that eluted from the LC in that time window. Note that we use
the pipe operator `%>%` from the `magrittr` package for better readability.
```{r spectra-filterRt}
#' Get all spectra measured between 180 and 181 seconds
#' Use %>% to avoid nested function calls
sps <- data %>%
filterRt(rt = c(180, 181)) %>%
spectra
```
The result is a `list` of `Spectrum` objects. Below we determine the number of
spectra we have got.
```{r spectra-filterRt-length}
#' How many spectra?
length(sps)
```
We can use the `fromFile` function to determine from which file/sample each
spectrum is.
```{r spectra-filterRt-fromFile}
#' From which file?
vapply(sps, fromFile, integer(1))
```
We have thus 3 spectra per file. Next we plot the data from the last spectrum
(i.e. the 3rd spectrum in the present retention time window from the second
file).
```{r spectrum-plot, fig.cap = "Spectrum at a retention time of about 180 seconds."}
plot(sps[[6]])
```
We can immediately spot several mass peaks in the spectrum, with the largest one
at a m/z of about 130 and the second largest at about 106, which matches the
expected mass to charge ratio for the [M+H]+ ion (adduct) of
[Serine](https://en.wikipedia.org/wiki/Serine). Serine in its natural state is
not charged and can therefore not be measured directly with a MS
instrument. Uncharged compounds, such as Serine, have thus to be first ionized
to create charged molecules (e.g. using electrospray-ionization as in the
present data set). Such ionization would create [M+H]+ ions of Serine,
i.e. molecules that consist of Serine plus a hydrogen resulting in single
charged ions (with a mass equal to sum of the masses of Serine and hydrogen).
MS data is generally organized by spectrum, but in LC-MS experiments we analyze
the data along the retention time axis and hence orthogonally to the spectral
data representation. To extract data along the retention time dimension we can
use the `chromatogram` function. This function aggregates intensities for each
scan/retention time along the m/z axis (i.e. within each spectrum) and returns
the retention time - intensity duplets in a `Chromatogram` object, one per
file. The `Chromatogram` object supports, similar to the `Spectrum` object, the
`rtime` and `intensity` functions to access the respective data. Below we use
the `chromatogram` function to extract the total ion chromatogram (TIC) for each
file and plot it. The TIC represents the sum of all measured signals per
spectrum (i.e. per discrete time point) and provides thus a general information
about compound separation by liquid chromatography.
```{r chromatogram-tic, fig.cap = "Total ion chromatogram.", fig.width = 10, fig.height = 5}
#' Get chromatographic data (TIC) for an m/z slice
chr <- chromatogram(data)
chr
#' Plot the tic
plot(chr)
```
The object returned by the `chromatogram` function arranges the individual
`Chromatogram` objects in a two-dimensional array, columns being samples (files)
and rows data slices. Below we extract the (total ion) intensities from the TIC
of the first file.
```{r chromatogram-tic-intensity}
ints <- intensity(chr[1, 1])
head(ints)
```
The object contains also all phenotype information from the original `data`
variable. This can be accessed in the same way than for `OnDiskMSnExp` objects
(or most other data objects in Bioconductor).
```{r chromatogram-pdata}
#' Access the full phenotype data
pData(chr)
```
Depending on the parameter `aggregationFun`, the function can produce total ion
chromatograms (TIC, sum of the signal within a spectrum) with `aggregationFun =
"sum"`, or base peak chromatograms (BPC, maximum signal per spectrum) with
`aggregationFun = "max"`. The `chromatogram` function can also be used to
generate extracted ion chromatograms (EIC), which contain the signal from a
specific m/z range and retention time window, presumably representing the signal
of a single ion species. Below we extract and plot the ion chromatogram for
Serine after first filtering the data object to the retention time window and
m/z range containing signal for this compound.
```{r serine-xic, fig.cap = "Extracted ion chromatogram for the Serine [M+H]+ ion in both files."}
#' Extract and plot the XIC for Serine
data %>%
filterRt(rt = c(175, 189)) %>%
filterMz(mz = c(106.02, 106.07)) %>%
chromatogram(aggregationFun = "max") %>%
plot()
```
The area of such a chromatographic peak is supposed to be proportional to the
amount of the corresponding ion in the respective sample and identification and
quantification of such peaks is one of the goals of the GC/LC-MS data
preprocessing.
## Centroiding of profile MS data
MS instruments allow to export data in profile or centroid mode. Profile data
contains the signal for all discrete m/z values (and retention times) for which
the instrument collected data [@Smith:2014di]. MS instruments continuously
sample and record signals and a mass peak for a single ion in one spectrum will
thus consist of a multiple intensities at discrete m/z values. Centroiding is
the process to reduce these mass peaks to a single representative signal, the
centroid. This results in much smaller file sizes, without loosing too much
information. `xcms`, specifically the *centWave* chromatographic peak detection
algorithm, was designed for centroided data, thus, prior to data analysis,
profile data, such as the example data used here, should be centroided. The
`MSnbase` package provides all tools to perform this centroiding (and data
smoothing) in R: `pickPeaks` and `smooth`.
Below we inspect the profile data for the [M+H]+ ion adduct of Serine. We again
subset the data to the m/z and retention time range containing signal from
Serine and `plot` the data with the option `type = "XIC"`, that generates a
combined chromatographic and *map* visualization of the data (i.e. a plot of the
individual m/z, rt and intensity data tuples with data points colored by their
intensity in the m/z - retention time space).
```{r serine-profile-mode-data, fig.cap = "Profile data for Serine.", fig.width = 10, fig.height = 5, fig.pos = "h!"}
#' Filter the MS data to the signal from the Serine ion and plot it using
#' type = "XIC"
data %>%
filterRt(rt = c(175, 189)) %>%
filterMz(mz = c(106.02, 106.07)) %>%
plot(type = "XIC")
```
The plot shows all data points measured by the instrument. Each *column* of data
points in the lower panel represents the signal measured at one discrete time
point, stored in one spectrum. We can see a distribution of the signal for
serine in both retention time and also in m/z dimension.
Next we smooth the data in each spectrum using a Savitzky-Golay filter, which
usually improves data quality by reducing noise. Subsequently we perform the
centroiding based on a simple peak-picking strategy that reports the maximum
signal for each mass peak in each spectrum.
```{r centroiding, fig.cap = "Centroided data for Serine.", fig.width = 10, fig.height = 5, fig.pos = "h!"}
#' Smooth the signal, then do a simple peak picking.
data_cent <- data %>%
smooth(method = "SavitzkyGolay", halfWindowSize = 6) %>%
pickPeaks()
#' Plot the centroided data for Serine
data_cent %>%
filterRt(rt = c(175, 189)) %>%
filterMz(mz = c(106.02, 106.07)) %>%
plot(type = "XIC")
```
The centroiding reduced the data to a single data point for an ion in each
spectrum. For more advanced centroiding options that can also fine-tune the m/z
value of the reported centroid see the `pickPeaks` help or the centroiding
vignette in `MSnbase`.
The raw data was imported using the *onDisk*-mode that does not read the full MS
data into memory. Any data manipulation (such as the data smoothing or peak
picking above) has thus to be applied *on-the-fly* to the data each time m/z or
intensity values are retrieved. To make any data manipulation on an
`OnDiskMSnExp` object *persistent* we need to export the data to mzML files and
re-read the data again. Below we thus save the centroided data as mzML files and
import it again.
```{r export-centroided-prepare, echo = FALSE, results = "hide"}
#' Silently removing exported mzML files if they do already exist.
lapply(basename(fileNames(data)), function (z) {
if (file.exists(z))
file.remove(z)
})
```
```{r export-centroided}
#' Write the centroided data to files with the same names in the current
#' directory
fls_new <- basename(fileNames(data))
writeMSData(data_cent, file = fls_new)
#' Read the centroided data.
data_cent <- readMSData(fls_new, pdata = new("NAnnotatedDataFrame", pd),
mode = "onDisk")
```
## Preprocessing of LC-MS data
Preprocessing of GC/LC-MS data in untargeted metabolomics experiments aims at
quantifying the signal from individual ion species in a data set and consists
of the 3 steps chromatographic peak detection, alignment (also called retention
time correction) and correspondence (also called peak grouping). The resulting
matrix of feature abundances can then be used as an input in downstream analyses
including data normalization, identification of features of interest and
annotation of features to metabolites.
### Chromatographic peak detection
Chromatographic peak detection aims to identify peaks along the retention time
axis that represent the signal from individual compounds' ions. This can be
performed with the `findChromPeaks` function and one of the available algorithms
that can be configures with the respective parameter object: passing a
`MatchedFilterParam` to `findChromPeaks` performs peak detection as described in
the original xcms article [@Smith:2006ic]. With `CentWaveParam` a continuous
wavelet transformation (CWT)-based peak detection is performed that can detect
close-by and partially overlapping peaks with different (retention time) widths
[@Tautenhahn:2008fx]. With `MassifquantParam` a Kalman filter-based peak
detection can be performed [@Conley:2014ha]. Additional peak detection
algorithms for direct injection data are also available, but not discussed here.
We use the *centWave* algorithm that performs peak detection in two steps: first
it identifies *regions of interest* in the m/z - retention time space and
subsequently detects peaks in these regions using a continuous wavelet transform
(see the original publication for more details). The algorithm can be configured
with several parameters (see `?CentWaveParam`), the most important ones being
`peakwidth` and `ppm`. `peakwidth` defines the minimal and maximal expected
width of the peak in retention time dimension and depends thus on the setting of
the employed LC-MS system making this parameter highly data set
dependent. Appropriate values can be estimated based on extracted ion
chromatograms of e.g. internal standards or known compounds in the data. Below
we extract the chromatographic data for Serine and perform a peak detection on
the `Chromatogram` object using the default parameters for centWave.
```{r centWave-default, fig.cap = "XIC for Serine", results = "hide"}
#' Get the XIC for serine in all files
srn_chr <- chromatogram(data_cent, rt = c(164, 200),
mz = c(106.03, 106.06),
aggregationFun = "max")
#' Plot the data
par(mfrow = c(1, 1), mar = c(4, 4.5, 1, 1))
plot(srn_chr)
#' Get default centWave parameters
cwp <- CentWaveParam()
#' "dry-run" peak detection on the XIC.
res <- findChromPeaks(srn_chr, param = cwp)
chromPeaks(res)
```
No peaks were identified by the call above. Looking at the default values for
the centWave parameters helps understanding why peak detection failed:
```{r centWave-default-parameters}
cwp
```
The default settings for `peakwidth` are 20 to 50 seconds, while from the plot
above it is apparent that the chromatographic peak for Serine is about 4 seconds
wide. Below we adapt the settings to accommodate peaks ranging from 2 to 10
seconds and re-run the peak detection. In general, it is advised to investigate
peak widths for several ions in the data set to determine the most appropriate
`peakwidth` setting. Note that we use also `integrate = 2` which uses a
different way to define the peak borders that can (as in the present case)
result in better peak boundary estimates (try with the default `integrate = 1`
to compare the difference).
```{r centWave-adapted, fig.cap = "XIC for Serine with detected chromatographic peak", results = "hide"}
cwp <- CentWaveParam(peakwidth = c(2, 10), integrate = 2)
srn_chr <- findChromPeaks(srn_chr, param = cwp)
#' Plot the data and higlight identified peak area
plot(srn_chr)
```
With our data set-specific `peakwidth` we were able to detect the peak for
Serine. The identified chromatographic peaks have been added to the result
object `srn_chr` and can be extracted/inspected with the `chromPeaks` function.
```{r chromPeaks-chromatogram}
chromPeaks(srn_chr)
```
The `matrix` returned by `chromPeaks` contains the retention time and m/z range
of the peak (`"rtmin"`, `"rtmax"`, `"mzmin"` and `"mzmax"` as well as the
integrated peak area (`"into"`), the maximal signal (`"maxo"`) and the signal to
noise ratio (`"sn"`).
Another important parameter for centWave is `ppm` which is used in the initial
identification of the regions of interest. In contrast to random noise, the
*real* signal from an ion is expected to yield stable m/z values in consecutive
scans (the scattering of the m/z values around the *real* m/z value of the ion
is supposed to be inversely related with its intensity). In centWave, all data
points that differ by less than `ppm` in consecutive spectra are combined into a
region of interest that is then subject to the CWT-based peak detection (same
as performed above on the XIC). To illustrate this, we plot the data for
Serine with the option `type = "XIC"`.
```{r Serine-mz-scattering-plot}
#' Restrict the data to signal from Serine
srn <- data_cent %>%
filterRt(rt = c(179, 186)) %>%
filterMz(mz = c(106.04, 106.06))
#' Plot the data
plot(srn, type = "XIC")
```
We can observe some scattering of the data points in m/z dimension (lower panel
in the plot above), that decreases with increasing intensity of the signal.
Below we identify the sample with the highest signal (sum of all intensities)
for the selected m/z - rt range. We thus first extract the intensities for all
spectra and split these by file. Then we sum all intensities reported for a
file.
```{r sample-highest-signal}
#' Extract intensities and split them by file. This will return
#' a list of lists.
ints_per_file <- split(intensity(srn), fromFile(srn))
#' For each file, sum up the intensities.
ints_sum <- lapply(ints_per_file, function(x) sum(unlist(x)))
ints_sum
```
We next calculate the differences in m/z values between consecutive scans in
this data subset. We do this for one *representative* file, selecting the one
with the highest total sum of intensities in the selected region.
```{r define-ppm}
#' Extract the Serine data for one file as a data.frame
srn_df <- as(filterFile(srn, file = which.max(ints_sum)), "data.frame")
#' The difference between m/z values from consecutive scans expressed
#' in ppm
diff(srn_df$mz) * 1e6 / mean(srn_df$mz)
```
The difference in m/z values for the Serine data is thus between 0 and 27
ppm. This should ideally be evaluated for several compounds and should be set to
a value that allows to capture the full chromatographic peaks for most of the
tested compounds. We can next perform the peak detection using our settings for
the `ppm` and `peakwidth` parameters.
```{r findPeaks-centWave}
#' Perform peak detection
cwp <- CentWaveParam(peakwidth = c(2, 10), ppm = 30, integrate = 2)
data_cent <- findChromPeaks(data_cent, param = cwp)
```
The result from the `findChromPeaks` call is an `XCMSnExp` object which contains
all preprocessing results and, by extending the `OnDiskMSnExp` object, inherits
all of its functionality that has been described so far. The results from the
peak detection analysis can be accessed with the `chromPeaks` function, that,
with the optional `rt` and `mz` parameters, allows to extract identified
chromatographic peaks from specific areas in the data. Below we extract all
identified peaks for a certain m/z - rt area.
```{r xcmsnexp}
#' Access the peak detection results from a specific m/z - rt area
chromPeaks(data_cent, mz = c(106, 107), rt = c(150, 190))
```
For each identified peak the m/z and rt value of the apex is reported (columns
`"mz"` and `"rt"`) as well as their ranges (`"mzmin"`, `"mzmax"`, `"rtmin"`,
`"rtmax"`), the integrated signal of the peak (i.e. the peak area `"into"`),
the maximal signal of the peak (`"maxo"`), the signal to noise ratio (`"sn"`)
and the index of the sample in which the peak was detected (`"sample"`).
Note that **after** peak detection, extracted ion chromatogram data will also
contain identified chromatographic peaks. Below we extract the EIC for Serine
and display the identified peaks for that compound.
```{r chrom-after}
eic_serine <- chromatogram(data_cent, mz = c(106.04, 106.06),
rt = c(179, 186))
chromPeaks(eic_serine)
```
And plotting this extracted ion chromatogram will also show the identified
chromatographic peaks.
```{r plot-after}
plot(eic_serine)
```
A plot of type `"XIC"` **after** peak detection will also indicate the rt-m/z
region from which signal is integrated (the red rectangle in the plot below).
```{r xic-after}
srn <- data_cent %>%
filterRt(rt = c(175, 188)) %>%
filterMz(mz = c(106.04, 106.06))
plot(srn, type = "XIC")
```
Peak detection will not always work perfectly leading to peak detection
artifacts, such as overlapping peaks or artificially split peaks. The
`refineChromPeaks` function allows to *refine* peak detection results by either
or removing identified peaks exceeding a maximal allower rt width or by merging
artificially split chromatographic peaks. For the former a `CleanPeaksParam`
parameter object can be submitted to the function while the latter can be
configured with the `MergeNeighboringPeaksParam` object. Below we post-process
the peak detection results merging peaks overlapping in a 4 second window per
file if the signal between in between them is lower than 75% of the smaller
peak's maximal intensity. See the `MergeNeighboringPeaksParam` help page for a
detailed description of the settings and the approach.
```{r}
mpp <- MergeNeighboringPeaksParam(expandRt = 4)
data_cent_pp <- refineChromPeaks(data_cent, param = mpp)
```
Merged peaks can be identified with a `TRUE` in the `"merged"` chromatographic
peak metadata column (see below).
```{r}
chromPeakData(data_cent_pp)
```
An example for a joined (merged) peak is given below
```{r merged-peak, fig.width = 12, fig.height = 6, fig.cap = "Result from the peak refinement. Left: peaks before merging, right after merging."}
mzr <- c(124.084, 124.088)
rtr <- c(150, 170)
chr_1 <- chromatogram(filterFile(data_cent, 2), mz = mzr, rt = rtr)
chr_2 <- chromatogram(filterFile(data_cent_pp, 2), mz = mzr, rt = rtr)
par(mfrow = c(1, 2))
plot(chr_1)
plot(chr_2)
```
At last we replace the `data_cent` variable with the object containing also the
peak refinement results.
```{r}
data_cent <- data_cent_pp
```
For quality assessment we could now calculate summary statistics on the
identified peaks to e.g. identify samples with much less detected peaks. Also,
we can use the `plotChromPeaks` function to provide some general information on
the location of the identified chromatographic peaks in the m/z - rt space.
```{r plotChromPeaks, fig.cap = "Location of the identified chromatographic peaks in the m/z - rt space.", fig.height = 7, fig.width = 12}
par(mfrow = c(1, 2))
plotChromPeaks(data_cent, 1)
plotChromPeaks(data_cent, 2)
```
### Alignment
While chromatography helps to discriminate better between analytes it is also
affected by variances that can lead to shifts in retention times between
measurement runs. The alignment step aims to adjust these retention time
differences between samples within an experiment. Below we plot the base peak
chromatograms of both files of our toy data set to visualize these
differences. Note that with `peakType = "none"` we disable plotting of
identified chromatographic peaks that would be drawn by default on chromatograms
extracted from an object containing peak detection results.
```{r alignment-bpc-raw, fig.cap = "BPC of all files.", fig.width = 8, fig.height = 4}
#' Extract base peak chromatograms
bpc_raw <- chromatogram(data_cent, aggregationFun = "max")
plot(bpc_raw, peakType = "none")
```
While both samples were measured with the same setup on the same day the two
chromatograms are slightly shifted.
Alignment can be performed with `xcms` using the `adjustRtime` function that
supports the *peakGroups* [@Smith:2006ic] and the *obiwarp* [@Prince:2006jj]
method. The settings for the algorithms can be defined with the
`PeakGroupsParam` and the `ObiwarpParam` parameter objects, respectively. Note
also that `xcms` supports now a *subset-based* alignment which would allow to
align a whole data set based on retention time shift adjustments estimated on
e.g. QC samples. See the `xcms`
[vignette](https://bioconductor.org/packages/release/bioc/vignettes/xcms/inst/doc/xcms.html)
for more information.
For our example we use the peakGroups method that aligns samples based on the
retention times of *hook peaks*, which should be present in most samples and
which, because they are supposed to represent signal from the same ion species,
can be used to estimate retention time shifts between samples. Prior to the
alignment we thus have to group peaks across samples, which is accomplished by
the *peakDensity* correspondence analysis method. Details about this method and
explanations on the choices of its parameters are provided in the next
section. After having performed this initial correspondence analysis, we perform
the alignment using settings `minFraction = 1` and `span = 0.6`. `minFraction`
defines the proportion of samples in which a candidate hook peak has to be
detected/present. A value of 0.9 would e.g. require for a hook peak to be
detected in in 90% of all samples of the experiment. Our data represents
replicated measurements of the same sample pool and we can therefore require
hook peaks to be present in each file. The parameter `span` defines the degree
of smoothing of the loess function that is used to allow different regions along
the retention time axis to be adjusted by a different factor. A value of 0 will
most likely cause overfitting, while 1 would perform a constant, linear
shift. Values between 0.4 and 0.6 seem to be reasonable for most experiments.
```{r alignment-correspondence}
#' Define the settings for the initial peak grouping - details for
#' choices in the next section.
pdp <- PeakDensityParam(sampleGroups = data_cent$group, bw = 1.8,
minFraction = 1, binSize = 0.02)
data_cent <- groupChromPeaks(data_cent, pdp)
#' Define settings for the alignment
pgp <- PeakGroupsParam(minFraction = 1, span = 0.6)
data_cent <- adjustRtime(data_cent, param = pgp)
```
Adjusted retention times are stored, along with the raw retention times, within
the result object. Any function accessing retention times (such as `rtime`) will
by default return adjusted retention times from an `XCMSnExp` object, if
present. Note that also the retention times of the identified chromatographic
peaks were adjusted by the `adjustRtime` call. After alignment it is suggested
to evaluate alignment results e.g. by inspecting differences between raw and
adjusted retention times.
```{r alignment-result, fig.width = 8, fig.height = 4, fig.cap = "Alignment results. Shown is the difference between raw and adjusted retention times and the hook peaks that were used for the alignment (shown as points)."}
#' Plot the difference between raw and adjusted retention times
plotAdjustedRtime(data_cent)
```
The difference between raw and adjusted retention time should be reasonable. In
our example it is mostly below one second, which is OK since the samples were
measured within a short time period and differences are thus expected to be
small. Also, hook peaks should ideally be present along the full retention time
range. Next we plot the base peak chromatograms before and after alignment. Note
that a `chromatogram` call will always include all detected chromatographic
peaks in the requested rt range but for a base peak chromatogram we don't need
this information. With `include = "none"` in the `chromatogram` call we tell the
function to **not** include any identified chromatographic peaks in the
returned chromatographic data.
```{r bpc-raw-adjusted, fig.cap = "BPC before (top) and after (bottom) alignment.", fig.width = 10, fig.height = 8}
par(mfrow = c(2, 1))
#' Plot the raw base peak chromatogram
plot(bpc_raw, peakType = "none")
#' Plot the BPC after alignment
plot(chromatogram(data_cent, aggregationFun = "max", include = "none"))
```
The base peak chromatograms are nicely aligned after retention time
adjustment. The impact of the alignment should also be evaluated on known
compounds or internal standards. We thus plot below the XIC for Serine before
and after alignment.
```{r serine-xic-adjusted, fig.cap = "XIC for Serine before (left) and after (right) alignment", fig.width = 10, fig.height = 4}
#' Use adjustedRtime parameter to access raw/adjusted retention times
par(mfrow = c(1, 2), mar = c(4, 4.5, 1, 0.5))
plot(chromatogram(data_cent, mz = c(106.04, 106.06),
rt = c(179, 186), adjustedRtime = FALSE))
plot(chromatogram(data_cent, mz = c(106.04, 106.06),
rt = c(179, 186)))
```
The Serine peaks are also nicely aligned after adjustment. Note that if we were
not happy with the alignment results we could simply retry with different
settings after removing old results with the `dropAdjustedRtime` function. This
function restores also the original retention times of the identified
chromatographic peaks.
### Correspondence
The final step of the LC-MS preprocessing with `xcms` is the correspondence
analysis, in which chromatographic peaks from the same ion are grouped across
samples to form a *feature*. `xcms` implements two methods for this purpose:
*peak density* [@Smith:2006ic] and *nearest* [@Katajamaa:2006jh] that can be
configured by passing either a `PeakDensityParam` or a `NearestPeaksParam`
object to the `groupChromPeaks` function. For our example we use the *peak
density* method that iterates through m/z slices in the data and groups
chromatographic peaks to features in each slice (within the same or across
samples) depending on their retention time and the distribution of
chromatographic peaks along the retention time axis. Peaks representing signal
from the same ion are expected to have a similar retention time and, if found in
many samples, this should also be reflected by a higher peak density at the
respective retention time. To illustrate this we extract below an m/z slice
containing the Serine peak and use the `plotChromPeakDensity` function to
visualize the distribution of peaks along the retention time axis and to
*simulate* a correspondence analysis based on the provided settings.
```{r correspondence-example, fig.cap = "BPC for a m/z slice and defined features within this slice based on default settings.", fig.width = 10, fig.height = 7}
#' Get default parameters for the grouping
pdp <- PeakDensityParam(sampleGroups = data_cent$group)
#' Extract a BPC for the m/z slice containing serine
bpc_serine <- chromatogram(data_cent, mz = c(106.04, 106.06),
aggregationFun = "max")
#' Dry-run correspondence and show the results.
plotChromPeakDensity(bpc_serine, param = pdp)
```
The upper panel in the plot above shows the chromatographic data with the
identified peaks. The lower panel shows the retention time of identified peaks
(x-axis) per sample (y-axis) with the black solid line representing their
distribution along the x-axis. Peak groups (features) are indicated with grey
rectangles. The peak density correspondence method groups all chromatographic
peaks under the same *density peak* into a feature. With the default settings we
were able to group the Serine peak of each sample into a feature. The
parameters for the peak density correspondence analysis are:
- `binSize`: m/z width of the bin/slice of data in which peaks are grouped.
- `bw` defines the smoothness of the density function.
- `maxFeatures`: maximum number of features to be defined in one bin.
- `minFraction`: minimum proportion of samples (of one group!) for which a peak
has to be present.
- `minSamples`: minimum number of samples a peak has to be present.
The parameters `minFraction` and `minSamples` depend on the experimental layout
and should be set accordingly. `binSize` should be set to a small enough value
to avoid peaks from different ions, but with similar m/z and retention time,
being grouped together. The most important parameter however is `bw` and, while
its default value of 30 was able to correctly group the Serine peaks, it should
always be evaluated on other, more complicated, signals too. Below we evaluate
the performance of the default parameters on an m/z slice that contains signal
from multiple ions with the same m/z, including isomers Betaine and Valine
([M+H]+ m/z 118.08625).
```{r correspondence-bw, fig.cap = "Correspondence analysis with default settings on an m/z slice containing signal from multiple ions.", fig.width = 10, fig.height = 7}
#' Plot the chromatogram for an m/z slice containing Betaine and Valine
mzr <- 118.08625 + c(-0.01, 0.01)
chr <- chromatogram(data_cent, mz = mzr, aggregationFun = "max")
#' Correspondence in that slice using default settings
pdp <- PeakDensityParam(sampleGroups = data_cent$group)
plotChromPeakDensity(chr, param = pdp)
```
With default settings all chromatographic peaks present in the m/z slice were
grouped into the same feature. Signal from different ions would thus be treated
as a single entity. Below we repeat the analysis with a strongly reduced value
for `bw`.
```{r correspondence-bw-fix, fig.cap = "Correspondence analysis with reduced bw setting on a m/z slice containing signal from multiple ions.", fig.width = 10, fig.height = 7}
#' Reducing the bandwidth
pdp <- PeakDensityParam(sampleGroups = data_cent$group, bw = 1.8)
plotChromPeakDensity(chr, param = pdp)
```
With a `bw` of `1.8` we successfully grouped the peaks into different
features. We can now use these settings for the correspondence analysis on the
full data set.
```{r correspondence-analysis}
pdp <- PeakDensityParam(sampleGroups = data_cent$group, bw = 1.8,
minFraction = 0.4, binSize = 0.02)
#' Perform the correspondence analysis
data_cent <- groupChromPeaks(data_cent, param = pdp)
```
Next we evaluate the results from the correspondence analysis on a different m/z
slice containing isomers Leucine and Isoleucine ([M+H]+ m/z 132.10191). Setting
`simulate = FALSE` in `plotChromPeakDensity` will show the actual results from
the correspondence analysis.
```{r correspondence-evaluate, fig.cap = "Result of correspondence on a slice containing the isomers Leucine and Isoleucine.", fig.width = 10, fig.heigt = 7}
#' Plot the results for an m/z slice containing Leucine and Isoleucine
mzr <- 132.10191 + c(-0.01, 0.01)
chr <- chromatogram(data_cent, mz = mzr, aggregationFun = "max")
plotChromPeakDensity(chr, simulate = FALSE)
```
Despite being very close, chromatographic peaks of isomers were successfully
grouped into separate features.
Results from the correspondence analysis can be accessed with the
`featureDefinition` function. This function returns a data frame with the
retention time and m/z ranges of the apex positions from the peaks assigned to
the feature and their respective indices in the `chromPeaks` matrix.
```{r correspondence-featureDefinitions}
#' Definition of the features
featureDefinitions(data_cent)
```
Note that the EIC `chr` above would also contain the correspondence results for
the selected m/z range which could also be accessed with
`featureDefinitions(chr)`.
Also, we can calculate simple per-feature summary statistic with the
`featureSummary` function. This function reports for each feature the total
number and the percentage of samples in which a peak was detected and the total
numbers and percentage of these samples in which more than one peak was assigned
to the feature.
```{r correspondence-featureSummary}
#' Per-feature summary.
head(featureSummary(data_cent))
```
The final result from the LC-MS data preprocessing is a matrix with feature
abundances, rows being features, columns samples. Such a matrix can be extracted
with the `featureValues` function from the result object (see further below for
an alternative, preferred way to extract preprocessing results with the
`quantify` method). The function takes two additional parameters `value` and
`method`: `value` defines the column in the `chromPeaks` table that should be
reported in the matrix, and `method` the approach to handle cases in which more
than one peak in a sample is assigned to the feature.
Below we set `value = "into"` (the default) to extract the total integrated peak
area and `method = "maxint"` to report the peak area of the peak with the
largest intensity for features with multiple peaks in a sample.
```{r correspondence-featureValue}
#' feature intensity matrix
fmat <- featureValues(data_cent, value = "into", method = "maxint")
head(fmat)
```
While we do have abundances reported for most features, we might also have
missing values for some, like for feature *FT002* in the second sample
above. Such `NA`s occur if no chromatographic peak was assigned to a feature,
either because peak detection failed, or because the corresponding ion is absent
in the respective sample. One possibility to deal with such missing values is
data imputation. With the `fillChromPeaks` function, `xcms` provides however an
alternative approach that integrates the signal measured at the m/z - retention
time region of the feature in the original files of samples for which an `NA`
was reported hence *filling-in* missing peak data. Two different approaches are
available to define the m/z - retention time region from which signal should be
integrated: `ChromPeakAreaParam` defines the region based on the m/z and rt
range of the identified chromatographic peaks of the feature and
`FillChromPeaksParam` uses columns `"mzmin"`, `"mzmax"`, `"rtmin"` and `"rtmax"`
in the `featureDefinitions` data frame. While the `FillChromPeaksParam`
resembles the peak filling-in of the *original* `xcms` implementation, it is
suggested to use the `ChromPeakAreaParam` approach instead.
Below we perform the gap-filling with the `ChromPeakAreaParam` approach.
```{r fillChromPeaks}
#' Number of missing values
sum(is.na(fmat))
data_cent <- fillChromPeaks(data_cent, param = ChromPeakAreaParam())