DDA identification - de novo #356

bittremieux · 2024-09-03T11:55:28Z

bittremieux
Sep 3, 2024
Maintainer

aim of the new module

This module intends to evaluate the performance of de novo peptide sequencing tools. De novo tools annotate MS/MS spectra with their originating peptides without any external information, such as a protein database. Due to the lack of dependence on a set of predefined proteins, de novo sequencing has several important use cases, including:

Immunopeptidomics, as non-specific digestion results in a very large search space.
Metaproteomics, as the organisms to be considered might not be known.
Discovery of unexpected peptide and protein variants not included in the database.
...

full description of the new module

This module will be restricted to the identification of MS/MS spectra measured using DDA. Thus, the input and output can be simplified as:

Input MS/MS spectra (in MGF format).
Output peptide sequences assigned to the MS/MS spectra (e.g. in mzTab or CSV format).
Ground truth peptide sequences for each of the MS/MS spectra (obtained using sequence database searching) to calculate evaluation metrics (e.g. in mzTab or CSV format).
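To make the input side concrete, here is a minimal sketch of reading spectra from an MGF file using only the standard library. The function name `read_mgf` is hypothetical, and the sketch assumes the common MGF layout (`BEGIN IONS`, `KEY=value` parameter lines, whitespace-separated peak lines, `END IONS`); a real implementation would more likely rely on an existing parser.

```python
import io
from typing import Dict, Iterator, List, TextIO, Tuple

Peak = Tuple[float, float]


def read_mgf(handle: TextIO) -> Iterator[Tuple[Dict[str, str], List[Peak]]]:
    """Yield (parameters, peaks) for every spectrum in an MGF stream.

    Hypothetical helper: assumes KEY=value parameter lines and
    'm/z intensity' peak lines between BEGIN IONS and END IONS.
    """
    params: Dict[str, str] = {}
    peaks: List[Peak] = []
    inside = False
    for raw in handle:
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            params, peaks, inside = {}, [], True
        elif line == "END IONS":
            if inside:
                yield params, peaks
            inside = False
        elif inside:
            if "=" in line and not line[0].isdigit():
                # Parameter line such as TITLE=... or PEPMASS=...
                key, value = line.split("=", 1)
                params[key] = value
            else:
                # Peak line: m/z followed by intensity
                fields = line.split()
                if len(fields) >= 2:
                    peaks.append((float(fields[0]), float(fields[1])))


example = """BEGIN IONS
TITLE=spec0
PEPMASS=500.25
CHARGE=2+
100.1 10
200.2 20.5
END IONS
"""
spectra = list(read_mgf(io.StringIO(example)))
```

Each spectrum comes back as a dictionary of header fields plus a list of (m/z, intensity) pairs, which is all a de novo tool needs as input.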

Data

As the evaluation data, the nine-species benchmark dataset introduced by DeepNovo can be used. This dataset is commonly used to evaluate (and train) de novo tools. This dataset contains MS/MS data from nine different species as follows:

Vigna mungo: 932,848 spectra from 24 runs.
Mus musculus: 306,786 spectra from 13 runs.
Methanosarcina mazei: 3,728,183 spectra from 72 runs.
Bacillus subtilis: 4,336,428 spectra from 106 runs.
Candidatus Endoloripes: 2,272,023 spectra from 11 runs.
Solanum lycopersicum: 603,506 spectra from 60 runs.
Saccharomyces cerevisiae: 1,477,397 spectra from 27 runs.
Apis mellifera: 823,169 spectra from 17 runs.
Homo sapiens: 684,821 spectra from 26 runs.

Recently, a more balanced subset of this data has been proposed that restricts each species to approximately 100,000 high-quality PSMs, amounting to 779,796 spectra and 180,238 unique peptides in total (Noble, under review at Scientific Data). I propose that we use this latter version, so that the benchmark relies on a consistent and manageable dataset. The DOI for this dataset is pending.

Q: What would be the best format for this data? Nine different MGF files, one for each species? Or one (relatively big) MGF file with all spectra?

A: Consensus is to combine all of the data in a single MGF file for simplicity.

The following search settings were used to obtain the PSMs:

Static modification: Cys carbamidomethylation
Variable modifications: Met oxidation, Asn deamidation, Gln deamidation, N-term acetylation, N-term loss of ammonia, N-term carbamylation, the combination of N-term loss of ammonia and N-term carbamylation
Protease: trypsin
Precursor mass tolerance: between 10 ppm and 30 ppm, depending on the species
Fragment mass tolerance: 0.02 Da or 0.05 Da, depending on the species
Search engine: Tide (using XCorr scoring and Tailor calibration) + Percolator

Metric calculation

The performance will be evaluated at the amino acid and peptide level. As introduced by DeepNovo, a correct amino acid prediction is defined as any predicted amino acid whose mass differs by less than 0.1 Da from the corresponding ground truth amino acid. Additionally, the prefix or suffix of this predicted amino acid must differ in mass by no more than 0.5 Da from the corresponding prefix or suffix in the ground truth peptide. Correct peptides are defined as those sequences where all amino acid predictions meet these criteria, ensuring that only fully accurate predictions are considered correct at the peptide level.
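As an illustration of these criteria, the following is a minimal sketch, not a reference implementation. The function names `aa_matches` and `peptide_match` are hypothetical. It makes several assumptions: peptides are plain one-letter strings without variable modifications, "C" stands for carbamidomethylated cysteine (the static modification), a predicted residue counts as correct if any ground truth residue satisfies both tolerances, and a peptide is only correct if the two sequences also have the same length.

```python
from typing import List

# Monoisotopic residue masses in Da. "C" includes carbamidomethylation
# (assumption: the static modification is always present).
AA_MASS = {
    "G": 57.021464, "A": 71.037114, "S": 87.032028, "P": 97.052764,
    "V": 99.068414, "T": 101.047679, "C": 160.030649, "L": 113.084064,
    "I": 113.084064, "N": 114.042927, "D": 115.026943, "Q": 128.058578,
    "K": 128.094963, "E": 129.042593, "M": 131.040485, "H": 137.058912,
    "F": 147.068414, "R": 156.101111, "Y": 163.063329, "W": 186.079313,
}


def aa_matches(pred: str, truth: str,
               aa_tol: float = 0.1, cum_tol: float = 0.5) -> List[bool]:
    """Flag each predicted residue as correct or incorrect.

    A residue is correct if some ground truth residue has nearly the same
    mass (< aa_tol) and a matching prefix OR suffix mass (<= cum_tol).
    """
    pred_masses = [AA_MASS[aa] for aa in pred]
    truth_masses = [AA_MASS[aa] for aa in truth]
    pred_total, truth_total = sum(pred_masses), sum(truth_masses)

    flags = []
    pred_prefix = 0.0
    for m in pred_masses:
        pred_suffix = pred_total - pred_prefix - m
        found = False
        truth_prefix = 0.0
        for n in truth_masses:
            truth_suffix = truth_total - truth_prefix - n
            same_residue = abs(m - n) < aa_tol
            same_prefix = abs(pred_prefix - truth_prefix) <= cum_tol
            same_suffix = abs(pred_suffix - truth_suffix) <= cum_tol
            if same_residue and (same_prefix or same_suffix):
                found = True
                break
            truth_prefix += n
        flags.append(found)
        pred_prefix += m
    return flags


def peptide_match(pred: str, truth: str) -> bool:
    """A peptide is correct if every residue matches (and lengths agree)."""
    return len(pred) == len(truth) and all(aa_matches(pred, truth))
```

Under these assumptions, an I/L substitution such as `PEPTLDE` versus `PEPTIDE` still counts as fully correct because both residues have the same mass, whereas swapping two neighbouring residues of different mass makes both of them incorrect.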

This information will be condensed into two values:

The amino acid precision at coverage=1.
The peptide precision at coverage=1.

Precision is defined as in standard classification: the proportion of correct predictions among all predictions. Coverage is analogous to recall, but adapted to the fact that some de novo tools don't report results for all spectra. Thus, coverage represents the proportion of the input for which a prediction is made. We will use the precision at full coverage, i.e. considering predictions for all input amino acids or peptides, respectively. Spectra for which a tool can't make a prediction will be considered as fully incorrect.
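A minimal sketch of how both values could be computed from per-spectrum results is given below. The function name and the input layout are hypothetical. It also makes one assumption that the description leaves open: for a spectrum without a prediction, every ground truth amino acid is added to the denominator as an incorrect prediction.

```python
from typing import List, Optional, Tuple

# One entry per input spectrum:
#   (per-residue correctness flags, or None if the tool made no prediction,
#    number of amino acids in the ground truth peptide)
Result = Tuple[Optional[List[bool]], int]


def precision_at_full_coverage(results: List[Result]) -> Tuple[float, float]:
    """Return (amino acid precision, peptide precision) at coverage=1."""
    aa_correct = aa_total = pep_correct = 0
    for flags, n_truth in results:
        if flags is None:
            # Assumption: a missing prediction counts as n_truth wrong residues.
            aa_total += n_truth
            continue
        aa_correct += sum(flags)
        aa_total += len(flags)
        # A peptide is correct only if all residues are correct and the
        # predicted length equals the ground truth length.
        pep_correct += int(len(flags) == n_truth and all(flags))
    aa_precision = aa_correct / aa_total if aa_total else 0.0
    pep_precision = pep_correct / len(results) if results else 0.0
    return aa_precision, pep_precision


example = [
    ([True] * 7, 7),                                      # fully correct
    ([True, True, True, False, False, True, True], 7),    # two wrong residues
    (None, 6),                                            # no prediction
]
aa_prec, pep_prec = precision_at_full_coverage(example)
```

In this toy example 12 of 20 amino acids are correct and one of three peptides is correct, so a tool that skips spectra is penalised at both levels.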

This can be implemented as two swarmplots side by side, for both metrics, to visualize the performance of all tools.

To calculate the metrics, the output peptide sequences need to be compared with the ground truth peptide sequences. Thus, users will need to upload a custom CSV with the necessary information, with peptide sequences preferably written in ProForma notation. As no standardized output format is currently being used by the many newly introduced de novo tools, I propose to not try to directly support all these different formats, but instead to request a simple CSV format, irrespective of the tool used (columns: file name, spectrum index, peptide sequence).
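To illustrate what such an upload could look like, here is a minimal sketch of a loader for the three-column CSV. The header names `file_name`, `spectrum_index`, and `peptide_sequence` are placeholders that I made up for this example, as the exact column names have not been fixed yet; rows with an empty sequence are treated as "no prediction".

```python
import csv
import io
from typing import Dict, TextIO, Tuple

# Hypothetical header names; the proposal only fixes the column meanings.
REQUIRED_COLUMNS = ("file_name", "spectrum_index", "peptide_sequence")


def load_predictions(handle: TextIO) -> Dict[Tuple[str, int], str]:
    """Map (file name, spectrum index) to the predicted peptide sequence."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"Missing required column(s): {', '.join(missing)}")

    predictions: Dict[Tuple[str, int], str] = {}
    for row in reader:
        sequence = row["peptide_sequence"].strip()
        if not sequence:
            # Empty sequence: the tool made no prediction for this spectrum.
            continue
        key = (row["file_name"], int(row["spectrum_index"]))
        predictions[key] = sequence
    return predictions


example_csv = """file_name,spectrum_index,peptide_sequence
run1.mgf,0,PEPTIDEK
run1.mgf,1,PEPTM[Oxidation]IDEK
run1.mgf,2,
"""
predictions = load_predictions(io.StringIO(example_csv))
```

Keying on the file name and spectrum index keeps the comparison with the ground truth independent of whatever spectrum titles a given tool writes out.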

potential reviewers

No response

Will you be able to work on the implementation (coding) yourself, with additional help from the ProteoBench maintainers?

yes
no

any other information

Tagging @PominovaMS.
