DDA identification - de novo #356
bittremieux
started this conversation in
Potential new module to discuss
aim of the new module
This module intends to evaluate the performance of de novo peptide sequencing tools. De novo tools annotate MS/MS spectra with their originating peptides without any external information, such as a protein database. Due to the lack of dependence on a set of predefined proteins, de novo sequencing has several important use cases, including:
full description of the new module
This module will be restricted to the identification of MS/MS spectra measured using DDA. Thus, the input and output can be simplified as:
Data
As the evaluation data, the nine-species benchmark dataset introduced by DeepNovo can be used. This dataset is commonly used to evaluate (and train) de novo tools. This dataset contains MS/MS data from nine different species as follows:
Recently, a more balanced subset of this data has been proposed that restricts each species to approximately 100,000 high-quality PSMs, amounting to 779,796 spectra and 180,238 unique peptides in total (Noble, under review at Scientific Data). I propose that we use this most recent version, so that the benchmark is based on a consistent and manageable dataset. The DOI for this dataset is pending.
Q: What would be the best format for this data? Nine different MGF files, one for each species? Or one (relatively big) MGF file with all spectra?
A: Consensus is to combine all of the data in a single MGF file for simplicity.
The following search settings were used to obtain the PSMs:
Metric calculation
The performance will be evaluated at the amino acid and peptide level. As introduced by DeepNovo, a correct amino acid prediction is defined as any predicted amino acid whose mass differs by less than 0.1 Da from the corresponding ground truth amino acid. Additionally, this predicted amino acid must have either a prefix or suffix that differs by no more than 0.5 Da in mass from the corresponding amino acid sequence in the ground truth peptide. Correct peptides are defined as those sequences where all amino acid predictions meet these criteria, ensuring that only fully accurate predictions are considered correct at the peptide level.
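To make the matching rule concrete, here is a minimal sketch of one possible reading of it. It is not the DeepNovo reference implementation: the function names are hypothetical, the mass table is a small subset of unmodified residues, and combining the prefix and suffix passes with a simple OR is an assumption on my part.

```python
# Monoisotopic residue masses in Da (illustrative subset, no modifications).
AA_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "F": 147.06841, "R": 156.10111,
}


def match_from_start(pred, truth, aa_tol=0.1, cum_tol=0.5):
    """Flag each predicted residue that matches the ground truth,
    walking both sequences from the first residue with two pointers."""
    flags = [False] * len(pred)
    i = j = 0
    cum_pred = cum_truth = 0.0
    while i < len(pred) and j < len(truth):
        m_pred, m_truth = AA_MASS[pred[i]], AA_MASS[truth[j]]
        end_pred, end_truth = cum_pred + m_pred, cum_truth + m_truth
        if abs(end_pred - end_truth) <= cum_tol and abs(m_pred - m_truth) < aa_tol:
            # Residue mass and cumulative mass both agree: count as correct.
            flags[i] = True
            cum_pred, cum_truth = end_pred, end_truth
            i += 1
            j += 1
        elif end_pred < end_truth:
            # Prediction is behind in mass: advance the predicted sequence.
            cum_pred = end_pred
            i += 1
        else:
            # Ground truth is behind (or tied): advance the ground truth.
            cum_truth = end_truth
            j += 1
    return flags


def match_amino_acids(pred, truth):
    """Return per-residue flags and whether the whole peptide is correct."""
    forward = match_from_start(pred, truth)
    backward = match_from_start(pred[::-1], truth[::-1])[::-1]
    # Assumption: a residue is correct if either direction matches it.
    flags = [f or b for f, b in zip(forward, backward)]
    peptide_ok = len(pred) == len(truth) and all(flags)
    return flags, peptide_ok


flags, peptide_ok = match_amino_acids("GAPEK", "AGPEK")
```

In this example the first two residues are swapped, so only `P`, `E`, and `K` are flagged as correct and the peptide as a whole is not. Note that isobaric residues such as `L` and `I` count as matches under a purely mass-based criterion.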
This information will be condensed into two values:
Precision is defined as in standard classification: the proportion of correct predictions among all predictions. Coverage is analogous to recall, but adapted to the fact that some de novo tools don't report results for all spectra; it represents the proportion of inputs for which a prediction was made. We will use the precision at full coverage, i.e. considering all input amino acids or peptides, respectively. Spectra for which a tool can't make a prediction will be considered fully incorrect.
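As a small illustration of how these quantities relate at the peptide level, the sketch below assumes each spectrum has already been scored as correct (`True`), incorrect (`False`), or without a prediction (`None`). The function name and this encoding are hypothetical, chosen only for the example.

```python
def summarize_peptide_level(outcomes):
    """Compute coverage and precision from per-spectrum outcomes.

    outcomes: list with True (correct), False (incorrect),
    or None (the tool made no prediction for that spectrum).
    """
    n_total = len(outcomes)
    n_predicted = sum(o is not None for o in outcomes)
    n_correct = sum(o is True for o in outcomes)
    return {
        # Fraction of spectra for which a prediction was reported.
        "coverage": n_predicted / n_total,
        # Missing predictions count as incorrect: divide by all spectra.
        "precision_at_full_coverage": n_correct / n_total,
        # For comparison only: precision among reported predictions.
        "precision_among_predicted": n_correct / n_predicted if n_predicted else 0.0,
    }


summary = summarize_peptide_level([True, True, False, None])
```

For the four spectra above, coverage is 0.75 and precision at full coverage is 0.5, whereas precision restricted to the reported predictions would be 2/3. Dividing by all input spectra keeps a tool from improving its score by skipping difficult spectra.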
This can be visualized as two swarm plots side by side, one per metric, showing the performance of all tools.
To calculate the metrics, the output peptide sequences need to be compared with the ground truth peptide sequences. Thus, users will need to upload a custom CSV file with the necessary information, with peptide sequences preferably in ProForma notation. As no standardized output format is currently used by the many newly introduced de novo tools, I propose not to try to support all these different formats directly, but instead to request a simple CSV format, irrespective of the tool used (columns: file name, spectrum index, peptide sequence).
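A minimal sketch of what reading such a file could look like is given below. The header names `file_name`, `spectrum_index`, and `peptide_sequence` are placeholders that I made up for illustration, as are the file name and the example rows; the actual column names would still need to be agreed upon.

```python
import csv
import io

# Hypothetical header names for the three proposed columns.
REQUIRED_COLUMNS = ("file_name", "spectrum_index", "peptide_sequence")


def read_predictions(handle):
    """Read predictions into a dict keyed by (file name, spectrum index).

    An empty peptide sequence is stored as None, meaning the tool
    made no prediction for that spectrum.
    """
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Missing required column(s): {', '.join(missing)}")
    predictions = {}
    for row in reader:
        key = (row["file_name"], int(row["spectrum_index"]))
        sequence = row["peptide_sequence"].strip()
        predictions[key] = sequence or None
    return predictions


# Made-up example content; the second row has no prediction.
example_csv = io.StringIO(
    "file_name,spectrum_index,peptide_sequence\n"
    "nine_species.mgf,0,PEPTIDEK\n"
    "nine_species.mgf,1,\n"
    "nine_species.mgf,2,EM[Oxidation]EVEESPEK\n"
)
predictions = read_predictions(example_csv)
```

Keying on the file name and spectrum index lets the predictions be joined to the ground truth without depending on any tool-specific identifier.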
potential reviewers
No response
Will you be able to work on the implementation (coding) yourself, with additional help from the ProteoBench maintainers?
any other information
Tagging @PominovaMS.