This is a fresh software. Initially, I recommend you cross-check the results with InfernoRDN (homepage). This one is a rewrite of the RRollup method found in InfernoRDN.
Known discrepancies from original:
- Filtering of a dataset containing ~1200 proteins led to this version filtering out four proteins not filtered by InfernoRDN. Closer inspection showed that these didn't fulfill the criteria of 50% non-missing values in at least one peptide (10/21 present) but was anyway retained by InfernoRDN.
I wish:
- Optionally write the scaled peptides, and the scaled peptides without outlier matrices
- Optional profile plots of peptides and resulting protein
install.packages(c("tidyverse", "outliers", "argparser", "matrixStats"))
Can be executed from terminal using the protein_rollup.R
script present in the RWorkflowModules Git repo (https://github.com/Jakob37/RWorkflowModules).
If you have multiple annotation columns in your data matrix, you need to use a design matrix to distinguish these
Rscript run_protein_rollup.R \
--rdf_fp testdata/dummy_df.tsv \
--ddf_fp ProteinRollup/testdata/dummy_df_design.tsv \
--sample_col sample \
--protein_col Protein \
--out_fp test_out.tsv
If having an expression dataset with a single annotation column specifying the proteins, the following call can be executed.
Rscript run_protein_rollup.R \
--rdf_fp testdata/dummy_df.tsv \
--out_fp test_out.tsv \
--one_column_mode TRUE
library(tidyverse)
source("ProteinRollup.R")
peptide_rdf <- read_tsv("testdata/dummy_df.tsv")
peptide_mat <- peptide_rdf[, -1] %>% as.matrix()
protein_ids <- peptide_rdf$Protein
protein_rollup(protein_ids, peptide_mat)
- A reference peptide is selected based on (1) fewest missing values and (2) if several peptides ties - the one with highest median intensity
- Ratios to reference are calculated for all. The ratio is simply subtracting the log-scale reference from each peptide
- For each non-reference peptide - the median distance to the reference are calculated
- If lower than
min_overlap
, then set to zero in the median calculation (this is from Inferno, maybe to NA would make more sense? Now bias towards reference)
- If lower than
- The other peptides are shifted to the reference median. Intrapeptide variances are retained.
- Outlier removal is performed
minPs
specifies how many peptides needs to be present with value in a sample for the outlier calculation to be carried out (default 5)pvalue
specify how certain the Grubb outlier test should be for removal
- Peptides can optionally be centered (TODO: What consequence?)
- A rollup score is calculated (meaning what? why is not outlier removed peptides used here?)
- The final protein matrix is returned
min_overlap
- How many values should be in common between reference and other peptides for the other peptides to be considered