Parse CSQ entities into individual columns #132

audyavar · 2019-04-30T23:19:42Z

Hello -

I have annotated my VCF files with VEP. I love the vcfR2tidy function, which parses a VCF file into separate data frames for visualization and downstream analysis. However, I noticed that the CSQ column (where VEP writes all of its functional annotations, separated by "|") is not split into individual columns, even though the meta data frame lists all the annotation fields present in this column. Please advise.

Convert to tidy dataframe

test2 <- vcfR2tidy(vcfv, info_only = TRUE)
names(test2)
test2 <- extract_info_tidy(vcfv, info_fields = NULL, info_types = TRUE)

The CSQ column still looks like this:
[3] "-|downstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||||||||||rs773396084|2952|-1||deletion|HGNC|HGNC:28208||2|A2||ENSP00000343864|Q5SV97||UPI000022DAF4||||||||||||4.879e-05|9.151e-05|3.645e-05|0|0|7.49e-05|5.849e-05|0|4.991e-05|9.151e-05|gnomAD_AFR||||||||,-|intron_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000379407|protein_coding||9/14||||||||rs773396084||1||deletion|HGNC|HGNC:25284||1|A2|CCDS53256.1|ENSP00000368717|Q494U1||UPI00005764FF||||||||||||4.879e-05|9.151e-05|3.645e-05|0|0|7.49e-05|5.849e-05|0|4.991e-05|9.151e-05|gnomAD_AFR||||||||,-|intron_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000379409|protein_coding||8/14||||||||rs773396084||1||deletion|HGNC|HGNC:25284||2|A2||ENSP00000368719|Q494U1||UPI0000D61E06||||||||||||4.879e-05|9.151e-05|3.645e-05|0|0|7.49e-05|5.849e-05|0|4.991e-05|9.151e-05|gnomAD_AFR||||||||,-|intron_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000379410|protein_coding||9/15||||||||rs773396084||1||deletion|HGNC|HGNC:25284|YES|1|P3|CCDS4.1|ENSP00000368720|Q494U1||UPI00001416D8||||||||||||4.879e-05|9.151e-05|3.645e-05|0|0|7.49e-05|5.849e-05|0|4.991e-05|9.151e-05|gnomAD_AFR||||||||,-|downstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding||||||||||rs773396084|2953|-1||deletion|HGNC|HGNC:28208|YES|5|P2|CCDS76083.1|ENSP00000414022|Q5SV97||UPI0003E30FA7||||||||||||4.879e-05|9.151e-05|3.645e-05|0|0|7.49e-05|5.849e-05|0|4.991e-05|9.151e-05|gnomAD_AFR||||||||,-|downstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron||||||||||rs773396084|2953|-1||deletion|HGNC|HGNC:28208||1||||||||||||||||||4.879e-05|9.151e-05|3.645e-05|0|0|7.49e-05|5.849e-05|0|4.991e-05|9.151e-05|gnomAD_AFR||||||||,-|downstream_gene_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000480267|retained_intron||||||||||rs773396084|729|1||deletion|HGNC|HGNC:25284||3||||||||||||||||||4.879e-05|9.151e-05|
3.645e-05|0|0|7.49e-05|5.849e-05|0|4.991e-05|9.151e-05|gnomAD_AFR||||||||,-|upstream_gene_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000491024|protein_coding||||||||||rs773396084|1260|1|cds_start_NF|deletion|HGNC|HGNC:25284||3|||ENSP00000462558||J3KSM5|UPI000268AE1F||||||||||||4.879e-05|9.151e-05|3.645e-05|0|0|7.49e-05|5.849e-05|0|4.991e-05|9.151e-05|gnomAD_AFR||||||||"

knausb · 2019-05-01T15:45:08Z

Hi Akshata, thanks for bringing this to my attention! I do not use VEP, so I would not have noticed this. However, I really need a minimal reproducible example to address it. I've put some suggestions on how to do this here. If you could share a small part of your data, or modify one of the vcfR example data sets so that I can reproduce the issue, I could work towards a solution. Thanks!

audyavar · 2019-05-15T07:23:20Z

Sure. Attached is an example VCF file. It has two annotation columns: one from snpEff (ANN) and the other from VEP (CSQ). Both remain unparsed when I run testg <- vcfR2tidy(vcfg).

My other question: these two columns contain annotations for multiple SNPs of the same gene. Is there a way to parse out the individual SNP annotations?

annvcf_variant_effect_output.vcf.zip
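
On the second question, here is a minimal base R sketch of one possible approach, not a vcfR feature. In the CSQ example earlier in the thread, the entries for one variant are separated by "," and the fields within each entry by "|". The sketch expands a toy table to one row per annotation. The data frame `fix`, the shortened field list `csq_fields`, and the helper `split_keep_empty` are all made up for illustration. It also assumes every variant has a non-missing CSQ value and that "," never occurs inside a field.

```r
# Toy stand-in for the $fix table returned by vcfR2tidy(); values are invented.
fix <- data.frame(
  CHROM = c("1", "1"),
  POS   = c(100L, 200L),
  CSQ   = c("-|intron_variant|MODIFIER|GENE1|,-|downstream_gene_variant|MODIFIER|GENE2|YES",
            "A|missense_variant|MODERATE|GENE3|YES"),
  stringsAsFactors = FALSE
)
# Hypothetical, shortened field list; real names come from the CSQ header line.
csq_fields <- c("Allele", "Consequence", "IMPACT", "SYMBOL", "CANONICAL")

# Split on a fixed delimiter; appending one delimiter first makes base
# strsplit() keep a trailing empty field instead of dropping it.
split_keep_empty <- function(x, sep) {
  strsplit(paste0(x, sep), sep, fixed = TRUE)
}

# One list element per variant, holding that variant's annotation strings.
per_variant   <- split_keep_empty(fix$CSQ, ",")
n_per_variant <- lengths(per_variant)

# Split every annotation into its fields and stack them as matrix rows.
field_list <- split_keep_empty(unlist(per_variant), "|")
stopifnot(all(lengths(field_list) == length(csq_fields)))
csq <- do.call(rbind, field_list)
colnames(csq) <- csq_fields
csq[!nzchar(csq)] <- NA   # empty fields become NA

# Repeat the variant columns once per annotation, then bind the parsed fields.
long <- cbind(
  fix[rep(seq_len(nrow(fix)), n_per_variant), c("CHROM", "POS")],
  as.data.frame(csq, stringsAsFactors = FALSE)
)
rownames(long) <- NULL
```

The result, `long`, has one row per annotation with the variant's CHROM and POS repeated, so it can be filtered on a single field such as SYMBOL or IMPACT.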

yannickwurm · 2020-06-13T19:01:28Z

Dear Brian, thanks for the excellent tool. Just a heads up that I've hit the same hurdle with VEP's CSQ column staying as one single column.

yannickwurm · 2020-06-13T19:51:14Z

FWIW, I overcame the issue by adding the following to my loading: (apologies for the hacky code)

# b is a list of tibbles: b$fix (one row per variant) and b$meta (header metadata)
b <- vcfR2tidy(genotypes, info_only = TRUE, verbose = TRUE)

# getting column names from the CSQ header description
header <- gsub(x = b$meta$Description[b$meta$ID == "CSQ"], 
               pattern = "Consequence annotations from Ensembl VEP. Format: ", 
               replacement = "")
VEP_columns <- unlist(stringr::str_split(string = header, pattern = "\\|"))

# getting contents; the CSQ column lives in b$fix
# (a variant with several comma-separated annotations will fail the length check below)
list_of_csq_vectors <- stringr::str_split(string = b$fix$CSQ, pattern = "\\|")
if (any(lengths(list_of_csq_vectors) != length(VEP_columns))) {
   stop("Uh-oh. VEP isn't giving me the right numbers of columns in the CSQ field")
}

# fill a character matrix with one row per variant and one column per VEP field
csq_data <- matrix(nrow = nrow(b$fix), ncol = length(VEP_columns))
colnames(csq_data) <- VEP_columns
for (i in seq_len(nrow(csq_data))) {
    csq_data[i, ] <- list_of_csq_vectors[[i]]
}
csq_data[!nzchar(csq_data)] <- NA
csq_data <- data.frame(csq_data, stringsAsFactors = FALSE)

# merge the CSQ and initial $fix
b$fix <- cbind(b$fix[, colnames(b$fix) != "CSQ"], csq_data)

Edited the fix to use stringr::str_split, because base R strsplit doesn't behave as it should when the last column is empty (it ignores it!)
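
For anyone who wants to see the strsplit behaviour mentioned above, here is a small base R illustration using a made-up four-field string. The padding trick at the end is just one possible workaround for when stringr is not available; it is not something proposed in this thread.

```r
# Made-up CSQ-like string with four fields, the last two empty.
x <- "A|missense_variant||"

# Base strsplit() drops the empty field after the final delimiter.
base_fields <- strsplit(x, "|", fixed = TRUE)[[1]]
length(base_fields)    # 3, although the string has four fields

# Appending one extra delimiter before splitting restores the field count.
padded_fields <- strsplit(paste0(x, "|"), "|", fixed = TRUE)[[1]]
length(padded_fields)  # 4
```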

gorgitko mentioned this issue Oct 11, 2022

new feature: separation of annotation columns in tidy dataframes from VCFs annotated by snpEff or VEP tools #201

Closed
