Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse CSQ entitities into individual columns #132

Open
audyavar opened this issue Apr 30, 2019 · 4 comments
Open

Parse CSQ entitities into individual columns #132

audyavar opened this issue Apr 30, 2019 · 4 comments

Comments

@audyavar
Copy link

audyavar commented Apr 30, 2019

Hello -

I have annotated my VCF files with VEP. I love the vcfR2tidy function which parses out the vcf file into different data frames for visualization and downstream analysis. However, I noticed that the CSQ column (where all the VEP functional annotation is added separated by "|") does not parse out into the individual columns, even though the meta data frame shows the details of all the annotations present in this column. Please advice.

Convert to tidy dataframe

test2 <- vcfR2tidy(vcfv, info_only = TRUE)
names(test)
test2 <- extract_info_tidy(vcfv, info_fields = NULL, info_types = TRUE)

The CSQ column still looks like this :
[3] "-|downstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||||||||||rs773396084|2952|-1||deletion|HGNC|HGNC:28208||2|A2||ENSP00000343864|Q5SV97||UPI000022DAF4||||||||||||4.879e-05|9.151e-05|3.645e-05|0|0|7.49e-05|5.849e-05|0|4.991e-05|9.151e-05|gnomAD_AFR||||||||,-|intron_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000379407|protein_coding||9/14||||||||rs773396084||1||deletion|HGNC|HGNC:25284||1|A2|CCDS53256.1|ENSP00000368717|Q494U1||UPI00005764FF||||||||||||4.879e-05|9.151e-05|3.645e-05|0|0|7.49e-05|5.849e-05|0|4.991e-05|9.151e-05|gnomAD_AFR||||||||,-|intron_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000379409|protein_coding||8/14||||||||rs773396084||1||deletion|HGNC|HGNC:25284||2|A2||ENSP00000368719|Q494U1||UPI0000D61E06||||||||||||4.879e-05|9.151e-05|3.645e-05|0|0|7.49e-05|5.849e-05|0|4.991e-05|9.151e-05|gnomAD_AFR||||||||,-|intron_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000379410|protein_coding||9/15||||||||rs773396084||1||deletion|HGNC|HGNC:25284|YES|1|P3|CCDS4.1|ENSP00000368720|Q494U1||UPI00001416D8||||||||||||4.879e-05|9.151e-05|3.645e-05|0|0|7.49e-05|5.849e-05|0|4.991e-05|9.151e-05|gnomAD_AFR||||||||,-|downstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding||||||||||rs773396084|2953|-1||deletion|HGNC|HGNC:28208|YES|5|P2|CCDS76083.1|ENSP00000414022|Q5SV97||UPI0003E30FA7||||||||||||4.879e-05|9.151e-05|3.645e-05|0|0|7.49e-05|5.849e-05|0|4.991e-05|9.151e-05|gnomAD_AFR||||||||,-|downstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron||||||||||rs773396084|2953|-1||deletion|HGNC|HGNC:28208||1||||||||||||||||||4.879e-05|9.151e-05|3.645e-05|0|0|7.49e-05|5.849e-05|0|4.991e-05|9.151e-05|gnomAD_AFR||||||||,-|downstream_gene_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000480267|retained_intron||||||||||rs773396084|729|1||deletion|HGNC|HGNC:25284||3||||||||||||||||||4.879e-05|9.151e-05|3.645e-05|0|0|7.49e-05|5.849e-05|0|4.991e-05|9.151e-05|gnomAD_AFR||||||||,-|upstream_gene_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000491024|protein_coding||||||||||rs773396084|1260|1|cds_start_NF|deletion|HGNC|HGNC:25284||3|||ENSP00000462558||J3KSM5|UPI000268AE1F||||||||||||4.879e-05|9.151e-05|3.645e-05|0|0|7.49e-05|5.849e-05|0|4.991e-05|9.151e-05|gnomAD_AFR||||||||"

@knausb
Copy link
Owner

knausb commented May 1, 2019

Hi Akshata, thanks for bringing this to may attention! I do not use VEP so I would not have noticed this. However, I really need a minimal reproducible example to address this. I've put some suggestions on how to do this here. If you could share a small part of your data or modify one of the vcfR example data sets so that I can reproduce the issue I could work towards a solution. Thanks!

@audyavar
Copy link
Author

Sure. Attached is an example vcf file. I have 2 columns of annotations - 1 from snpEff (ANN) and other from VEP (CSQ). Both of them remain unparsed when I use testg <- vcfR2tidy(vcfg) .

Other question I have for you is that these 2 columns contain annotations for multiple SNPs of the same gene. Is there a way to parse out the individual SNP annotations?

annvcf_variant_effect_output.vcf.zip

@yannickwurm
Copy link

yannickwurm commented Jun 13, 2020

Dear Brian, thanks for the excellent tool. Just a heads up that I've hit the same hurdle with VEP's CSQ column staying as one single column.

@yannickwurm
Copy link

yannickwurm commented Jun 13, 2020

FWIW, I overcame the issue by adding the following to my loading: (apologies for the hacky code)

#b is tibble 
b <- vcfR2tidy(genotypes, info_only = TRUE, verbose = TRUE)

# getting column names
header <- gsub(x = b$meta$Description[b$meta$ID == "CSQ"], 
               pattern = "Consequence annotations from Ensembl VEP. Format: ", 
               replacement = "")
VEP_columns <- unlist(stringr::str_split(string = header, pattern = "\\|"))

# getting contents
list_of_csq_vectors <- stringr::str_split(string = b$CSQ, pattern = "\\|")
if (any(lapply(list_of_csq_vectors, length) != length(VEP_columns))) {
   stop("Uh-oh. VEP isn't giving me the right numbers of columns in the CSQ field")
}

csq_data <- matrix(nrow = nrow(b$fix), ncol = length(VEP_columns))
colnames(csq_data) <- VEP_columns
for (i in 1:nrow(csq_data)) {
    csq_data[i, 1:ncol(csq_data)] <- list_of_csq_vectors[[i]]
}
csq_data[!nzchar(csq_data)] <- NA
csq_data <- data.frame(csq_data, stringsAsFactors = FALSE)

# merge the CSQ and initial $fix
b$fix <- cbind(b$fix[, colnames(b$fix) != "CSQ"], csq_data)

Fix edited because base R strsplit doesn't behave as it should with empty last columns (it ignores them!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants