-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parse CSQ entitities into individual columns #132
Comments
Hi Akshata, thanks for bringing this to may attention! I do not use VEP so I would not have noticed this. However, I really need a minimal reproducible example to address this. I've put some suggestions on how to do this here. If you could share a small part of your data or modify one of the vcfR example data sets so that I can reproduce the issue I could work towards a solution. Thanks! |
Sure. Attached is an example vcf file. I have 2 columns of annotations - 1 from snpEff (ANN) and other from VEP (CSQ). Both of them remain unparsed when I use Other question I have for you is that these 2 columns contain annotations for multiple SNPs of the same gene. Is there a way to parse out the individual SNP annotations? |
Dear Brian, thanks for the excellent tool. Just a heads up that I've hit the same hurdle with VEP's CSQ column staying as one single column. |
FWIW, I overcame the issue by adding the following to my loading: (apologies for the hacky code) #b is tibble
b <- vcfR2tidy(genotypes, info_only = TRUE, verbose = TRUE)
# getting column names
header <- gsub(x = b$meta$Description[b$meta$ID == "CSQ"],
pattern = "Consequence annotations from Ensembl VEP. Format: ",
replacement = "")
VEP_columns <- unlist(stringr::str_split(string = header, pattern = "\\|"))
# getting contents
list_of_csq_vectors <- stringr::str_split(string = b$CSQ, pattern = "\\|")
if (any(lapply(list_of_csq_vectors, length) != length(VEP_columns))) {
stop("Uh-oh. VEP isn't giving me the right numbers of columns in the CSQ field")
}
csq_data <- matrix(nrow = nrow(b$fix), ncol = length(VEP_columns))
colnames(csq_data) <- VEP_columns
for (i in 1:nrow(csq_data)) {
csq_data[i, 1:ncol(csq_data)] <- list_of_csq_vectors[[i]]
}
csq_data[!nzchar(csq_data)] <- NA
csq_data <- data.frame(csq_data, stringsAsFactors = FALSE)
# merge the CSQ and initial $fix
b$fix <- cbind(b$fix[, colnames(b$fix) != "CSQ"], csq_data) Fix edited because base R strsplit doesn't behave as it should with empty last columns (it ignores them!) |
Hello -
I have annotated my VCF files with VEP. I love the vcfR2tidy function which parses out the vcf file into different data frames for visualization and downstream analysis. However, I noticed that the CSQ column (where all the VEP functional annotation is added separated by "|") does not parse out into the individual columns, even though the meta data frame shows the details of all the annotations present in this column. Please advice.
Convert to tidy dataframe
test2 <- vcfR2tidy(vcfv, info_only = TRUE)
names(test)
test2 <- extract_info_tidy(vcfv, info_fields = NULL, info_types = TRUE)
The CSQ column still looks like this :
[3] "-|downstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||||||||||rs773396084|2952|-1||deletion|HGNC|HGNC:28208||2|A2||ENSP00000343864|Q5SV97||UPI000022DAF4||||||||||||4.879e-05|9.151e-05|3.645e-05|0|0|7.49e-05|5.849e-05|0|4.991e-05|9.151e-05|gnomAD_AFR||||||||,-|intron_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000379407|protein_coding||9/14||||||||rs773396084||1||deletion|HGNC|HGNC:25284||1|A2|CCDS53256.1|ENSP00000368717|Q494U1||UPI00005764FF||||||||||||4.879e-05|9.151e-05|3.645e-05|0|0|7.49e-05|5.849e-05|0|4.991e-05|9.151e-05|gnomAD_AFR||||||||,-|intron_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000379409|protein_coding||8/14||||||||rs773396084||1||deletion|HGNC|HGNC:25284||2|A2||ENSP00000368719|Q494U1||UPI0000D61E06||||||||||||4.879e-05|9.151e-05|3.645e-05|0|0|7.49e-05|5.849e-05|0|4.991e-05|9.151e-05|gnomAD_AFR||||||||,-|intron_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000379410|protein_coding||9/15||||||||rs773396084||1||deletion|HGNC|HGNC:25284|YES|1|P3|CCDS4.1|ENSP00000368720|Q494U1||UPI00001416D8||||||||||||4.879e-05|9.151e-05|3.645e-05|0|0|7.49e-05|5.849e-05|0|4.991e-05|9.151e-05|gnomAD_AFR||||||||,-|downstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding||||||||||rs773396084|2953|-1||deletion|HGNC|HGNC:28208|YES|5|P2|CCDS76083.1|ENSP00000414022|Q5SV97||UPI0003E30FA7||||||||||||4.879e-05|9.151e-05|3.645e-05|0|0|7.49e-05|5.849e-05|0|4.991e-05|9.151e-05|gnomAD_AFR||||||||,-|downstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron||||||||||rs773396084|2953|-1||deletion|HGNC|HGNC:28208||1||||||||||||||||||4.879e-05|9.151e-05|3.645e-05|0|0|7.49e-05|5.849e-05|0|4.991e-05|9.151e-05|gnomAD_AFR||||||||,-|downstream_gene_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000480267|retained_intron||||||||||rs773396084|729|1||deletion|HGNC|HGNC:25284||3||||||||||||||||||4.879e-05|9.151e-05|3.645e-05|0|0|7.49e-05|5.849e-05|0|4.991e-05|9.151e-05|gnomAD_AFR||||||||,-|upstream_gene_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000491024|protein_coding||||||||||rs773396084|1260|1|cds_start_NF|deletion|HGNC|HGNC:25284||3|||ENSP00000462558||J3KSM5|UPI000268AE1F||||||||||||4.879e-05|9.151e-05|3.645e-05|0|0|7.49e-05|5.849e-05|0|4.991e-05|9.151e-05|gnomAD_AFR||||||||"
The text was updated successfully, but these errors were encountered: