Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

new feature: separation of annotation columns in tidy dataframes from VCFs annotated by snpEff or VEP tools #201

Closed
wants to merge 2 commits into from

Conversation

gorgitko
Copy link

@gorgitko gorgitko commented Oct 7, 2022

Hi,

this is an extension to the vcfR2tidy() function. It allows to separate annotation columns in tidy dataframes coming from VCFs annotated by snpEff or VEP tools (but it can generalize to any other annotation format). Annotations generated by these tools are following the Variant annotations in VCF format specification.


Let's take a quick example of VEP-annotated VCF imported by {vcfR} (available as package dataset):

library(vcfR)
library(magrittr)
data("vcfR_test_VEP")
vcf_tidy <- vcfR2tidy(vcfR_test_VEP)

Now in vcf_tidy$meta there is a INFO row with ID "CSQ" containing information about annotation fields:

dplyr::filter(vcf_tidy$meta, ID == "CSQ")
# A tibble: 1 × 5
  Tag   ID    Number Type   Description                                                                                                                               
  <chr> <chr> <chr>  <chr>  <chr>                                                                                                                                     
1 INFO  CSQ   .      String Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HG

And the annotation data are stored in the CSQ column in vcf_tidy$fix:

dplyr::select(vcf_tidy$fix, ID, CSQ) %>% head()
# A tibble: 6 × 2
  ID          CSQ                                                                                                                                                     
  <chr>       <chr>                                                                                                                                                   
1 rs3011926   G|splice_donor_region_variant&intron_variant|LOW|CAMTA1|ENSG00000171735|Transcript|ENST00000303635|protein_coding||11/22||||||||rs3011926&COSV57887793|2 NA          -|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSR00000250867|promoter||||||||||||||deletion|||||||||||||||||||||||||||||||||||||||||||||||,…
3 NA          CCCCCCCCC|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSR00000250867|promoter||||||||||||||insertion|||||||||||||||||||||||||||||||||||||||4 rs560787659 C|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSR00000250867|promoter||||||||||rs560787659||||SNV|||||||||||||||||||0.0084|0.0106|0.0058|0.5 NA          C|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSR00000250867|promoter||||||||||||||SNV|||||||||||||||||||||||||||||||||||||||||||||||,C|ups6 NA          C|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSR00000250867|promoter||||||||||||||SNV|||||||||||||||||||||||||||||||||||||||||||||||,C|mis

Now using the functions in the PR, we can separate these annotations:

  • Row-wise: a single row (variant) can have multiple annotations separated by comma.
  • Column-wise: separate each annotation into new columns whose names are defined in vcf_tidy$meta.
vcf_tidy_sep <- separate_ann_vep(vcf_tidy$fix, vcf_tidy$meta)
dplyr::select(vcf_tidy_sep, 1:6, CSQ, allele, consequence, impact) %>% head()
# A tibble: 6 × 10
  ChromKey CHROM      POS ID        REF   ALT        CSQ                                                                                         allele conse…¹ impact
     <int> <chr>    <int> <chr>     <chr> <chr>      <chr>                                                                                       <chr>  <chr>   <chr> 
1        1 chr1   7677739 rs3011926 A     G          G|splice_donor_region_variant&intron_variant|LOW|CAMTA1|ENSG00000171735|Transcript|ENST000G      spliceLOW   
2        1 chr1  26697265 NA        CAG   C          -|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSR00000250867|promoter|||||||||-      regulaMODIF3        1 chr1  26697265 NA        CAG   C          -|upstream_gene_variant|MODIFIER||ENSG00000260063|Transcript|ENST00000569378|lncRNA|||||||-      upstreMODIF4        1 chr1  26697265 NA        CAG   C          -|frameshift_variant|HIGH|ARID1A|ENSG00000117713|Transcript|ENST00000324856|protein_coding-      framesHIGH  
5        1 chr1  26697267 NA        G     GCCCCCCCCC CCCCCCCCC|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSR00000250867|promoter|CCCCCregulaMODIF6        1 chr1  26697267 NA        G     GCCCCCCCCC CCCCCCCCC|upstream_gene_variant|MODIFIER||ENSG00000260063|Transcript|ENST00000569378|lncRNCCCCCupstreMODIF# … with abbreviated variable name ¹​consequence

I have included two new test datasets (vcfR_test_snpEff and vcfR_test_VEP), written a proper (:crossed_fingers:) documentation of new functions, and included some basic unit tests. R CMD CHECK passed both locally and in the GHA workflow here.

I hope this extension can serve well.

P.S. Sorry for some modifications in code formatting, but I have my RStudio configured to strip trailing horizontal whitespaces on save.

@gorgitko
Copy link
Author

btw. this PR is solving #132 and #188

@knausb
Copy link
Owner

knausb commented Oct 17, 2022

Hi @gorgitko , thank you for putting this together. At the present I am going to pass on this PR. There appear to be numerous issues here. Perhaps the most major issue is that while I am aware of 'VEP' and 'SNPeff' I do not currently use them. When you make a PR you're not simply asking for your code to be incorporated into another project, you're asking the package maintainer to maintain this code. Because I do not use this data I really do not know anything about its format or contents. This makes me a poor candidate to maintain this.

You mentioned an issue with how white space is handled by you computer. This appears to be a substantial issue because the diff viewer flags these lines. This means that you PR appears to have many false positives in the diff viewer. This makes it very difficult for me to review this PR. You've also made your changes to the 'master' branch. This presents a challenge to reviewing this because I would consider it unadvisable to merge a PR from some stranger directly into the master/main branch. I have tried to clone your PR into a separate instance, but this has added to the work of trying to figure out which lines may actually be changes as well as me time in trying to figure out what the format of these 'annotations' may be. I feel I've made a good faith effort here and I don't see what you're trying to do, and every time I try I seem to find more issues. For example, why do we need 6 additional package dependencies just to make a table? I suspect there may be more here then I realize, but I'm feeling like every new door I open I discover three new doors, instead of the information I need.

Would it be more positive to take this conversation in a direction where your code could be used as the beginning of a new package? That way someone with a knowledge of the formats of these specifications can maintain it?

@gorgitko
Copy link
Author

Hi @knausb, thanks for looking at this PR.

I understand your concerns about the future maintenance of my code. My motivation to introduce this feature was both my personal need to import such annotated VCFs and at the same time to help others as, regarding the two issues above, I am not alone. I thought it might be a nice addition to the current tidy data-related functions in vcfR.

Just to clarify that unfortunate thing with changes in newlines, I am used to somehow standardize my code, and this is one of its manifestations (stripping of trailing horizontal whitespaces). I also like to use styleR - "non-invasive pretty printing of R code". It makes collaboration much easier as everyone is using the same code style (which can be arbitrarily set by the package author/maintainer). Also, it can be also automatically run on new PRs.

Would it be more positive to take this conversation in a direction where your code could be used as the beginning of a new package? That way someone with a knowledge of the formats of these specifications can maintain it?

Yes, that would make sense. I would try to make a tiny package and let you know here when it's ready. Then I would be grateful if my package could be added to the vcfR's README 🙂

@gorgitko gorgitko closed this Oct 18, 2022
@knausb
Copy link
Owner

knausb commented Oct 25, 2022

Hi @gorgitko , your perspectives on style here have prevented you from communicating the changes that you have attempted to propose. This is because you have defeated the purpose of the diff viewer, which is supposed to help reviewers find the changes you propose. From my perspective, I invested some time in trying to review your code. When I looked at the diff viewer I saw a message that told me that there were so many changes that they would not be rendered for some pages. I decided to take a deep breath and start from the top. The first change that I saw was one of these 'false positives' that appear to be created by your white space edits. I had to look at it for a bit because I did not see any changes (because it was white space changes). After looking at this a bit I realized the issue was that it was a white space change. This means no actionable item, so I moved on to the next, which was also white space edits. After looking at a few of these I realized that it was not worth my time to try and figure out how much of this PR was white space changes. After I invested some of my time trying to review this I still had no idea of what you were proposing. I never got to a single line of code that included a substantive change. I simply determined that I was not making any progress because there were so many white space edits that I could not find any real changes. I feel a goal when proposing changes to other projects is to communicate your changes as simply and clearly as possible. I never got to any real code changes because of all the white space edits and instead decided that it was not worth my time.

To my knowledge there is no minimal package size for CRAN, so you should not feel apologetic for this. If the package is too large it will throw a note during CRAN checks. vcfR currently throws this note, which is another reason I feel I need to be conservative about additions. Please let me know when you have a package and I will link it.

Good luck!
Brian

@zahrahaider
Copy link

Hi,
Did the separate_ann_vep function make it to the vcfR package? How can we use it or which package did it end up in?

@gorgitko
Copy link
Author

Hi @zahrahaider, you can install my fork using remotes::install_github("gorgitko/vcfR")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants