Error: ID column contains non-unique names #193

Open
bathycy opened this issue Dec 17, 2021 · 1 comment
Comments


bathycy commented Dec 17, 2021

When I run vcfR on a VCF file, I keep hitting the same error whenever I try to extract the GT field from the genotype section:

Error in extract.gt(x = vcf, element = format_fields[i], as.numeric = coerce_numeric[i]) : ID column contains non-unique names

When I call head() on the object it looks fine, but I can't seem to run any other commands on it. Can you help me with this? Here is the head() output:
[1] "***** Object of class 'vcfR' *****"
[1] "***** Meta section *****"
[1] "##fileformat=VCFv4.1"
[1] "##FILTER=<ID=PASS,Description="All filters passed">"
[1] "##filedate=2019.12.2"
[1] "##source=Minimac3"
[1] "##contig=<ID=1>"
[1] "##FILTER=<ID=GENOTYPED,Description="Marker was genotyped AND imputed">"
[1] "First 6 rows."
[1]
[1] "***** Fixed section *****"
CHROM POS ID REF ALT QUAL FILTER
[1,] "8" "11740" "rs531589080" "G" "A" NA "PASS"
[2,] "8" "11774" "rs143233250" "A" "T" NA "PASS"
[3,] "8" "11788" "rs564896271" "C" "T" NA "PASS"
[4,] "8" "11789" "rs527808609" "G" "A" NA "PASS"
[5,] "8" "11816" "rs75979472" "T" "C" NA "PASS"
[6,] "8" "11879" "rs536257851" "A" "G" NA "PASS"
[1]
[1] "***** Genotype section *****"
FORMAT dnl407754_icv
[1,] "GT:DS" "0|0:0.002"
[2,] "GT:DS" "1|1:1.208"
[3,] "GT:DS" "0|0:0.019"
[4,] "GT:DS" "0|0:0.007"
[5,] "GT:DS" "1|1:1.232"
[6,] "GT:DS" "0|0:0.009"
[1]
[1] "Unique GT formats:"
[1] "GT:DS"

I would upload the file, but the file type isn't supported here.

Owner

knausb commented Dec 17, 2021

Hi @bathycy, the VCF specification v4.3 (section 1.6.1, subsection "3. ID") states that the ID column should contain 'unique identifiers' for each variant, when available. I suspect the reason for your error is that your data includes non-unique values in the ID column. This can be addressed as follows.

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.12.0.9999 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
#?vcfR
data("vcfR_test")
vcfR_test
#> ***** Object of Class vcfR *****
#> 3 samples
#> 1 CHROMs
#> 5 variants
#> Object size: 0 Mb
#> 0 percent missing data
#> *****        *****         *****

myID <- getID(vcfR_test)
length(unique(myID, incomparables = NA)) == length(myID)
#> [1] TRUE

vcf2 <- rbind2(vcfR_test, vcfR_test[1,])
vcf2
#> ***** Object of Class vcfR *****
#> 3 samples
#> 1 CHROMs
#> 6 variants
#> Object size: 0 Mb
#> 0 percent missing data
#> *****        *****         *****
myID <- getID(vcf2)
length(unique(myID, incomparables = NA)) == length(myID)
#> [1] FALSE

vcf3 <- vcf2[!duplicated(myID, incomparables = NA), ]
myID <- getID(vcf3)
length(unique(myID, incomparables = NA)) == length(myID)
#> [1] TRUE

Created on 2021-12-17 by the reprex package (v2.0.1)

Here I've loaded an example data set and validated that the ID column is unique. Note that missing values (NA in R) are valid, so they are handled here as 'incomparables'. I've then used rbind2() to add a non-unique variant and tested again to show that the ID column is now non-unique. The simplest path may be to omit the non-unique variants using the duplicated() function, as I have demonstrated. If you feel these duplicated variants are valuable, you may instead want to develop a workflow that identifies them and makes their IDs unique somehow, such as by adding a suffix (e.g., 1, 2, 3, or a, b, c, ...).
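As a rough sketch of the suffix idea, the base R helper below appends an occurrence number to every ID that appears more than once and leaves missing IDs (NA) untouched. The name `make_ids_unique()` and the `_` separator are assumptions chosen for illustration; this is not a vcfR function. The commented lines at the end assume the IDs live in the `ID` column of the object's `fix` slot, so check that against your own object before relying on it.

```r
# Hypothetical helper (not part of vcfR): append _1, _2, ... to repeated IDs.
# Missing IDs (NA) are left as they are, since NA is a valid "no ID" value.
make_ids_unique <- function(ids, sep = "_") {
  ids <- as.character(ids)
  ok <- !is.na(ids)
  known <- ids[ok]

  # Occurrence number of each ID within its group of identical values
  occurrence <- ave(seq_along(known), known, FUN = seq_along)

  # Only IDs that occur more than once receive a suffix
  repeated <- known %in% known[duplicated(known)]
  known[repeated] <- paste(known[repeated], occurrence[repeated], sep = sep)

  ids[ok] <- known
  ids
}

# Toy example with made-up IDs
ids <- c("rs1", NA, "rs2", "rs1", NA, "rs1")

# Which IDs are repeated, and how often?
id_counts <- table(ids)
id_counts[id_counts > 1]

new_ids <- make_ids_unique(ids)
new_ids
# "rs1_1" NA "rs2" "rs1_2" NA "rs1_3"

# A suffixed ID could in principle collide with an ID that already exists,
# so re-check uniqueness afterwards (0 means no duplicates were found).
anyDuplicated(new_ids, incomparables = NA)

# Assumed way to write the IDs back into a vcfR object named 'vcf':
# library(vcfR)
# vcf@fix[, "ID"] <- make_ids_unique(getID(vcf))
```

Compared with dropping rows via duplicated(), this keeps every variant, at the cost of IDs that no longer match their original database identifiers exactly.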

Please let me know if this resolves your issue. Thanks!
Brian
