You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I found an unexpected behavior of vcfR2DNAbin(). I would not qualify this as a bug per se because it is coming from a particular behavior of a DNAbin object created by one function from the ape package (but I don't know if this is a bug in this function as I don't know the specifications of the DNAbin class). Anyway I found a solution for vcfR to handle all cases.
The unexpected behavior
I wanted to include my variants in a reference sequence using vcfR2DNAbin(). This reference sequence was extracted from a reference genome imported using read.dna() from the ape package. The result produced by vcfR2DNAbin() contained a sequence of the right length but with NA for all the invariant bases instead of the expected reference bases. When using the example of the vcfR manual, everything worked perfectly fine.
The explanation
The behavior occurs only when the sequence is stored as a list in the DNAbin object. The list is used when sequences of different lengths are loaded in the same DNAbin object which is the case of my reference genome. A sequence extracted from the reference genome will inherit the list type. But the behavior also depends on what function is used for creating the list: if as.DNAbin() is used, everything works fine with vcfR2DNAbin() but if read.dna() is used, the behavior occurs.
Here is a reproducible example inspired from vcfR manual:
library(ape)
data(vcfR-test)
set.seed(9)
vcfR_test@fix[,'POS'] <- sort(sample(10:20, size=length(getPOS(vcfR_test))))
# Reference sequencenucs<- c('a','c','g','t')
mySeq<-nucs[round(runif(n=20, min=0.5, max=4.5))]
myRef.mat<-matrix(mySeq, nrow=1)
rownames(myRef.mat) <-"test"myRef.ls<-list(mySeq)
names(myRef.ls) <-"test"## Fasta file for read.dna()
cat(">test",
paste(mySeq, collapse=""),
file="exdna.fas", sep="\n")
# Loading reference sequence as matrixmyRef.db<- as.DNAbin(myRef.mat)
myRef.rd<- read.dna("exdna.fas", format="fasta")
# Do both methods produce the same object?
identical(myRef.db, myRef.rd)
# TRUE# Loading reference sequence as listmyRef.db<- as.DNAbin(myRef.ls)
myRef.rd<- read.dna("exdna.fas", format="fasta", as.matrix=FALSE)
# Do both methods produce the same object?
identical(myRef.db, myRef.rd)
# FALSE# Results from vcfR2DNAmyDNA.db<- vcfR2DNAbin(vcfR_test, ref.seq=myRef.db)
myDNA.rd<- vcfR2DNAbin(vcfR_test, ref.seq=myRef.rd)
as.character(myDNA.db)
as.character(myDNA.rd)
The problem comes from the way the data is accessible from each object. This has a big impact on vcfR2DNAbin(). The data loaded with as.DNAbin() is never directly accessible (myRef.db[[1]]) contrary to the data loaded with read.dna() (myRef.rd[[1]]). So when as.character() is used (line 262 of vcfR2DNAbin()), the binary code is translated in nucleotides in the first case but it is kept as is in the second case. When vcfR2DNAbin() recreates a DNAbin object, in the first case, it translates back the nucleotides into binary code but in the second case, as the binary code is already present, it translates the binary numbers into NA because they do not correspond to any nucleotides.
As I wrote, it might be a bug from read.dna() but I found a work around to accommodate any cases when a DNAbin object is of list type. The first step is is to convert the first item of the list to a sequence without trying to access the item itself (so the sequence will still be of list type), unlist the sequence and store the vector for later use. This implies also to modify the if statement of the is.matrix(ref.seq) to obtain the sequence and to modify the way the matrix of reference sequences is created.
Here are the modifications proposed:
# Replace lines 159-162 with this blockif (is.list(ref.seq)) {
ref.seq<- unlist(as.character(ref.seq[1]), use.names=FALSE)
}
# Replace lines 163-166 with this blockif (is.matrix(ref.seq)) {
ref.seq<-ref.seq[1, ]
ref.seq<-ref.seq[1:ncol(ref.seq)]
ref.seq<- as.character(ref.seq)
}
# Replace line 262 with this linex<-matrix(ref.seq, nrow= length(ref.seq),
Fred
The text was updated successfully, but these errors were encountered:
Hi @fdchevalier, thank you for the post and thank you for taking the time to explain the situation and propose a solution! I am currently on holiday, so I won't be able to make much progress on this now. But I'll leave this open so I can revisit it in the new year. Happy holidays!
Hi Brian,
I found an unexpected behavior of
vcfR2DNAbin()
. I would not qualify this as a bug per se because it is coming from a particular behavior of a DNAbin object created by one function from theape
package (but I don't know if this is a bug in this function as I don't know the specifications of the DNAbin class). Anyway I found a solution forvcfR
to handle all cases.The unexpected behavior
I wanted to include my variants in a reference sequence using
vcfR2DNAbin()
. This reference sequence was extracted from a reference genome imported usingread.dna()
from theape
package. The result produced byvcfR2DNAbin()
contained a sequence of the right length but withNA
for all the invariant bases instead of the expected reference bases. When using the example of thevcfR
manual, everything worked perfectly fine.The explanation
The behavior occurs only when the sequence is stored as a list in the DNAbin object. The list is used when sequences of different lengths are loaded in the same DNAbin object which is the case of my reference genome. A sequence extracted from the reference genome will inherit the list type. But the behavior also depends on what function is used for creating the list: if
as.DNAbin()
is used, everything works fine withvcfR2DNAbin()
but ifread.dna()
is used, the behavior occurs.Here is a reproducible example inspired from
vcfR
manual:The problem comes from the way the data is accessible from each object. This has a big impact on
vcfR2DNAbin()
. The data loaded withas.DNAbin()
is never directly accessible (myRef.db[[1]]
) contrary to the data loaded withread.dna()
(myRef.rd[[1]]
). So whenas.character()
is used (line 262 ofvcfR2DNAbin()
), the binary code is translated in nucleotides in the first case but it is kept as is in the second case. WhenvcfR2DNAbin()
recreates a DNAbin object, in the first case, it translates back the nucleotides into binary code but in the second case, as the binary code is already present, it translates the binary numbers intoNA
because they do not correspond to any nucleotides.Here is an illustration:
The remedy
As I wrote, it might be a bug from
read.dna()
but I found a work around to accommodate any cases when a DNAbin object is of list type. The first step is is to convert the first item of the list to a sequence without trying to access the item itself (so the sequence will still be of list type), unlist the sequence and store the vector for later use. This implies also to modify theif
statement of theis.matrix(ref.seq)
to obtain the sequence and to modify the way the matrix of reference sequences is created.Here are the modifications proposed:
Fred
The text was updated successfully, but these errors were encountered: