vcfR2DNAbin() - problem with DNAbin list type #122

fdchevalier · 2018-12-21T19:31:10Z

Hi Brian,

I found an unexpected behavior of vcfR2DNAbin(). I would not qualify this as a bug per se because it is coming from a particular behavior of a DNAbin object created by one function from the ape package (but I don't know if this is a bug in this function as I don't know the specifications of the DNAbin class). Anyway I found a solution for vcfR to handle all cases.

The unexpected behavior

I wanted to include my variants in a reference sequence using vcfR2DNAbin(). This reference sequence was extracted from a reference genome imported using read.dna() from the ape package. The result produced by vcfR2DNAbin() contained a sequence of the right length but with NA for all the invariant bases instead of the expected reference bases. When using the example of the vcfR manual, everything worked perfectly fine.

The explanation

The behavior occurs only when the sequence is stored as a list in the DNAbin object. The list is used when sequences of different lengths are loaded in the same DNAbin object which is the case of my reference genome. A sequence extracted from the reference genome will inherit the list type. But the behavior also depends on what function is used for creating the list: if as.DNAbin() is used, everything works fine with vcfR2DNAbin() but if read.dna() is used, the behavior occurs.

Here is a reproducible example inspired from vcfR manual:

library(ape)
data(vcfR-test)

set.seed(9)

vcfR_test@fix[,'POS'] <- sort(sample(10:20, size=length(getPOS(vcfR_test))))

# Reference sequence
nucs <- c('a','c','g','t')
mySeq <- nucs[round(runif(n=20, min=0.5, max=4.5))]

myRef.mat <- matrix(mySeq, nrow=1)
rownames(myRef.mat) <- "test"

myRef.ls <- list(mySeq)
names(myRef.ls) <- "test"

## Fasta file for read.dna()
cat(">test",
    paste(mySeq, collapse=""),
    file = "exdna.fas", sep="\n")

# Loading reference sequence as matrix
myRef.db <- as.DNAbin(myRef.mat)
myRef.rd <- read.dna("exdna.fas", format="fasta")

# Do both methods produce the same object?
identical(myRef.db, myRef.rd)
# TRUE

# Loading reference sequence as list
myRef.db <- as.DNAbin(myRef.ls)
myRef.rd <- read.dna("exdna.fas", format="fasta", as.matrix=FALSE)

# Do both methods produce the same object?
identical(myRef.db, myRef.rd)
# FALSE

# Results from vcfR2DNA
myDNA.db <- vcfR2DNAbin(vcfR_test, ref.seq=myRef.db)
myDNA.rd <- vcfR2DNAbin(vcfR_test, ref.seq=myRef.rd)

as.character(myDNA.db)
as.character(myDNA.rd)

The problem comes from the way the data is accessible from each object. This has a big impact on vcfR2DNAbin(). The data loaded with as.DNAbin() is never directly accessible (myRef.db[[1]]) contrary to the data loaded with read.dna() (myRef.rd[[1]]). So when as.character() is used (line 262 of vcfR2DNAbin()), the binary code is translated in nucleotides in the first case but it is kept as is in the second case. When vcfR2DNAbin() recreates a DNAbin object, in the first case, it translates back the nucleotides into binary code but in the second case, as the binary code is already present, it translates the binary numbers into NA because they do not correspond to any nucleotides.

Here is an illustration:

as.character(myRef.db[[1]])
as.character(myRef.rd[[1]])

The remedy

As I wrote, it might be a bug from read.dna() but I found a work around to accommodate any cases when a DNAbin object is of list type. The first step is is to convert the first item of the list to a sequence without trying to access the item itself (so the sequence will still be of list type), unlist the sequence and store the vector for later use. This implies also to modify the if statement of the is.matrix(ref.seq) to obtain the sequence and to modify the way the matrix of reference sequences is created.

Here are the modifications proposed:

# Replace lines 159-162 with this block
if (is.list(ref.seq)) {
    ref.seq <- unlist(as.character(ref.seq[1]), use.names=FALSE)
}

# Replace lines 163-166 with this block
if (is.matrix(ref.seq)) {
	ref.seq <- ref.seq[1, ]
	ref.seq <- ref.seq[1:ncol(ref.seq)]
	ref.seq <- as.character(ref.seq)
}

# Replace line 262 with this line
x <- matrix(ref.seq, nrow = length(ref.seq),

Fred

The text was updated successfully, but these errors were encountered:

knausb · 2018-12-22T18:06:13Z

Hi @fdchevalier, thank you for the post and thank you for taking the time to explain the situation and propose a solution! I am currently on holiday, so I won't be able to make much progress on this now. But I'll leave this open so I can revisit it in the new year. Happy holidays!

fdchevalier · 2018-12-22T18:13:53Z

Hi Brian,

No worries, this can wait next year. I have already made the modifications on the function I use anyway. Enjoy your vacations.

Happy holidays!

fdchevalier mentioned this issue Mar 16, 2019

vcfR2DNAbin() - no sequence label whith DNAbin list type #127

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vcfR2DNAbin() - problem with DNAbin list type #122

vcfR2DNAbin() - problem with DNAbin list type #122

fdchevalier commented Dec 21, 2018

knausb commented Dec 22, 2018

fdchevalier commented Dec 22, 2018

vcfR2DNAbin() - problem with DNAbin list type #122

vcfR2DNAbin() - problem with DNAbin list type #122

Comments

fdchevalier commented Dec 21, 2018

The unexpected behavior

The explanation

The remedy

knausb commented Dec 22, 2018

fdchevalier commented Dec 22, 2018