Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Understanding genetic_diff() number of alleles output #185

Open
EveTC opened this issue Jun 11, 2021 · 4 comments
Open

Understanding genetic_diff() number of alleles output #185

EveTC opened this issue Jun 11, 2021 · 4 comments

Comments

@EveTC
Copy link

EveTC commented Jun 11, 2021

Hello

I am somewhat confused by the output by geneitc_diff(), in particular the number of alleles in each population.

Given the example data below:

library(vcfR)
data(vcfR_example)

myPops <- as.factor(rep(c('a','b'), each = 9))
myDiff <- genetic_diff(vcf, myPops, method = "nei")

head(myDiff)

Output:

CHROM POS Hs_a Hs_b Ht n_a n_b Gst Htmax Gstmax Gprimest
Supercontig_1.50 2 0.0000000 0.1800000 0.07986111 14 10 0.06086957 0.5173611 0.8550336 0.07118968
Supercontig_1.50 246 0.4200000 0.2777778 0.35123967 10 12 0.02509804 0.6652893 0.4853002 0.05171652
Supercontig_1.50 549 0.5000000 0.0000000 0.44444444 4 2 0.25000000 0.6666667 0.5000000 0.50000000
Supercontig_1.50 668 0.4444444 0.0000000 0.50000000 12 4 0.33333333 0.6250000 0.4666667 0.71428571
Supercontig_1.50 765 0.0000000 0.2187500 0.11072664 18 16 0.07031250 0.5467128 0.8117089 0.08662281
Supercontig_1.50 780 0.0000000 0.2448980 0.12444444 16 14 0.08163265 0.5511111 0.7926267 0.10299003

To get more information about the oubject vcf I converted to a genInd object.

GenInd <- vcfR2genind(vcf)
GenInd

/// GENIND OBJECT /////////

 // 18 individuals; 2,533 loci; 5,083 alleles; size: 1.9 Mb

 // Basic content
   @tab:  18 x 5083 matrix of allele counts
   @loc.n.all: number of alleles per locus (range: 1-5)
   @loc.fac: locus factor for the 5083 columns of @tab
   @all.names: list of allele names for each locus
   @ploidy: ploidy of each individual  (range: 2-2)
   @type:  codom
   @call: adegenet::df2genind(X = t(x), sep = sep)

 // Optional content
   - empty -

# What is range of number of alleles
n_perPop <- as.data.frame(t(myDiff[,c(6:7)]))
n_perPop$Site <- gsub("n_", "", rownames(n_perPop))
nPerpop.m <- reshape2::melt(n_perPop, id.vars = "Site")
colnames(nPerpop.m) <- c("Site", "SNP_number", "n_alleles")
head(nPerpop.m)
range(nPerpop.m$n_alleles)
#0-18

So from this additional information I can see that there is a range of 1-5 alleles per locus and that the data is diploid.
Therefore, from my understanding Supercontig_1.50:549 (CHROM:POS) has 2 alleles and it is 4 because (2x2(diploid)=4). Is this description correct?

If so, I am confused with my vcf output

GenInd
/// GENIND OBJECT /////////

 // 761 individuals; 3,857 loci; 7,714 alleles; size: 24.8 Mb

 // Basic content
   @tab:  761 x 7714 matrix of allele counts
   @loc.n.all: number of alleles per locus (range: 2-2)
   @loc.fac: locus factor for the 7714 columns of @tab
   @all.names: list of allele names for each locus
   @ploidy: ploidy of each individual  (range: 2-2)
   @type:  codom
   @call: adegenet::df2genind(X = t(x), sep = sep)

 // Optional content
   @pop: population of each individual (group size range: 5-20)

# Get range of n for my vcf
n_perPop <- as.data.frame(t(myDiff[,c(49:93)]))
n_perPop$Site <- gsub("n_", "", rownames(n_perPop))
nPerpop.m <- melt(n_perPop, id.vars = "Site")
colnames(nPerpop.m) <- c("Site", "SNP_number", "n_alleles")

range(nPerpop.m$n_alleles)
# 0 40

If my vcf has a max range of 2 alleles per locus and is a diploid organism then surely the max number of alleles should be 4?? However the range of number of alleles is 0-40??

My appologies if this is a simplistic question, I am new to this sort of analysis.
Any advice to shed light on this would be greatly appreciated.

Thank you

@knausb
Copy link
Owner

knausb commented Jun 11, 2021

Hi @EveTC , I suspect the issue is how missing data are handled. I've built the following example based on your above code.

> library(vcfR)
> data(vcfR_example)
> vcf
***** Object of Class vcfR *****
18 samples
1 CHROMs
2,533 variants
Object size: 3.2 Mb
8.497 percent missing data
*****        *****         *****
> getPOS(vcf)[1:10]
 [1]    2  246  549  668  765  780  989 1670 1692 1775
> gt <- extract.gt(vcf)
> table(gt[3, ])

0|0 1|1 
  2   1 

We see that position (POS) 549 is the third variant in the file, so we can extract the genotypes and query variant (row) three). We see that there are two, phased, diploid genotypes. They are diploid because there are two integer alleles for each genotype. The genotypes are phased because the alleles are delimited with "|" instead of "/". We see a total of three genotypes even though we have 18 samples. This means the other samples were missing genotypes. Because we have 2 pf the 0|0 genotype we have a total of 4 of the 0 allele. Because we have one of the 1|1 genotypes we have a total of two of the 1 allele. I believe adegenet handles missing data as another allelic state. But I suggest you consult it's documentation. How to handle missing data is one of those important details that's easy to forget to pay attention to.

Does that make sense?
Brian

@EveTC
Copy link
Author

EveTC commented Jun 14, 2021

Hi Brian,

Thank you for your explanation, I think I follow what you are saying.

How does vcfR handle missing data when calculating Hs in the genetic_diff()? How do I know that the missing data has been read in correctly?

Thanks again,
Eve

@knausb
Copy link
Owner

knausb commented Jun 15, 2021

Hi @EveTC , vcfR::genetic_diff() simply ignores missing data. That's why the number of alleles is different for some variants even though the sample size is constant. This can be validated by looking at the "n_*" columns. Does this address your question?

@EveTC
Copy link
Author

EveTC commented Jun 16, 2021

Hi Brian,

Yes it does - thank you for your explanation. I have a better understanding now.
Eve

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants