-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merqury over-estimates genome completeness #2
Comments
Hi Jessen, Thanks for your thoughtful feedbacks. i) Larger genome size than expected ii) Including ploidy for weighting I'll wrap my head around for this, marking your suggestion as 'enhancement'. For some genomes, it is not so easy to automatically identify the 'diploid' peak and requires manual inspection. I will probably implement this as a separate script, taking the 'diploid' peak as an option. Thanks again! |
Hi @arangrhie, Regarding (ii): My past experience trying to do this in an automated way leads me to agree with you. The methods implemented in GenomeScope2 are the best I've seen in this regard. |
Hi @bredeson, Thanks for sharing your spectra-cn plot! This helped me understand your assembly. As a side note: If you are interested to investigate further on the missing k-mers, you could pull out the reads having those missing k-mers and do a local assembly.
missing.fasta will contain reads having the missing k-mers with >X multiplicity from your spectra-cn plot. The missing sequences could probably coming from the other haplotype, but who knows. And (ii) - I agree GenomeScope2 is a very nice tool in this regard. Combining it with Merqury might be a nice future work. Thanks for your hint. Arang |
Hey,
Great tool, it's very useful and easy to use!
I ran Merqury on an Illumina-based assembly I had and obtained a completeness estimate much larger than I expected, given the genome's known size. I think I found a semantic bug in it's completeness calculation:
When calculating the size of a genome with k-mers, the calculation must take into account the multiplicity of the k-mer as well as its frequency in order to normalize for genomic copy number. As Merqury is currently written, it only sums up the frequency and doesn't take copy number into account (it is basically assuming all k-mers have a copy number of 1 in the genome).
The equation for calculating the fraction of genome assembled is:
where:
G
is the fraction of genome assembled (i.e., the completeness)k_i
is the value at multiplicityi
k_d
is the value of the diploid peak multiplicityf_a(k_i)
is the frequency at multiplicityk_i
in the assemblyf_r(k_i)
is the frequency at multiplicityk_i
in the readsSee
spectra-cn.sh
, lines 132-134:If we let
$dmult
to be the full-depth diploid peak multiplicity, then the above lines could be modified to instead be:I hope this is useful.
Best,
Jessen
The text was updated successfully, but these errors were encountered: