-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mapping simulated T2T reads #11
Comments
I have also tried winnowmap2 and minimap2 on real CHM13 data. I mapped the first 1 million nanopore reads in rel7 to CHM13, counted read start positions in each 100kb window (only the best scoring hit for each read), and collected outlier windows. On this test, winnowmap2 and minimap2 are about the same in terms of the number of outlier windows, though their outliers are sometimes different. I have done a similar experiment with one run of HiFi reads. The result is similar: the number of outlier windows are comparable between winnowmap2 and minimap2. |
@lh3 Hi, |
@yipukangda I guess winnowmap is optimized for recent long reads which have decent accuracy. If you use older nanopore chemistry or older base callers, the high error rate may affect the mapping sensitivity. |
@lh3 Thanks for sharing these results; I will look into it and check whether higher minimizer window size is the culprit. You're right that minimap2 does slightly better using BTW, the |
Correction: I just noticed that winnowmap also outputs secondary alignment just at a much lower frequency. I have updated the results. Now minimap2 and winnowmap racon results are about the same, though htsbox analysis still favors minimap2 a little. Here is a comparison for CHM13 HiFi data. I mapped HiFi reads to T2T v1.0 with minimap2 -ax asm20 --secondary=no -t16 CHM13v1.fa reads.fq > mm.sam
winnowmap -ax map-pb --secondary=no -W CHM13-rep-k15.txt -t16 CHM13v1.fa reads.fq > wm.sam Then I ran racon to generate consensus and mapped the consensus to T2T to call "variants": racon -t16 reads.fq mm.sam CHM13v1.fa > wm.racon.fa
minimap2 -t 24 -cxasm5 -K1.9g --cs -r2k CHM13v1.fa wm.racon.fa > wm.racon.asm5.paf
sort -k6,6 -k8,8n wm.racon.asm5.paf | paftools.js call - > wm.racon.asm5.var 2> wm.racon.asm5.vst Here is the 3041564123 reference bases covered by exactly one contig
3453 substitutions; ts/tv = 0.798
1588 1bp deletions
25392 1bp insertions
461 2bp deletions
3463 2bp insertions
659 [3,50) deletions
1368 [3,50) insertions
29 [50,1000) deletions
38 [50,1000) insertions
0 >=1000 deletions
1 >=1000 insertions and for winnowmap: 3041521256 reference bases covered by exactly one contig
3329 substitutions; ts/tv = 0.775
1659 1bp deletions
25475 1bp insertions
454 2bp deletions
3446 2bp insertions
626 [3,50) deletions
1347 [3,50) insertions
30 [50,1000) deletions
46 [50,1000) insertions
0 >=1000 deletions
1 >=1000 insertions If we assume the T2T reference is perfect, minimap2 is no worse than winnowmap. The process above involves racon and assembly-to-reference alignment. In case these two steps going wrong, I tried a simpler version of variant calling: samtools sort -o wm.bam -@4 -m4g wm.sam
htsbox pileup -vcf CHM13v1.fa -s5 -q5 -Q30 wm.bam > wm.vcf # baseQ>=30, mapQ>=5, at least 5 supporting reads In this analysis, minimap2 also reports fewer "SNPs" and fewer overall "variants", and reports fewer "SNPs" with coverage >=70. |
Thanks! These observations are useful to know. The simulation settings we chose in in Winnowmap2 preprint were different from the above, we wanted to showcase better mappability in presence of non-reference alleles. For this, we had added SVs (using SURVIVOR) in reference before beginning the mapping process. More details are available in Winnowmap2 preprint. Precise commands and software versions used are available in its supplement. |
Thanks for the response. What I did is like sanity check. It doesn't address SVs. On your preprint, I think a basic assumption is that PSVs are fixed in the human population. When PSVs are polymorphic, you will have an event like below: -TTCG-TTCG- (ancestral)
-TTCG-TTXG- (sample)
-ATCG-TTXG- (reference) Here the SV For recent duplications, polymorphic PSVs may happen often with a simple mutation process. For a bit ancient duplications, there can be non-allelic gene conversions that copy sequences between duplications, leading to polymorphic PSVs. I don't know how prevalent such events are, though. What matters is the accuracy on real data. If the goal is SV calling at a higher sensitivity, a simpler strategy is to recompute alignment scores with a better concave gap cost and rank alignments with the new score. If you count an INDEL as one edit regardless of its length, you are more likely to correctly map a read involving an SV. Smith-Waterman may lead to a wrong alignment mostly because its affine-gap cost doesn't properly model the mutation process of SVs. No matter how hard we try, there will always be ambiguity in read mapping within segupds. The ultimate solution is assembly such that we can put SVs in the context of long unique sequences. |
I am closing this thread as it is outdated. I have a new preprint on recent improvements in minimap2 if you are interested. |
TL;DR: on simulated reads, winnowmap seems to produce more mapping errors than minimap2.
I simulated reads with pbsim2:
This supposedly simulates reads similar to nanopore of the current generation. The second command line converts the pbsim2 output to the FASTA format needed by
paftools.js mapeval
. I then mapped these reads to CHM13v1 with the following mappers/settings:Here are the mapeval output for
pbsim-wm-ont.paf
:for
pbsim-wm-pb.paf
:and for
pbsim-mm-ont.paf
:Notably, minimap2 maps slightly more reads and has higher accuracy at the Q60, Q5, Q1 and Q0 mapping quality thresholds.
I guess the large minimizer window size used by winnowmap is affecting its accuracy.
The text was updated successfully, but these errors were encountered: