Low CPU usage for nanopore data #771
Because of how minimap2 creates batches, each batch takes as long as its longest read takes to map. This is particularly poorly suited to HMW nanopore sequencing, since its read-length distribution has a long right tail. This is effectively the worst case for CPU efficiency: the vast majority of each batch finishes mapping long before the longest read does. My latest (winnowmap) run took 38 hrs to map. The average CPU usage was 25% (up to the beginning of sorting), and the cost was $138 on the lowest-memory AWS instance that supports 96 threads.
Can we have an option to sort batches by length, or use dynamic load balancing?
Comments
The minimap2 behavior is the result of requiring reproducible output under limited memory. Better multithreading efficiency would either require unstable memory usage or generate output in random order. If you have a lot of memory, use a large -K (e.g. -K4g).
This was using winnowmap, for what it's worth. I take your point about winnowmap vs minimap though. Is there an experiment you can share that shows this result?
The order is not "immediately destroyed" with stable sorting. Exact reproducibility is an important requirement in government-approved pipelines and in many production pipelines. Unstable memory is a big concern in a cluster setting: with maximum thread efficiency and stable output order, minimap2 could use >100GB memory in rare cases, and this is unpredictable. That would cause many more problems. In general, there is no solution that is perfect in all scenarios. As I said, using ~16 vCPUs will be more cost-effective. Minimap2 v2.18+ also supports a large -K, which may have a big effect on huge datasets. On simulated nanopore results, see this comment. On highly repetitive regions, minimap2 and winnowmap have similar evenness of coverage on CHM13 T2T. If winnowmap hasn't incorporated recent changes in minimap2 v2.20, minimap2 will be more sensitive at aligning kb-long indels.
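For concreteness, the suggested settings might look like the following; -t and -K are real minimap2 options, but the specific values here are illustrative:

```sh
# A large -K loads more bases per mini-batch, improving thread utilization
# at the cost of higher (but still deterministic) peak memory.
minimap2 -t 16 -K 4g -x map-ont ref.fa reads.fq.gz > out.paf
```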
It's a fair point that determinism is critical in some contexts. That being said, for most of the cases where I use minimap, cost and time efficiency are what I care about. If you would prefer not to add a dynamic batching option, could you recommend a place in the code where we might fork the repo and make this change for our own internal use? Separately, if the input reads were sorted by length, that would maintain reproducibility/determinism and mitigate the issue where one long read dominates the runtime of a batch (see the sketch below). If batches are determined by total read size instead of read count (maybe this is not the case), my instinct is that a memory explosion wouldn't happen. Does that seem right, or am I missing something? I don't know of a tool that does this, but if there were one, it might resolve this.
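One way to prototype the length-sorting idea without touching minimap2, assuming seqkit is available (seqkit is a separate tool, not part of minimap2):

```sh
# Sort reads longest-first so reads of similar length land in the same
# batch; note that seqkit sort holds the reads in memory while sorting.
seqkit sort --by-length --reverse reads.fq.gz \
  | minimap2 -t 16 -x map-ont ref.fa - > out.paf
```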
I'll just add that the smallest RAM size for any AWS instance with 64 or more threads is already 128GB. For 96 threads, the cost of doubling the memory is only a 30% increase in hourly price.
I encourage you to try a large -K.
PS: if you use gzipped input, you can stream it through a pipe:
zcat reads.fq.gz | minimap2 -cxmap-ont ref.fa - > output.paf
We'll try this out for our next few runs and see how it goes. Thanks!
Thanks. Lowering peak memory might be easier. On the dev branch, which will be merged into master, I added a new option, --cap-kalloc.
PS: I compared the minimap2 logs with and without --cap-kalloc.
You can see the peak memory is reduced from 12.529 GB to 10.876 GB. The difference will be larger with more threads and a larger batch size. The md5sum shows that the output is identical with or without this option.
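For anyone following along, the comparison described above can be reproduced along these lines; the --cap-kalloc value and file names are illustrative:

```sh
minimap2 -t 16 -x map-ont ref.fa reads.fq > a.paf 2> a.log
minimap2 -t 16 -x map-ont --cap-kalloc 500m ref.fa reads.fq > b.paf 2> b.log
md5sum a.paf b.paf   # identical digests: the option does not change the output
```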
What does it mean to reset the thread-local storage? Does it free whatever memory was used for previous jobs within the batch and begin reallocating? Could it be that it was faster because there is less memory fragmentation?
It was not really faster. Probably just run-to-run difference. Sorry for that. But still, on this small example, adding the option reduces memory without affecting performance. I wonder how it works on large real datasets.
Minimap2 manages thread-local allocation with kalloc (e.g. the SW matrix and many transient temporary arrays). Kalloc requests memory with malloc(), but it doesn't automatically give the memory back to glibc or the system. You have to call km_destroy() to free all memory allocated to a kalloc handler. This new option calls km_destroy() if the kalloc memory used in a thread is above a threshold.
Kalloc memory is freed at the end of each batch. However, if a batch is too large, kalloc memory in each thread may grow continuously. If you have 64 threads and each thread holds 2GB of thread-local storage, you will have 128GB in kalloc memory alone. EDIT: more exactly, after finishing the alignment of one query sequence, minimap2 checks whether the kalloc memory used in that worker thread is above the --cap-kalloc threshold. If it is, minimap2 frees all kalloc memory and initializes a clean kalloc handler, which starts at 8MB. When minimap2 processes the next query in the same thread, the kalloc memory grows again.
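A minimal sketch of the logic described above, written against minimap2's kalloc API (km_init/km_destroy/km_stat). This is a simplification for illustration, not the actual minimap2 source:

```c
#include <stddef.h>
#include "kalloc.h"  /* minimap2's thread-local allocator */

/* After finishing one query, reset the thread-local arena if it has
 * grown past the --cap-kalloc threshold (simplified illustration). */
static void cap_kalloc(void **km, size_t cap)
{
    km_stat_t st;
    if (cap == 0 || *km == NULL) return;
    km_stat(*km, &st);            /* st.capacity: bytes held by this arena */
    if (st.capacity > cap) {
        km_destroy(*km);          /* hand everything back to malloc()/the OS */
        *km = km_init();          /* fresh handler, ~8MB to start (see above) */
    }
}
```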
I see, thanks for the explanation. That sounds like it could help. If the batch size were made large enough to cover the whole input, would that avoid the idle threads at the end of each batch?
In this case, minimap2 would stage the entire input and output in memory, which would take a lot of RAM.
I worry there might be too many parallel heap allocations, which may hurt multi-threading efficiency. I could be wrong, though.
Thank you. Most of the time, when minimap2 spends long on a single query, the alignment would look bad anyway; it is not worth spending the time on it. As such, minimap2 implements a few heuristics (three, IIRC) to jump out of the alignment loop early when it takes too long. Could you identify the batch of reads that takes a long time to finish from the minimap2 log? If you can send me the reads, I can implement a new heuristic to prevent this from happening again.
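One documented example of an early-exit heuristic in this spirit is the Z-drop test (minimap2's -z option): extension stops once the running score falls too far below the best score seen so far. Below is a toy, self-contained illustration of the principle, not minimap2's actual code:

```c
#include <stdio.h>
#include <string.h>

/* Toy Z-drop-style early exit: extend a gapless alignment and bail out
 * when the score drops more than `zdrop` below the best score so far. */
static int extend_with_zdrop(const char *q, const char *t, int zdrop)
{
    int score = 0, best = 0;
    int n = (int)(strlen(q) < strlen(t) ? strlen(q) : strlen(t));
    for (int i = 0; i < n; ++i) {
        score += (q[i] == t[i]) ? 1 : -2;  /* match +1, mismatch -2 */
        if (score > best) best = score;
        if (best - score > zdrop) break;   /* diverging: stop wasting time */
    }
    return best;
}

int main(void)
{
    /* The second half of the target diverges, so extension stops early. */
    printf("best score: %d\n", extend_with_zdrop("ACGTACGTACGT", "ACGTACGAAAAA", 4));
    return 0;
}
```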
Just FYI: I fixed a memory-related issue in #749. Nonetheless, I don't know how much it helps to reduce memory for human data. The long running time of some reads is unrelated to this. If you can identify these offending reads, that would be great. I tried minimap2 on NA12878 ultra-long reads but could not find reads that take extra long. Thanks.
Yes, sorry for the lack of response. The data is not officially public, and the library is so large that narrowing things down is a bit difficult. I could try a fractionating or binary-partitioning approach on the fastq file, I suppose.
Is there any option for verbose output or logging from Minimap that would report the time/RAM per read ID?
I have just added a few lines to the dev branch to log the wall-clock time for each read, along with a new option to enable the log.
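Once the per-read timing is in the stderr log, the slowest reads can be ranked along these lines. This is a hypothetical sketch: the grep pattern and field positions depend on the actual format of the new log lines, which isn't shown in this thread, so adjust them accordingly:

```sh
minimap2 -t 16 -x map-ont ref.fa reads.fq > out.paf 2> mm2.log
# Hypothetical layout: assumes each timing line ends with "<read_name> <seconds>".
grep 'time' mm2.log | awk '{print $(NF-1), $NF}' | sort -k2,2gr | head -10
```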
awesome, thanks, I will rerun this and find the offending reads 👍
Alright, here are the results:
It looks like memory and time consumption (in terms of rank) are almost identical, so I have just uploaded the top 10 slowest. I will email the links separately. The file is about 10Mbp for 10 reads, so that alone is a possible explanation for why these are so slow. Though some datasets have hundreds of reads over a Mbp long, so it may be that these are particularly difficult for some other reason, like chimerism, repeats, or maybe something about the target sequence.
Caused by highly repetitive minimizers on a query sequence. The solution is to filter out these query minimizers.
Thanks a lot for the examples. It is caused by highly repetitive k-mers on a query sequence. Winnowmap2 is less affected because it chooses one minimizer in a long tandem repeat. I believe the github HEAD should have addressed the issue. Please have a try. If it works, I will run more tests and then cut a new release. By the way, for your reference genome, it is recommended to add option -I with a value larger than the total reference length, so that the whole reference is indexed in one batch.
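A hedged example of that recommendation, assuming the option in question is -I (the index batch size, default 4G); the 16g value is illustrative:

```sh
# With -I larger than the reference, the whole genome is indexed in one
# batch, avoiding a multi-part index (the default cutoff is 4g).
minimap2 -x map-ont -I 16g ref.fa reads.fq > out.paf
```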
To clarify: will these reads be aborted if they take too long, or has the minimizer selection changed?
Rerunning the full experiment without changing any other parameters is still fairly expensive because of the size of the library and the CPU usage, so unfortunately I am not sure I can justify it to my lab.
I see, thanks for the heads up. I didn't realize there was a default size cutoff.
Good point, thanks for the recommendation.
I understand. Thank you so much for running the previous experiments. The examples you sent me are very useful. I will try more real data on my end. Although, as I said, the long-running-time issue is not obvious on my data, I should still be able to see some improvement.
Minimizer selection is changed, and I think that is the right way. Older minimap2 sets a threshold on k-mer occurrences on the reference genome but not on the query sequence. For the worst sequence in your set, which should be a sequencing artifact, there are >100,000 occurrences of a poly-G k-mer. This single k-mer causes all the problems. Such k-mers should be filtered out.
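A sketch of that filtering idea (illustrative, not minimap2's implementation): given the query's minimizers sorted by k-mer hash, drop any k-mer that occurs more than a cutoff number of times on the query itself.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t hash; uint32_t pos; } mini_t;

/* `mins` holds the query's minimizers sorted by hash; remove runs of
 * identical k-mers longer than `max_occ` (e.g. >100,000 poly-G hits).
 * Returns the new number of minimizers kept. */
static size_t filter_query_minimizers(mini_t *mins, size_t n, size_t max_occ)
{
    size_t i = 0, kept = 0;
    while (i < n) {
        size_t j = i + 1;
        while (j < n && mins[j].hash == mins[i].hash) ++j; /* run of one k-mer */
        if (j - i <= max_occ)                  /* keep runs below the cutoff */
            for (size_t t = i; t < j; ++t) mins[kept++] = mins[t];
        i = j;                                 /* skip over-represented k-mers */
    }
    return kept;
}
```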
No worries, thanks for taking the time to update Minimap.
Awesome, that sounds good. And that is a very interesting sequencing error, I might look more into it.
We will probably be running similar alignments in the future as this project continues, though, so there will be more opportunities to align this data. Has the fix been merged into the master branch?
Yes, the fix is in the master branch now.
Thank you so much for the confirmation. Minimap2 v2.23 has just been released with the improvement. I will close the issue now. |