Heavy Slowdown with no output while running mmseqs linclust in a big database with not enough ram. #870

Open
Alvaro-Nostrum opened this issue Aug 6, 2024 · 2 comments

Comments

@Alvaro-Nostrum

Alvaro-Nostrum commented Aug 6, 2024

Expected Behavior

Summary: Running linclust or clust on a very large database leads to a heavy slowdown in the rescorediagonal step. I expected the job to proceed much faster. It prints a warning that says "Can not touch X into main memory" and the job keeps running.

Current Behavior

The job is stuck at rescorediagonal and has produced no output for several hours. It is, however, still accessing the index files inside the temporary folder.
Is there any way to fix this or speed it up?

MMSeqs Output

linclust JGI JGI_nr tmp --cluster-mode 2 --cov-mode 1 -c 0.99 --min-seq-id 0.95 --split-memory-limit 300G

MMseqs Version: c498f51
Cluster mode 2
Max connected component depth 1000
Similarity type 2
Threads 96
Compressed 0
Verbosity 3
Weight file name
Cluster Weight threshold 0.9
Substitution matrix aa:blosum62.out,nucl:nucleotide.out
Add backtrace false
Alignment mode 2
Alignment mode 0
Allow wrapped scoring false
E-value threshold 0.001
Seq. id. threshold 0.95
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Coverage threshold 0.99
Coverage mode 1
Max sequence length 65535
Compositional bias 1
Compositional bias 1
Max reject 2147483647
Max accept 2147483647
Include identical seq. id. false
Preload mode 0
Pseudo count a substitution:1.100,context:1.400
Pseudo count b substitution:4.100,context:5.800
Score bias 0
Realign hits false
Realign score bias -0.2
Realign max seqs 2147483647
Correlation score weight 0
Gap open cost aa:11,nucl:5
Gap extension cost aa:1,nucl:2
Zdrop 40
Alphabet size aa:21,nucl:5
k-mers per sequence 21
Spaced k-mers 0
Spaced k-mer pattern
Scale k-mers per sequence aa:0.000,nucl:0.200
Adjust k-mer length false
Mask residues 0
Mask residues probability 0.9
Mask lower case residues 0
k-mer length 0
Shift hash 67
Split memory limit 300G
Include only extendable false
Skip repeating k-mers false
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Remove temporary files false
Force restart with latest tmp false
MPI runner

kmermatcher JGI tmp/14756877054557405347/pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --alph-size aa:13,nucl:5 --min-seq-id 0.95 --kmer-per-seq 21 --spaced-kmer-mode 0 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 --cov-mode 1 -k 0 -c 0.99 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 300G --include-only-extendable 0 --ignore-multi-kmer 0 --threads 96 --compressed 0 -v 3 --cluster-weight-threshold 0.9

kmermatcher JGI tmp/14756877054557405347/pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --alph-size aa:13,nucl:5 --min-seq-id 0.95 --kmer-per-seq 21 --spaced-kmer-mode 0 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 --cov-mode 1 -k 0 -c 0.99 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 300G --include-only-extendable 0 --ignore-multi-kmer 0 --threads 96 --compressed 0 -v 3 --cluster-weight-threshold 0.9

Database size: 1311052782 type: Aminoacid
Reduced amino acid alphabet: (A S T) (C) (D B N) (E Q Z) (F Y) (G) (H) (I V) (K R) (L J M) (P) (W) (X)

Not enough memory to process at once need to split
[=================================================================] 1.31B 2h 26m 20s 97ms
Process file into 2 parts
Generate k-mers list for 1 split
[=================================================================] 1.31B 2h 34m 42s 85ms
Sort kmer 0h 0m 52s 653ms
Sort by rep. sequence 0h 0m 31s 645ms
Generate k-mers list for 2 split
[=================================================================] 1.31B 2h 36m 22s 543ms
Sort kmer 0h 0m 44s 690ms
Sort by rep. sequence 0h 0m 26s 121ms
Merge splits ... Time for fill: 1h 31m 44s 960ms
Time for merging to pref: 0h 0m 0s 6ms
Time for processing: 10h 13m 54s 576ms
rescorediagonal JGI JGI tmp/14756877054557405347/pref tmp/14756877054557405347/pref_rescore1 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --rescore-mode 0 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.99 -a 0 --cov-mode 1 --min-seq-id 0.95 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 96 --compressed 0 -v 3

Can not touch 407600133816 into main memory
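
For scale, the number in the warning can be put next to the 300G passed via --split-memory-limit. The sketch below is only a rough sanity check, and it assumes the number in the warning is a byte count (the log does not state the unit). Because it is not clear from the log whether "300G" means GiB or GB, both conversions are shown. The helper names (`to_gib`, `to_gb`, `physical_ram_bytes`) are made up for this illustration and are not part of MMseqs2.

```python
import os

# Values copied from the log above.
TOUCH_SIZE = 407_600_133_816   # number in the warning; ASSUMED to be bytes
SPLIT_LIMIT_G = 300            # value given to --split-memory-limit

def to_gib(n_bytes):
    """Convert bytes to binary gigabytes (1 GiB = 1024**3 bytes)."""
    return n_bytes / 1024**3

def to_gb(n_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 1000**3 bytes)."""
    return n_bytes / 1000**3

def physical_ram_bytes():
    """Return total physical RAM in bytes, or None if the platform
    does not expose it through os.sysconf (for example on Windows)."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (ValueError, OSError, AttributeError):
        return None

if __name__ == "__main__":
    print(f"warning size: {to_gib(TOUCH_SIZE):.2f} GiB / {to_gb(TOUCH_SIZE):.2f} GB")
    print(f"split limit : {SPLIT_LIMIT_G} G")
    ram = physical_ram_bytes()
    if ram is None:
        print("physical RAM: not available on this platform")
    else:
        print(f"physical RAM: {to_gib(ram):.2f} GiB")
        if ram < TOUCH_SIZE:
            print("The size in the warning is larger than physical RAM.")
```

Under that assumption the size in the warning is roughly 380 GiB, which is above 300G in either unit. Note also that the rescorediagonal command line printed above carries no --split-memory-limit argument, unlike the kmermatcher call.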

Your Environment

Latest precompiled AVX2 version Release 15-6f452

@xtj87515

Did you fix the problem? I'm having similar issues but using easy-search

@Alvaro-Nostrum
Author

> Did you fix the problem? I'm having similar issues but using easy-search

Nope :(. If you manage to fix it, please tell me.
