Heavy Slowdown with no output while running mmseqs linclust in a big database with not enough ram. #870

Alvaro-Nostrum · 2024-08-06T07:48:17Z

Expected Behavior

Summary: Running linclust or clust on a very large database leads to a heavy slowdown in the rescorediagonal step. I expected the job to proceed much faster. It prints the warning "Can not touch X into main memory" and then keeps running.

Current Behavior

The job is stuck at rescorediagonal with no output after several hours. However, it is still accessing the indexes inside the temporary folder.
Is there any way to fix this or speed it up?
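One way to tell a slow job from a hung one is to check whether the step's result files are still growing. The following is a minimal sketch, not part of MMseqs2: the helper names `total_size` and `growth` are hypothetical, and it assumes that rescorediagonal writes files whose names start with the output path given on its command line (`tmp/14756877054557405347/pref_rescore1` in the log under MMSeqs Output) and that those files grow while it runs. Neither assumption is confirmed in this thread.

```python
import glob
import os
import time


def total_size(prefix: str) -> int:
    """Sum the sizes in bytes of all regular files whose path starts with prefix."""
    total = 0
    for path in glob.glob(glob.escape(prefix) + "*"):
        try:
            if os.path.isfile(path):
                total += os.path.getsize(path)
        except OSError:
            # A file can disappear between listing and stat; skip it.
            pass
    return total


def growth(prefix: str, interval_s: float = 60.0) -> int:
    """Return how many bytes the matching files grew over interval_s seconds."""
    before = total_size(prefix)
    time.sleep(interval_s)
    return total_size(prefix) - before


# Example call (path copied from the rescorediagonal command in the log):
# print(growth("tmp/14756877054557405347/pref_rescore1"))
```

A result of zero over several intervals would suggest the step is stalled or buffering its output; a steadily positive value would suggest it is progressing slowly.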

MMSeqs Output

linclust JGI JGI_nr tmp --cluster-mode 2 --cov-mode 1 -c 0.99 --min-seq-id 0.95 --split-memory-limit 300G

MMseqs Version: c498f51
Cluster mode 2
Max connected component depth 1000
Similarity type 2
Threads 96
Compressed 0
Verbosity 3
Weight file name
Cluster Weight threshold 0.9
Substitution matrix aa:blosum62.out,nucl:nucleotide.out
Add backtrace false
Alignment mode 2
Alignment mode 0
Allow wrapped scoring false
E-value threshold 0.001
Seq. id. threshold 0.95
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Coverage threshold 0.99
Coverage mode 1
Max sequence length 65535
Compositional bias 1
Compositional bias 1
Max reject 2147483647
Max accept 2147483647
Include identical seq. id. false
Preload mode 0
Pseudo count a substitution:1.100,context:1.400
Pseudo count b substitution:4.100,context:5.800
Score bias 0
Realign hits false
Realign score bias -0.2
Realign max seqs 2147483647
Correlation score weight 0
Gap open cost aa:11,nucl:5
Gap extension cost aa:1,nucl:2
Zdrop 40
Alphabet size aa:21,nucl:5
k-mers per sequence 21
Spaced k-mers 0
Spaced k-mer pattern
Scale k-mers per sequence aa:0.000,nucl:0.200
Adjust k-mer length false
Mask residues 0
Mask residues probability 0.9
Mask lower case residues 0
k-mer length 0
Shift hash 67
Split memory limit 300G
Include only extendable false
Skip repeating k-mers false
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Remove temporary files false
Force restart with latest tmp false
MPI runner

kmermatcher JGI tmp/14756877054557405347/pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --alph-size aa:13,nucl:5 --min-seq-id 0.95 --kmer-per-seq 21 --spaced-kmer-mode 0 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 --cov-mode 1 -k 0 -c 0.99 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 300G --include-only-extendable 0 --ignore-multi-kmer 0 --threads 96 --compressed 0 -v 3 --cluster-weight-threshold 0.9

Database size: 1311052782 type: Aminoacid
Reduced amino acid alphabet: (A S T) (C) (D B N) (E Q Z) (F Y) (G) (H) (I V) (K R) (L J M) (P) (W) (X)

Not enough memory to process at once need to split
[=================================================================] 1.31B 2h 26m 20s 97ms
Process file into 2 parts
Generate k-mers list for 1 split
[=================================================================] 1.31B 2h 34m 42s 85ms
Sort kmer 0h 0m 52s 653ms
Sort by rep. sequence 0h 0m 31s 645ms
Generate k-mers list for 2 split
[=================================================================] 1.31B 2h 36m 22s 543ms
Sort kmer 0h 0m 44s 690ms
Sort by rep. sequence 0h 0m 26s 121ms
Merge splits ... Time for fill: 1h 31m 44s 960ms
Time for merging to pref: 0h 0m 0s 6ms
Time for processing: 10h 13m 54s 576ms
rescorediagonal JGI JGI tmp/14756877054557405347/pref tmp/14756877054557405347/pref_rescore1 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --rescore-mode 0 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.99 -a 0 --cov-mode 1 --min-seq-id 0.95 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 96 --compressed 0 -v 3

Can not touch 407600133816 into main memory
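To put the number in the warning into perspective, the sketch below converts it to GiB and compares it with the configured limit. It assumes the value is a byte count and that `300G` is read as GiB; the log states neither. The helper name `to_gib` is only illustrative.

```python
def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


# Value copied from the "Can not touch ... into main memory" warning.
touch_bytes = 407_600_133_816
# --split-memory-limit 300G, interpreted here as GiB (assumption).
limit_gib = 300

print(f"{to_gib(touch_bytes):.1f} GiB to touch vs {limit_gib} GiB limit")
# prints: 379.6 GiB to touch vs 300 GiB limit
```

Under these assumptions the data that rescorediagonal tries to bring into memory is larger than the limit passed to linclust, which is consistent with the step falling back to much slower disk access.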

Your Environment

Latest precompiled AVX2 version Release 15-6f452

xtj87515 · 2024-08-20T18:21:50Z

Did you fix the problem? I'm having similar issues but using easy-search

Alvaro-Nostrum · 2024-08-21T10:43:25Z

> Did you fix the problem? I'm having similar issues but using easy-search

Nope :(. If you manage to fix it please tell me
