prefilter step died when running easy-search: Segmentation fault (core dumped) #882

Open
szimmerman92 opened this issue Aug 26, 2024 · 2 comments

Comments

@szimmerman92

Expected Behavior

easy-search should finish execution without errors.

Current Behavior

Error during pre-filter step

Index table k-mer threshold: 0 at k-mer size 15
Index table: counting k-mers
Segmentation fault (core dumped) ] 0.00% 1 eta -
Error: Prefilter died
Error: Search step died
Error: Search died

Steps to Reproduce (for bugs)

First, create a custom nucleotide database:

mmseqs createdb --dbtype 2 --compressed 1 refseq_bacteria_archaea_fungi_viral.fna.gz seqTaxDB
mmseqs createtaxdb seqTaxDB tmp --ncbi-tax-dump ncbi-taxdump --tax-mapping-file fastaid_taxid.tsv

Next, run easy-search:

mmseqs easy-search all_nuc.fasta seqTaxDB tax_assignments.txt tmp --search-type 3 --min-seq-id 0.65 -e 0.01 -c 0.8 --cov-mode 2 --threads 16

MMseqs Output (for bugs)

Below is the output of easy-search

easy-search all_nuc.fasta seqTaxDB tax_assignments.txt tmp --search-type 3 --min-seq-id 0.65 -e 0.01 -c 0.8 --cov-mode 2 --threads 16

MMseqs Version: 8ef39f4
Substitution matrix aa:blosum62.out,nucl:nucleotide.out
Add backtrace false
Alignment mode 3
Alignment mode 0
Allow wrapped scoring false
E-value threshold 0.01
Seq. id. threshold 0.65
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Coverage threshold 0.8
Coverage mode 2
Max sequence length 65535
Compositional bias 1
Compositional bias 1
Max reject 2147483647
Max accept 2147483647
Include identical seq. id. false
Preload mode 0
Pseudo count a substitution:1.100,context:1.400
Pseudo count b substitution:4.100,context:5.800
Score bias 0
Realign hits false
Realign score bias -0.2
Realign max seqs 2147483647
Correlation score weight 0
Gap open cost aa:11,nucl:5
Gap extension cost aa:1,nucl:2
Zdrop 40
Threads 16
Compressed 0
Verbosity 3
Seed substitution matrix aa:VTML80.out,nucl:nucleotide.out
Sensitivity 5.7
k-mer length 0
Target search mode 0
k-score seq:2147483647,prof:2147483647
Alphabet size aa:21,nucl:5
Max results per query 300
Split database 0
Split mode 2
Split memory limit 0
Diagonal scoring true
Exact k-mer matching 0
Mask residues 1
Mask residues probability 0.9
Mask lower case residues 0
Minimum diagonal score 15
Selected taxa
Spaced k-mers 1
Spaced k-mer pattern
Local temporary path
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Mask profile 1
Profile E-value threshold 0.001
Global sequence weighting false
Allow deletions false
Filter MSA 1
Use filter only at N seqs 0
Maximum seq. id. threshold 0.9
Minimum seq. id. 0.0
Minimum score per column -20
Minimum coverage 0
Select N most diverse seqs 1000
Pseudo count mode 0
Min codons in orf 30
Max codons in length 32734
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 1
Forward frames 1,2,3
Reverse frames 1,2,3
Translation table 1
Translate orf 0
Use all table starts false
Offset of numeric ids 0
Create lookup 0
Add orf stop false
Overlap between sequences 0
Sequence split mode 1
Header split mode 0
Chain overlapping alignments 0
Merge query 1
Search type 3
Search iterations 1
Start sensitivity 4
Search steps 1
Prefilter mode 0
Exhaustive search mode false
Filter results during exhaustive search 0
Strand selection 1
LCA search mode false
Disk space limit 0
MPI runner
Force restart with latest tmp false
Remove temporary files true
Alignment format 0
Format alignment output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits
Database output false
Overlap threshold 0
Database type 0
Shuffle input database true
Createdb mode 0
Write lookup file 0
Greedy best hits false

createdb all_nuc.fasta tmp/7701176895607249840/query --dbtype 0 --shuffle 1 --createdb-mode 0 --write-lookup 0 --id-offset 0 --compressed 0 -v 3

Converting sequences
[1335322] 2s 17mss
Time for merging to query_h: 0h 0m 0s 221ms
Time for merging to query: 0h 0m 1s 64ms
Database type: Nucleotide
Time for processing: 0h 0m 4s 959ms
Create directory tmp/7701176895607249840/search_tmp
search tmp/7701176895607249840/query seqTaxDB tmp/7701176895607249840/result tmp/7701176895607249840/search_tmp --alignment-mode 3 -e 0.01 --min-seq-id 0.65 -c 0.8 --cov-mode 2 --threads 16 -s 5.7 --search-type 3 --remove-tmp-files 1

splitsequence seqTaxDB tmp/7701176895607249840/search_tmp/9045538653068861586/target_seqs_split --max-seq-len 10000 --sequence-overlap 0 --sequence-split-mode 1 --headers-split-mode 0 --create-lookup 0 --threads 16 --compressed 0 -v 3

[=================================================================] 100.00% 22.15M 12s 856ms
Time for merging to target_seqs_split_h: 0h 0m 31s 837ms
Time for merging to target_seqs_split: 0h 0m 35s 517ms
Time for processing: 0h 1m 59s 373ms
extractframes tmp/7701176895607249840/query tmp/7701176895607249840/search_tmp/9045538653068861586/query_seqs --forward-frames 1 --reverse-frames 1 --create-lookup 0 --threads 16 --compressed 0 -v 3

[=================================================================] 100.00% 1.34M 0s 620ms
Time for merging to query_seqs_h: 0h 0m 0s 734ms
Time for merging to query_seqs: 0h 0m 2s 576ms
Time for processing: 0h 0m 5s 91ms
splitsequence tmp/7701176895607249840/search_tmp/9045538653068861586/query_seqs tmp/7701176895607249840/search_tmp/9045538653068861586/query_seqs_split --max-seq-len 10000 --sequence-overlap 0 --sequence-split-mode 1 --headers-split-mode 0 --create-lookup 0 --threads 16 --compressed 0 -v 3

[=================================================================] 100.00% 2.67M 0s 919ms
Time for merging to query_seqs_split_h: 0h 0m 0s 832ms
Time for merging to query_seqs_split: 0h 0m 0s 878ms
Time for processing: 0h 0m 3s 919ms
prefilter tmp/7701176895607249840/search_tmp/9045538653068861586/query_seqs_split tmp/7701176895607249840/search_tmp/9045538653068861586/target_seqs_split tmp/7701176895607249840/search_tmp/9045538653068861586/search/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -k 15 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 10000 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 2 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 1 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 16 --compressed 0 -v 3 -s 5.7

Query database size: 2670930 type: Nucleotide
Target split mode. Searching through 18 splits
Estimated memory consumption: 326G
Target database size: 100684280 type: Nucleotide
Process prefiltering step 1 of 18

Index table k-mer threshold: 0 at k-mer size 15
Index table: counting k-mers
Segmentation fault (core dumped) ] 0.00% 1 eta -
Error: Prefilter died
Error: Search step died
Error: Search died

Context

Hi, I am trying to run a nucleotide-nucleotide search in MMseqs2 with a custom database. This error does not occur with a different, smaller nucleotide database.

Thank you very much for this amazing tool and all your hard work.

Your Environment

I am using a Google Cloud VM with 64 CPUs and 416 GB of memory, running Ubuntu 20.04.

I installed MMseqs2 with the following command:

# static build with AVX2 (fastest)
wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz; tar xvfz mmseqs-linux-avx2.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH
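
For reference, the prefilter log reports `Estimated memory consumption: 326G`, which can be compared against the memory the machine actually has. The following is a minimal sketch, assuming the run's output was saved to a file and that the system exposes a Linux-style `/proc/meminfo`. The helper names `estimated_mem_gb` and `total_mem_gb` are hypothetical and not part of MMseqs2. The units in the two sources may not match exactly (GB vs. GiB), so the comparison is only a rough one.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the number N from the first line of the form
# "Estimated memory consumption: <N>G" in a saved MMseqs2 log.
estimated_mem_gb() {
    # $1: path to the saved log file
    sed -n 's/^Estimated memory consumption: \([0-9][0-9]*\)G.*$/\1/p' "$1" | head -n 1
}

# Hypothetical helper: print MemTotal as whole GiB (rounded down)
# from a /proc/meminfo-style file, where values are given in kB.
total_mem_gb() {
    # $1: meminfo file (defaults to /proc/meminfo)
    awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' "${1:-/proc/meminfo}"
}
```

With the output saved to, say, `easy-search.log` (an assumed file name), `estimated_mem_gb easy-search.log` and `total_mem_gb` print the two figures side by side. An estimate close to or above the installed memory would be consistent with a memory problem, but does not prove it.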

@github-staff github-staff deleted a comment Aug 27, 2024
@yuvaranimasarapu

I get the same error when running a nucleotide search in MMseqs2 against the NCBI NT database. I am running on our internal server with 256 GB of memory.

@jasmezz

jasmezz commented Sep 26, 2024

I've encountered segfault errors with MMseqs2 caused by insufficient memory (which, according to a quick web search, is a valid cause of segfaults). Large databases like NT/GTDB might need around 900 GB of RAM, so I would guess that too little RAM is the reason in your cases as well.
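
If memory is indeed the cause, one thing that may be worth trying is an explicit per-split memory cap: the parameter dump in the report shows `Split memory limit 0`, meaning `--split-memory-limit` was left at its default. The sketch below derives a limit from the currently available memory and prints (does not run) the command from the report with that flag added. The 80% fraction and the helper names `split_limit_gb` and `print_search_cmd` are assumptions for illustration, and whether this avoids the segfault here is untested.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print roughly 80% of MemAvailable as whole GiB
# (rounded down), read from a /proc/meminfo-style file (values in kB).
split_limit_gb() {
    # $1: meminfo file (defaults to /proc/meminfo)
    awk '/^MemAvailable:/ { print int($2 / 1024 / 1024 * 8 / 10) }' "${1:-/proc/meminfo}"
}

# Print the easy-search command from the report with an explicit memory
# limit appended. Review the printed command before running it.
print_search_cmd() {
    # $1: memory limit in GiB
    echo "mmseqs easy-search all_nuc.fasta seqTaxDB tax_assignments.txt tmp" \
         "--search-type 3 --min-seq-id 0.65 -e 0.01 -c 0.8 --cov-mode 2" \
         "--threads 16 --split-memory-limit ${1}G"
}
```

Calling `print_search_cmd "$(split_limit_gb)"` prints the command with a limit sized to the current machine. A lower limit should make the prefilter use more, smaller target splits, trading run time for peak memory.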
