Memory leakage when using abundance file #208
@cimendes I am trying to replicate what you did.
I hope this helps some. Let me know if you want to keep troubleshooting.
Oops, I think something got lost in translation when trying to link things from Google Sheets, as I only have a Linux workstation... :P And indeed I linked the wrong reference file! I'm sorry about that! (I forgot to update the file in the Zenodo repository.) But here it is. Thank you for the help in debugging this! 😍
@cimendes I am still waiting for it to finish running, but I think the issue is BioPython 1.79 (issue #207). By downgrading to BioPython 1.78, using the FASTA you posted in the last message, and removing the Windows-style line endings, this is as far as it has gotten:
I am waiting for it to finish, but this is the furthest I have been able to get with it so far. BioPython 1.79 introduced a number of deprecations and changes that seem to be breaking a lot of scripts; it should have been a major release (at least 1.80) to signal these issues. To downgrade, you can pin BioPython to version 1.78. My best guess at what is happening is that BioPython 1.79 changed the defaults of how Seq objects store their data (they are now held as bytes rather than strings).
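For reference, the downgrade itself is a one-liner. A minimal sketch, assuming a pip-managed environment (use the conda equivalent if BioPython came from conda):

```shell
# Pin BioPython back to the 1.78 release (before the Seq internals change)
pip install "biopython==1.78"

# Or, in a conda environment:
# conda install "biopython=1.78"
```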
@andersgs name your price! I'll ship it to Australia! I'll totally try that. Thank you so much!
Hahaha... You are welcome. But, don't thank me yet. Make sure it works. :)
@cimendes, I am sorry to say, it has now run long enough that I have reached the same error.
I am trying to downgrade joblib to 0.17.0, as per the Pipfile.lock file.
Hummm... @cimendes still no joy for me. Any luck for you? I am trying a slightly different approach: I cloned the repo and created an environment using pipenv and the Pipfile.lock, and I will run within this environment. Something about the dependencies might be causing problems. I think we also need a way of either capturing errors in the parallel subprocesses or running in series to see what is causing the bug.
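For anyone wanting to reproduce that locked environment, the steps might look like this (a sketch, assuming pipenv is installed and the repo lives at the usual HadrienG/InSilicoSeq location):

```shell
git clone https://github.com/HadrienG/InSilicoSeq.git
cd InSilicoSeq
# Install exactly the dependency versions recorded in Pipfile.lock
pipenv sync
# Run iss inside that pinned environment
pipenv run iss generate --help
```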
I canceled my job after your post. :( I did manage to create a sample using the coverage file without hitting this memory-leak error, but that option doesn't allow me to set the total read number, so I need to do some math to compensate. If this works, the problem is isolated to using an abundance file, or to the combination of an abundance file and a very high read number. I'll keep you posted!
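The compensating math is the standard coverage relation (reads × read length ≈ coverage × genome length). A quick helper, with purely hypothetical numbers:

```python
def reads_for_coverage(coverage, genome_length, read_length):
    """Approximate read count N from C = N * L / G (Lander-Waterman)."""
    return round(coverage * genome_length / read_length)

# Hypothetical example: 20x coverage of a 5 Mb genome with 300 bp reads
n = reads_for_coverage(20, 5_000_000, 300)  # ≈ 333,333 reads
```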
Hej folks! Thanks for reporting and trying to debug this. I just came back from vacation and will take a look. /Hadrien |
Hi guys, any news on this issue? I'm hitting the same trouble for the first time, despite having used your great tool for the past two years. I have an abundance file with 5 genomes and would like to generate 10 million reads.
RAM usage blows up and crashes the run, even though I am working on a server with 1 TB of RAM.
Here is my setup:

Any ideas, please?
Hi, I'm wondering if there's a way to clear memory with some kind of garbage collector between chromosomes. I notice that it creates files for each chromosome, and when generating reads for the next record, RAM usage doesn't decrease. Alternatively, is there a way to generate reads for each chromosome separately and then combine them? All my code is stored here:
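As far as I know InSilicoSeq doesn't expose such an option directly, but the pattern being asked about — forcing a collection pass between records — would look roughly like this in Python (generate_reads_for is a hypothetical stand-in for the per-chromosome work):

```python
import gc

def simulate_per_record(records, generate_reads_for):
    # Process one chromosome/record at a time, forcing a garbage-collection
    # pass after each so per-record buffers can be reclaimed before the next.
    outputs = []
    for record in records:
        outputs.append(generate_reads_for(record))
        gc.collect()  # reclaim cyclic garbage left by the previous record
    return outputs
```

Combining per-chromosome outputs afterwards would then just be concatenating the FASTQ files (e.g. `cat part_*.fastq > all.fastq`).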
Hi! Did you try the latest release? The new 2.0.0 version has a complete rework of the multiprocessing pipeline, which includes a memory-leak fix.
That sounds great! I haven't tried the latest release yet, but I'll definitely give it a go!
Hello!
I've been using InSilicoSeq to generate mock communities for a project of my own to assess assembly quality (https://github.com/cimendes/LMAS).
To match the distribution of a real community, I've computed an abundance file to use with InSilicoSeq. Unfortunately, when using this option, I hit the following issue:
UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
The iss execution never progresses. I've tried running it on a compute node with 250 GB of available memory and the issue still persists. Any assistance is very much appreciated.
The command that I'm running:
iss generate --genomes ZymoBIOMICS_genomes.fasta --output LMS --abundance_file Zymos\ mock\ Log\ Samples\ Abundance\ -\ Abundance\ file\ LOG.tsv --cpus 40 -n 95665106 --model miseq
The abundance file passed is available here. The complete genomes are available here
Thank you very much for your assistance!