OptiType fails sometimes with BAM not found #419

Closed
tavinathanson opened this issue Feb 9, 2017 · 26 comments

@tavinathanson

Trying with @armish's setup (since mine didn't work; see #418), I get several failures like the one below, which I believe are issues with OptiType itself:

### Kube-Job 5ae7ce4b-d459-5b3c-a30d-6cd9528d6291
### Freshness: Fresh
### Output:

User
biokepi
Host
5ae7ce4b-d459-5b3c-a30d-6cd9528d6291
Machine
Linux 5ae7ce4b-d459-5b3c-a30d-6cd9528d6291 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 x86_64 x86_64 GNU/Linux
biokepi
biokepi
No export var
/tmp/_MEIq1H09H/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
Killed
Killed

0:00:00.47 Mapping f003a1843a6c739551bcfd981af8afd7_checkpoint-trials_lung_tumor_bams_mafs_SN0110394_bams_mafs_old_samples_IN_MCC_00234_T1_bamIN_MCC_00234_T1-b2fq-PE_R1.fastq to GEN reference...

0:14:11.61 Mapping f003a1843a6c739551bcfd981af8afd7_checkpoint-trials_lung_tumor_bams_mafs_SN0110394_bams_mafs_old_samples_IN_MCC_00234_T1_bamIN_MCC_00234_T1-b2fq-PE_R2.fastq to GEN reference...

0:27:58.74 Generating binary hit matrix.
Traceback (most recent call last):
  File "<string>", line 267, in <module>
  File "hlatyper.py", line 177, in pysam_to_hdf
  File "pysam/calignmentfile.pyx", line 333, in pysam.calignmentfile.AlignmentFile.__cinit__ (pysam/calignmentfile.c:4808)
  File "pysam/calignmentfile.pyx", line 533, in pysam.calignmentfile.AlignmentFile._open (pysam/calignmentfile.c:7027)
IOError: file `tumor_dna_processing_IN_MCC_00234/2017_02_09_10_38_37/2017_02_09_10_38_37_1.bam` not found
OptiTypePipeline returned -1

Digging a little deeper, I noticed:

@tavinathanson

So I don't think it's related to getting BAMs as input, since it's not following that code path.

Rather, it appears to do this:

https://github.com/FRED-2/OptiType/blob/master/OptiTypePipeline.py#L286

Then:

https://github.com/FRED-2/OptiType/blob/master/OptiTypePipeline.py#L294

Then, I think it fails at:

https://github.com/FRED-2/OptiType/blob/master/OptiTypePipeline.py#L298

@tavinathanson

For whatever reason, it looks like https://github.com/FRED-2/OptiType/blob/master/OptiTypePipeline.py#L288 didn't result in a BAM being created?

I also see that the BAMs get removed when done, which explains why the other successes don't have BAMs there.

@tavinathanson

This is a dup. of what @armish hit in RCC: https://github.com/hammerlab/rcc-analyses/issues/104

Leaving it open since this is the more general repo.

@tavinathanson

Tried running this manually in the VM. Some more information:

Aborted (core dumped)

0:09:19.04 Mapping 0e8070747629c84c18f763603fea9545_checkpoint-trials_lung_tumor_bams_mafs_SN0109695_bams_new_samples_AG538184-7_bamAG538184-7-b2fq-PE_R2.fastq to GEN reference...
/nfs-pool/biokepi/toolkit/biopam-kit/opam_dir/opam-root-root-optitype.1.0.0/0.0.0/build/seqan.2.1.0/include/seqan/basic/basic_exception.h:363 FAILED!  (Uncaught exception of type std::bad_alloc: std::bad_alloc)

stack trace:
  0                      [0x72c93d]
  1                      [0x75a146]
  2                      [0x75a191]
  3                      [0x75b149]
  4                      [0x74f99c]
  5                      [0x4091c4]
  6                      [0x47846a]
  7                      [0x4d2bb2]
  8                      [0x72bc38]
  9                      [0x401b23]
 10                      [0x810dc6]
 11                      [0x810fba]
 12                      [0x404cd9]

Aborted (core dumped)

0:18:12.60 Generating binary hit matrix.
Traceback (most recent call last):
  File "<string>", line 267, in <module>
  File "hlatyper.py", line 177, in pysam_to_hdf
  File "pysam/calignmentfile.pyx", line 333, in pysam.calignmentfile.AlignmentFile.__cinit__ (pysam/calignmentfile.c:4808)
  File "pysam/calignmentfile.pyx", line 533, in pysam.calignmentfile.AlignmentFile._open (pysam/calignmentfile.c:7027)
IOError: file `tumor_dna_processing_AG538184-7/2017_02_10_16_29_59/2017_02_10_16_29_59_1.bam` not found
OptiTypePipeline returned -1
(/nfs-pool/biokepi//toolkit/biopam-kit/envs/optitype.1.0.0) opam@115e92c3b8c5:/nfs-pool-16/biokepi/work/results-b37decoy-tumor_dna_processing_AG538184-7/a8365d6f969a40ef3f6fa69c0a56ed62tumor_dna_processing_AG538184-7DNA0e8070747629c84c18f763603fea9545_checkpoint-trials_lung_tumor_bams_mafs_SN0109695_bams_new_samples_AG538184-7_bamAG538184-7-b2fq-PE_R1_fastqoptitype.d$

@tavinathanson

Seems like an OOM situation. Looks like the 10 that failed, at first glance, were relatively large FASTQs; this might be relevant: seqan/seqan#1276

@ihodes added the bug label Feb 10, 2017
@ihodes commented Feb 10, 2017

Could you try remaking the cluster with bigger nodes? Or is there an argument you can pass to OptiType that tells it to use all 52GB of the default nodes?

@tavinathanson

@ihodes I'm first trying manually on a beefed-up node; but if that works, how do I remake the cluster with bigger nodes?

@ihodes commented Feb 10, 2017

I'm not sure, to be honest; you might be able to change it from the GCloud GKE interface, or you could take down the cluster you have and start a new one with a different node type… @smondet do you know?

@smondet commented Feb 10, 2017

@ihodes I've never tried to change the machine type "live".

The machine type is an option of coclobas configure ..., and then each job requests some amount of CPU/memory; so the Biokepi.Machine.t also has to ask for more in its run_program (right now we use the defaults everywhere).

@tavinathanson

Confirmed that this is a memory issue: when running the same commands manually on 30GB memory vs. 120GB memory, it fails on the former and succeeds on the latter.

@ihodes commented Feb 14, 2017

Do we know if we can filter reads to the MHC locus and save a lot of space? If so, we should add this filtering step to the pipeline in Biokepi.

@tavinathanson

@ihodes see #423; I don't think that would address these memory issues, because that filtering would be via razerS3, which is also where the OOM is within OptiType.

@ihodes commented Feb 14, 2017

Fair enough; I wonder if we could use BWA-mem to do this filtering instead?

@tavinathanson

@ihodes probably, though it's not OptiType's recommendation:

You can use any read mapper to do this step, although we suggest you use RazerS3. Its only drawback is that due to the way RazerS3 was designed, it loads all reads into memory, which could be a problem on older, low-memory computing nodes.

@tavinathanson

Per @smondet's instructions, I ran on larger cluster nodes as follows:

# Ctrl-C in the Coclobas-server screen tab
coclobas cluster delete --root /coclo/_cocloroot/
coclobas configure --root _cocloroot/ --cluster-name $CLUSTER_NAME --cluster-zone $GCLOUD_ZONE --max-nodes $CLUSTER_MAX_NODES --machine-type n1-standard-32
screen -t Coclobas-server coclobas start-server --root _cocloroot/ --port 8082 # Don't use start-all; this will overwrite the coclobas configure command

Replaced my biokepi_machine.ml with his new one, which adds support for customizing CPU/memory limits: https://github.com/hammerlab/coclobas/blob/f690ab74f1ce88ccb75d047c87e7f4eb314f7ba7/tools/docker/biokepi_machine.ml

And then:

export KUBE_JOB_CPUS=32
export KUBE_JOB_MEMORY=118

Confirmed that my GCP instance group had the right node type. Then re-ran my jobs.

We'll see if that works!
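(As a hedged aside for anyone repeating this: one way to confirm the node type, assuming the gcloud CLI is configured for the project, is to list the cluster's instances and check the MACHINE_TYPE column. The name filter below is illustrative, since GKE node names are typically prefixed with gke-<cluster-name>.)

# Illustrative check of the instance group's machine type via the gcloud CLI.
gcloud compute instances list --filter="name~gke-${CLUSTER_NAME}"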

@tavinathanson

Success!

@tavinathanson

Spoke too soon. 1 out of the 9 remaining jobs still failed with the same error :(.

@tavinathanson reopened this Feb 15, 2017
@ihodes commented Feb 15, 2017 via email

@tavinathanson

@ihodes it's 81GB, which I didn't think was particularly larger than the others, but I could be misremembering.

@tavinathanson

@ihodes I was wrong; it is the largest one. Sigh. At least the problem is clear, but I'm becoming more convinced by your suggestion to filter with something other than razerS3.

@ihodes commented Feb 15, 2017

It may be the only way forward… or you could switch to 250+GB machines for extremely expensive runs. hammerlab/coclobas#19 will also help with degenerate cases like these in the future.

@ihodes closed this as completed Feb 15, 2017
@tavinathanson commented Feb 15, 2017

@ihodes yeah, I already kicked off a 208GB machine run. Let's see if that works.

@tavinathanson reopened this Feb 15, 2017
@tavinathanson

It worked!

@maryawood

I've been experiencing the same error that @tavinathanson described here with a set of files I'm working with, but it doesn't appear to be a memory issue: requesting a machine with increased memory doesn't eliminate the problem, and I've been able to run OptiType without error on larger fastq files from a different dataset. Further, when I try to run the razerS3 command on the command line, it doesn't return an error, but it still doesn't produce a BAM file.

I'm at a bit of a loss for what to do. Any ideas as to what the problem may be?

@armish commented Jul 24, 2017

@maryawood: unfortunately that still sounds like a memory issue or something related to it. Depending on the depth/coverage of your sequencing data, the memory requirements for razerS3 can go through the roof, and since this is down to the way razerS3 keeps the data in memory, there is very little you can do.

I have been experimenting with different approaches, and I found that using bwa mem to filter down the reads makes the pipeline run much faster and with a very small memory footprint; testing this approach over a largish cohort of patients (~100), I found that the bwa pre-filtering doesn't really bias or affect the results in any way.

Here is the modified pipeline:

  • Index the HLA reference file that comes with OptiType: e.g. bwa index $OPTITYPE_HOME/data/hla_reference_dna
  • Map your reads against this reference sequence and filter out all reads that do not map (-F 4). You can do this for each pair individually: e.g. bwa mem $OPTITYPE_HOME/data/hla_reference_dna your.pair1.fastq | samtools fastq -F 4 - > filtered.hla.pair1.fastq
  • Run the standard OptiType pipeline using these two new fastqs as input
  • Voilà! (A combined sketch for paired-end reads follows below.)
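Putting those steps together, a minimal sketch for a paired-end sample might look like the following. Assumptions not taken from the thread: bwa and samtools are on the PATH, $OPTITYPE_HOME points at the OptiType checkout, the FASTQ names and output directory are placeholders, and the exact reference filename and OptiTypePipeline.py flags should be checked against the installed OptiType version.

#!/usr/bin/env bash
# Hedged sketch of the bwa-mem pre-filtering approach described above.
# Placeholder inputs: tumor_R1.fastq / tumor_R2.fastq; output dir optitype_out/.
set -euo pipefail

REF="$OPTITYPE_HOME/data/hla_reference_dna"   # adjust if your copy uses a .fasta extension

# 1. Index the HLA reference that ships with OptiType (only needed once).
bwa index "$REF"

# 2. Keep only reads that map to the HLA reference (-F 4 drops unmapped reads),
#    filtering each mate file independently as suggested above.
bwa mem "$REF" tumor_R1.fastq | samtools fastq -F 4 - > filtered.hla.R1.fastq
bwa mem "$REF" tumor_R2.fastq | samtools fastq -F 4 - > filtered.hla.R2.fastq

# 3. Run the standard OptiType pipeline on the much smaller filtered FASTQs
#    (check the flags against the README of your OptiType version).
python "$OPTITYPE_HOME/OptiTypePipeline.py" \
    -i filtered.hla.R1.fastq filtered.hla.R2.fastq \
    --dna -v -o optitype_out/

This mirrors armish's suggestion above; the only changes from the one-liner are the explicit output redirection and running both mate files.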

@maryawood

@armish thanks so much for the suggestion! I will give this a try.
