info singularity repo build #96

Open
KristinaGagalova opened this issue Oct 21, 2024 · 3 comments

Comments

@KristinaGagalova
Member

KristinaGagalova commented Oct 21, 2024

Hi Darcy,

Thank you for maintaining Predector. We plan to incorporate it into our MycoProcessor pipeline at the CCDM and make it reusable as a subworkflow under a broader toolkit. Unfortunately, Predector as it stands is not well suited to integration into an existing subworkflow, especially when it comes to software installation, which we manage through individual containers.

Also, the current Predector container build uses Docker as a base and builds the Singularity image from that. A possible solution would be to create an independent container, excluding the proprietary software, and share it on DockerHub. We could then incorporate it into our pipeline more easily.

But if you have suggestions or other ideas, I would be happy to hear them!

Cheers
Kristina

@darcyabjones
Member

Hi Kristina,

James said you were doing something along those lines. Sounds like it'll be useful :)

I don't mind if you'd like to restructure things to make it easier to reuse in your pipeline.
I think you should already have permission as part of the CCDM group.
Just please keep it in a separate branch until we can check it doesn't cause any issues.

It's been a while since I've written anything new in Nextflow, but I know there's some new syntax for dealing with imports/workflows, and I think the new modules syntax would probably be good to adopt.

Regarding Docker etc.
I don't quite understand what you're proposing to change here.
I'll try to address some points you've made but maybe you could give a concrete example?
Or describe how you're managing software/containers?

The current setup is really all based on conda.
The Docker container just sets up a conda environment (from environment.yml) inside it.
You could create the conda environment in Singularity/Apptainer just as easily, but then I can't reuse that to create a Docker container.
Going through Docker makes it easier to keep things consistent between the different environments.
At the time we wrote this it was the recommended way in nf-core; maybe that's changed.

RE: A possible solution would be to create an independent container
We actually do this: when I publish a new release I push a container, "predector-base", to DockerHub (https://hub.docker.com/r/predector/predector-base), which contains everything except the proprietary stuff.
When you run install.sh it just builds a new container based on this, including the provided tarballs (exactly like the conda/mamba method does).
Both the Docker and Singularity installs use this method.

RE: we manage through individual containers
It should already be possible to run the pipeline with individual containers using a config file.
Instead of using -profile docker etc., you just create a new config and use process selectors to match the labels and set individual Docker containers (see the sketch below). But some processes will need predector-utils and/or GNU parallel, so you'd need custom containers for those.
We could possibly refactor things to separate the predector-utils steps into another process and use xargs instead of parallel... I'd have to think about it. It makes things a bit more difficult.
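
As a rough illustration, a custom config could look something like the sketch below. The label/process names and image strings here are hypothetical placeholders, not Predector's real ones.

// custom.config -- a minimal sketch with made-up selectors and images
docker.enabled = true

process {
    // processes sharing a label get the same container
    withLabel: 'some_label' {
        container = 'quay.io/biocontainers/toolname:tag'   // placeholder image
    }
    // a single process pointed at a locally built proprietary container
    withName: 'SOME_PROPRIETARY_STEP' {
        container = 'local/proprietary-tool:latest'        // hypothetical local image
    }
}

You would then run with nextflow run ... -c custom.config instead of -profile docker.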

I've found in the past that dealing with dozens of individual containers (e.g. in PanAnn/TE) was kind of a pain.
It also makes copying the Singularity containers with proprietary software up to supercomputers a bit harder.
You'd either have to copy everything, or specify both local and remote paths to containers, which the config syntax made tricky (maybe something has changed?).
But I guess trying to make all of the software play nicely with one another is a pain too.
I'm hesitant to convert Predector, but maybe I'm missing something.

Happy to talk more through any ideas you have.
I'm pretty busy for the next few weeks, but after that things will free up a bit :)

All the best,
Darcy

PS. Apologies for the essay :)

@KristinaGagalova
Member Author

KristinaGagalova commented Oct 21, 2024

Hi Darcy,

Thank you for getting back to me!

This is the way our tools are structured: a workflow-subworkflow-module organization in which each process module downloads its own container. I don't know if that was a feature of Nextflow when you wrote Predector, but it's quite handy since it's handled automatically by the pipeline.

This process structure downloads and uses a container inside the pipeline:

process TEST_PROCESS {

    // choose a Singularity image or a Docker image depending on the container engine
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/containername:tag' :
        'quay.io/biocontainers/containername:tag' }"

    input:
....

You don't need to download the containers manually, and it works well in complex pipelines since there's no need to fetch anything for the individual steps yourself.

We also tried using Conda, but it was slow and painful. Using the environment.yml file is smart and looks like the best way to build Docker/Singularity images. I have used that approach before, and it works fine.

As for your suggestions, I think this one looks the most promising:

RE: A possible solution would be to create an independent container
We actually do this: when I publish a new release I push a container, "predector-base", to DockerHub (https://hub.docker.com/r/predector/predector-base), which contains everything except the proprietary stuff.
When you run install.sh it just builds a new container based on this, including the provided tarballs (exactly like the conda/mamba method does).
Both the Docker and Singularity installs use this method.

Would that work if I had the licensed software somewhere locally and I pointed to their path? We already use SignalP this way in one of our pipelines and it works OK. What do you think?

If that won't work, I may need to do some coding on Predector, which I can add to a separate branch here. I am unsure how that would be updated/merged when there are new releases. The best option would be to use Predector as it is and call it from our pipeline, but I'll only know whether that works once I've tried the solution above.

@darcyabjones
Member

Quick reply:

RE: container download from each process module.

Yeah, this was always a feature.
The issue was really with the proprietary software, since you can't set the pipeline to download those containers automatically.
You have to manually set the paths to locally built containers (and then also handle all of the ways that users could possibly mix things up).
Or, worse, you'd have to rely on users installing the software themselves on their local computers, and field regular questions about all of the ways that can go wrong.
The single-container approach also made post-processing results easier and reduced the space required for containers, since a lot of it would otherwise be redundant, etc.

RE: Conda, but it was slow and painful.
Mamba makes it much faster these days, if you haven't come across it yet.
The only trouble I have with it now is when an old version of some package is removed and it shifts my dependencies in unexpected ways, which is usually not too hard to fix.

RE: Would that work if I had the licensed software somewhere locally and I pointed to their path?
I'm assuming that you mean "pointed to their path" when you run the pipeline, rather than when you build the second container with install.sh?
Currently no. It's something that I'm considering, but it does require modifications and some thought.

My thinking is that rather than installing inside the container, I just untar the package and modify it in a process, and then we can pass the folder to later processes. That way it can still run inside controlled containers but maybe with less initial friction. The only real issue is managing the dependencies, e.g. SignalP6 will still need some kind of environment to provide PyTorch etc. I think some of the tools don't work without running their installer, so it would need some workarounds to set things up properly.
It does also mean that you'd have to provide all of the tarballs whenever you run the pipeline, which aesthetically I don't love, but I could get over that.
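
Roughly, I'm imagining something like the sketch below. It's only a hypothetical sketch, not real Predector code; the process, channel, and parameter names are made up.

process UNPACK_SIGNALP6 {

    // runs inside an ordinary (non-proprietary) container, e.g. predector-base

    input:
    path tarball              // a user-supplied tarball, e.g. params.signalp6_tarball

    output:
    path 'signalp6', emit: dir

    script:
    """
    mkdir signalp6
    tar -xf ${tarball} -C signalp6 --strip-components=1
    """
}

workflow {
    UNPACK_SIGNALP6( file(params.signalp6_tarball) )
    // downstream processes would then take the unpacked directory as an extra input,
    // e.g. RUN_SIGNALP6( proteins_ch, UNPACK_SIGNALP6.out.dir )
}

SignalP6 etc. would still run inside controlled containers; the remaining issue is that the container environment has to provide the dependencies (e.g. PyTorch) that the installer would normally set up.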

I often feel like writing Signal peptide prediction tools just to spite the DTU and their frustrating license requirements.

RE: I am unsure how that would be updated/merged if there were new releases.
We can find a way to make it compatible, but it's easy enough to maintain parallel branches with selective git merging.
