EPPIC 3 pipeline using Docker/Luigi #2

Open
sbliven opened this issue Jun 24, 2016 · 1 comment

sbliven commented Jun 24, 2016

The current pipeline will need to be modified a fair bit to account for changes between EPPIC 2 and 3. EPPIC 3 differs in config file location, parameters, output directory layout, and jetty setup, to name a few differences. This makes it an excellent time to make some changes to the pipeline scripts.

Desirable features include

  • Single configuration file for the whole pipeline. Scripts should never require editing.
  • Isolate parallel instances. Multiple runs should never overwrite each other.
  • Workflows for pipeline, topup, and local development runs should share scripts where they overlap and should be launched in similar ways.
  • Easy installation & configuration. Setting up a new developer or a new computer should require little more than checking the pipeline out of git and installing a short list of prerequisites.
  • The system should work across hosts (dev, production, merlin) with as little user interaction as possible.
  • Step-wise translation from existing infrastructure
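
To make the first two features concrete, here is a minimal sketch of how one config file and per-run directories could fit together. Everything in it is an assumption for illustration: the `[pipeline]` section, the `output_root` and `host` keys, and the helper names are hypothetical, not existing EPPIC settings.

```python
import configparser
import tempfile
import uuid
from pathlib import Path

# Hypothetical contents of the single pipeline-wide config file.
EXAMPLE_CONFIG = """
[pipeline]
output_root = {root}
host = dev
"""


def load_config(text):
    """Parse the pipeline settings from INI-formatted text."""
    # Interpolation is disabled so that '%' in paths is taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser["pipeline"]


def new_run_dir(output_root):
    """Create a uniquely named directory for one pipeline run."""
    run_dir = Path(output_root) / f"run-{uuid.uuid4().hex}"
    # exist_ok=False makes an accidental collision fail loudly
    # instead of silently overwriting another run.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        cfg = load_config(EXAMPLE_CONFIG.format(root=root))
        first = new_run_dir(cfg["output_root"])
        second = new_run_dir(cfg["output_root"])
        print(first.name, second.name)
```

Scripts would then take the run directory as an input rather than writing to a fixed location, so two runs started at the same time cannot touch each other's files.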

After researching it a bit, I think that docker & luigi satisfy these requirements pretty well. This combo has been used by other bioinformatics pipelines (e.g. medium

Docker provides encapsulation of executables with basically no overhead. All dependencies and setup are specified in a file and are completely reproducible. It does require a daemon to be installed on the computer, so it wouldn't work on Merlin unless AIT supports it.

I think docker would be great for running the WUI, since all config files can be included at the right location automatically. It is also common to run MySQL in a docker container, so it doesn't need to be installed separately, e.g. by new developers.

Luigi is a lightweight workflow manager. You define computational tasks as short Python classes (meshes well with Kumaran's Python scripts), which declare their prerequisites and outputs. Luigi takes care of running the dependencies. It supports ssh and remote execution too, so we can automate the full pipeline including data transfers. It can be installed as a simple Python dependency (so Merlin integration should be OK), or can be run with a daemon to provide a nice progress-tracking website and visualizations. All tasks can be easily configured using text files.
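
The idea of tasks that declare prerequisites and outputs can be sketched with the standard library alone. This is a toy stand-in, not Luigi's API: the method names `requires`, `output`, and `run` mirror the ones Luigi tasks use, but the `Task` base class, the `build` function, and the two example tasks are hypothetical and written only for this illustration.

```python
import tempfile
from pathlib import Path


class Task:
    """Toy workflow task (a stand-in, not Luigi's real base class)."""

    def __init__(self, workdir):
        self.workdir = Path(workdir)

    def requires(self):
        """Tasks that must be complete before this one runs."""
        return []

    def output(self):
        """Path of the file this task produces."""
        raise NotImplementedError

    def run(self):
        """Do the work and write the output file."""
        raise NotImplementedError

    def complete(self):
        # A task counts as done once its output file exists.
        return self.output().exists()


def build(task, log):
    """Run prerequisites depth-first, skipping tasks already complete."""
    if task.complete():
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(type(task).__name__)


class FetchInput(Task):  # hypothetical task
    def output(self):
        return self.workdir / "input.txt"

    def run(self):
        self.output().write_text("1abc\n2xyz\n")


class CountEntries(Task):  # hypothetical task
    def requires(self):
        return [FetchInput(self.workdir)]

    def output(self):
        return self.workdir / "count.txt"

    def run(self):
        lines = self.requires()[0].output().read_text().splitlines()
        self.output().write_text(str(len(lines)))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        log = []
        build(CountEntries(workdir), log)
        build(CountEntries(workdir), log)  # nothing left to do
        print(log)  # ['FetchInput', 'CountEntries']
```

Because completeness is judged by whether the output exists, rerunning after a failure resumes from the first missing output rather than starting over, which is the property that makes this style attractive for long pipelines.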

I'm thinking that we could migrate scripts step-wise, starting from the WUI where I'm doing the most development now. Most existing scripts can just be called using Python's subprocess module, perhaps after replacing hard-coded parameters with inputs.
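
A rough sketch of that wrapping approach, under stated assumptions: the "legacy script" here is simulated by an inline Python one-liner that echoes its arguments, and the `--db` and `--out` flags and the `eppic_test` name are hypothetical placeholders for values that are hard-coded today.

```python
import subprocess
import sys


def legacy_command(db_name, out_dir):
    """Build the argument list for a (simulated) existing script.

    Values that used to be hard-coded inside the script are passed
    in as command-line arguments instead.
    """
    # Stand-in for a real script: prints the arguments it received.
    code = "import sys; print(' '.join(sys.argv[1:]))"
    return [sys.executable, "-c", code, "--db", db_name, "--out", out_dir]


def run_script(argv):
    """Run the command, fail on a non-zero exit code, return stdout."""
    result = subprocess.run(
        argv,
        capture_output=True,  # collect stdout/stderr instead of inheriting
        text=True,            # decode output to str
        check=True,           # raise CalledProcessError on failure
    )
    return result.stdout


if __name__ == "__main__":
    print(run_script(legacy_command("eppic_test", "out")).strip())
```

Using `check=True` means a failing script raises an exception in the calling task, so a workflow manager can mark that step as failed instead of continuing with missing outputs.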

@josemduarte
Contributor

I totally agree that we need a better pipeline with all the features you describe.

In terms of software choices, I would totally support Luigi as the pipeline method of choice. Incidentally, we have started using it here for the mmtf project, so we can benefit from each other's experience.

Regarding docker, I would be more reluctant to use it unless we see our setup is so complicated that we can't do without it. Why I'd like to avoid it:

  • Adds an unnecessary layer of complication
  • Needs to be installed as a daemon: potential problems on Scientific Linux (or other exotic distros) and especially a problem on Merlin (need sysadmins to agree and so on)
  • I see docker is very popular in the python world because dependencies are a real nightmare to deal with... We don't have that problem in our case thanks to maven. So that advantage wouldn't apply to us
  • At the end of the day, I don't see that we have so many things that can't live together in the same box/VM:
    • Configuration: at the moment partly hard-coded, but that's easy to solve with a few easy changes (see Add config file/directory path to all eppic executables eppic#131)
    • Jetty configuration: in my opinion it'd be better to use jetty's built-in ability to serve several apps within the same jetty server instance. It might take us some time to find out how to do it for both eppic 2 and 3, but it gives us a lot of control over what we can do with it. Otherwise simply using 2 jetty instances is not a bad solution. With docker you would end up with 2 jetty instances too, only totally isolated from each other. But I don't see much possibility for the 2 instances to clash.

It would be good to discuss this in a skype session some time soon.
