
RFC supporting Dask workflows #823

Open
tiborsimko opened this issue Aug 21, 2024 · 0 comments

About

We are starting a new sprint to support running Dask workflows in REANA. Please let us know what you think and please share any other desiderata you may have. Thanks for your input!

Goals

The main goals are:

  • Allow using the Dask library to define, launch, and orchestrate data analysis jobs, instead of or in addition to using declarative workflow languages such as Snakemake.

  • Allow reinterpreting a Dask-based analysis in the future, i.e. bring up the desired Dask cluster version and image before re-executing the analysis, and tear it down afterwards.

Use cases

  1. As a researcher,
    I would like to bring up a Dask cluster for my workflow runs,
    so that I can use Dask task graphs in my analyses.

  2. As a researcher,
    I would like to use a particular Dask cluster version and the Dask worker image,
    so that I can ensure my analysis can be reinterpreted correctly even several years later.

  3. As a researcher,
    I would like to configure the amount of necessary Dask resources such as CPU and RAM,
    so that my analysis can be run efficiently.

  4. As a researcher,
    I would like to mount my REANA secrets alongside Dask jobs,
    so that I can profit from the usual Kerberos or VOMS authentications to access remote resources.

  5. As a researcher,
    I would like to see the logs of my Dask jobs in the regular REANA logging system,
    so that I can be informed about the workflow progress or errors in the usual manner.

  6. As a researcher,
    I would like to list all my workflows using Dask clusters and their statuses,
    so that I can make sure that I have not left behind anything unnecessary.

  7. As a cluster administrator,
    I would like to specify the list of vetted (allowed and recommended) images to be used for Dask workflows,
    so that the cluster stays safe from running possibly vulnerable images.

  8. As a cluster administrator,
    I would like to inspect who is using which Dask cluster,
    so that I can quickly get in touch with researchers in case of problems.

  9. As a cluster administrator,
    I would like to configure various Dask resource limits for users,
    so that workflows asking for exaggerated resources can be filtered early.

  10. As a cluster administrator,
    I would like to benefit from the auto-scaling features during user workflow execution,
    so that my cluster uses resources only when really necessary.
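For context on use case 1: a Dask task graph is essentially a dict mapping keys to values or to `(callable, *args)` tuples. The toy evaluator below is only a sketch to illustrate that structure; Dask's own schedulers add caching, parallelism, and distributed execution on top of it:

```python
# A Dask-style task graph: keys map to literal values or (callable, *args)
# tuples whose arguments may themselves be keys in the graph.
from operator import add, mul

graph = {
    "x": 1,
    "y": (add, "x", 10),  # y = x + 10
    "z": (mul, "y", 2),   # z = y * 2
}

def get(dsk, key):
    """Minimal recursive evaluator, for illustration only."""
    task = dsk[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        return func(*(get(dsk, a) if a in dsk else a for a in args))
    return task

# get(graph, "z") evaluates y first, then z.
```

In real analyses one would rarely write such graphs by hand; they are produced by Dask collections (`dask.array`, `dask.dataframe`) or `dask.delayed`, and executed on the cluster that REANA would bring up.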

Discussion


User configuration

If one Dask cluster is sufficient for the entire analysis, the reana.yaml could look like:

inputs:
  files:
    - myanalysis.py
workflow:
  type: serial
  resources:
    dask:
      version: 2023.1.0
      cores: 16
      memory: "96 GB"
  specification:
    steps:
      - name: mystep
        environment: docker.io/mygroup/myenvironment:2.13.4
        commands:
        - python myanalysis.py  # uses DASK_SCHEDULER_URI=tcp://10.96.4.51:8786
outputs:
  files:
    - myplot.png
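With this configuration, REANA would inject the scheduler endpoint into the job environment via the `DASK_SCHEDULER_URI` variable shown in the comment above. A minimal sketch of how `myanalysis.py` might pick it up; the helper name and local fallback address are illustrative assumptions, not part of the RFC:

```python
import os

def scheduler_address(default="tcp://127.0.0.1:8786"):
    # REANA is expected to set DASK_SCHEDULER_URI in the job environment;
    # fall back to a local scheduler address when running outside REANA.
    return os.environ.get("DASK_SCHEDULER_URI", default)

# Inside myanalysis.py one would then connect (requires dask.distributed):
#   from dask.distributed import Client
#   client = Client(scheduler_address())
#   ...submit task graphs via client or Dask collections...
```

Reading the address from the environment rather than hard-coding it keeps the analysis script unchanged between local runs and REANA re-executions.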
