We are starting a new sprint to support running Dask workflows in REANA. Please let us know what you think, and share any other desiderata you may have. Thanks for your input!
Goals
The main goals are:
Allow using the Dask library to define, launch, and orchestrate data analysis jobs, instead of, or in addition to, using declarative workflow languages such as Snakemake (see the sketch after this list).
Allow reinterpreting a Dask-based analysis in the future, i.e. bring up the desired Dask cluster version and worker image before re-executing the analysis, and tear it down afterwards.
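For context, here is a minimal sketch of what the first goal means in practice: defining a small task graph with `dask.delayed` and executing it through a distributed `Client`. The file names and function bodies are placeholders, and the local `Client()` stands in for the REANA-managed Dask cluster that the platform would bring up.

```python
# Minimal sketch of defining and launching analysis jobs with the Dask library.
# Function bodies and file names are placeholders; in REANA the Client would
# point at the platform-managed Dask scheduler instead of a local cluster.
from dask import delayed
from dask.distributed import Client

client = Client()  # local cluster here; a REANA-managed scheduler in production


@delayed
def select_events(path):
    # Placeholder for reading one input file and filtering events.
    return len(path)


@delayed
def merge(partials):
    # Placeholder for combining partial results into a final quantity.
    return sum(partials)


# Build the task graph lazily; nothing runs until compute() is called.
partial_results = [select_events(f) for f in ["a.root", "b.root", "c.root"]]
result = merge(partial_results).compute()  # executed by the Dask workers
print(result)
```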
Use cases
As a researcher,
I would like to bring up a Dask cluster for my workflow runs,
so that I can use Dask task graphs in my analyses.
As a researcher,
I would like to use a particular Dask cluster version and worker image,
so that I can ensure my analysis can be reinterpreted correctly even several years later.
As a researcher,
I would like to configure the amount of necessary Dask resources such as CPU and RAM,
so that my analysis can be run efficiently.
As a researcher,
I would like to mount my REANA secrets alongside Dask jobs,
so that I can profit from the usual Kerberos or VOMS authentication to access remote resources.
As a researcher,
I would like to see the logs of my Dask jobs in the regular REANA logging system,
so that I can be informed about the workflow progress or errors in the usual manner.
As a researcher,
I would like to list all my workflows using Dask clusters and their statuses,
so that I can make sure that I have not left behind anything unnecessary.
As a cluster administrator,
I would like to specify the list of vetted (allowed and recommended) images to be used for Dask workflows,
so that the cluster stays safe from running possibly vulnerable images.
As a cluster administrator,
I would like to inspect who is using which Dask cluster,
so that I can quickly get in touch with researchers in case of problems.
As a cluster administrator,
I would like to configure various Dask resource limits for users,
so that workflows asking for excessive resources can be filtered out early.
As a cluster administrator,
I would like to benefit from the auto-scaling features during user workflow execution,
so that my cluster uses resources only when really necessary.
Discussion
User configuration
If one Dask cluster is sufficient for the entire analysis, the reana.yaml could look like:
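For illustration, a minimal sketch of what such a reana.yaml might look like. The `dask` resource section and its key names (`image`, `number_of_workers`, `single_worker_memory`) are assumptions meant to show the kind of information a user would declare (a pinned worker image for later reinterpretation, CPU/RAM amounts); the exact schema is one of the things this sprint should settle.

```yaml
inputs:
  files:
    - analysis.py
workflow:
  type: serial
  resources:
    # Hypothetical section; exact key names to be decided during the sprint.
    dask:
      image: docker.io/myorg/my-dask-worker:2024.1.0  # pinned for future reuse
      number_of_workers: 4
      single_worker_memory: 2Gi
  specification:
    steps:
      - commands:
          - python analysis.py
outputs:
  files:
    - results/histogram.png
```

Declaring the cluster next to the workflow steps would keep the whole analysis, including its Dask environment, reproducible from a single file.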