Return scheduler job ids from FlowProject.submit #543

Open
bdice opened this issue Jun 15, 2021 · 1 comment
Labels
cluster submission Enhancements to the submission process enhancement New feature or request good first issue Good for newcomers

Comments

@bdice
Member

bdice commented Jun 15, 2021

Feature description

Requested by user @salazardetroya: https://signac.slack.com/archives/CVC04S9TN/p1623794700095400

Whenever I submit a job with sbatch ... I typically obtain the job ID as output. I'd like to obtain that job ID using signac-flow.

This would enable complex submission workflows through something like the following snippet:

from flow import FlowProject

class Project(FlowProject):
    pass

project = Project()
scheduler_job_ids = project.submit(...)

# Wait until the last of the previous jobs has completed
more_job_ids = project.submit(..., after=scheduler_job_ids[-1])

Proposed solution

We used to (partially) support this kind of behavior for PBS/Torque clusters, but we never implemented it for SLURM. If we choose to support this feature, we would need to implement it for all schedulers so that the API is consistent. The past implementation (removed in 0.12) looked like this:

try:
    output = subprocess.check_output(
        submit_cmd + [tmp_submit_script.name])
    jobsid = output.decode('utf-8').strip()
except subprocess.CalledProcessError as e:
    # Note: e.output is an attribute, not a method; the original
    # code called e.output(), which would raise a TypeError.
    raise SubmitError("qsub error: {}".format(e.output))
return jobsid

One possible issue with this approach is that not all clusters necessarily behave the same: some might print additional messages or info via stdout/stderr that would break the parsing.
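As a sketch of how that parsing could be made more robust, the helper below (a hypothetical name, not part of signac-flow) scans every line of the captured output for SLURM's default "Submitted batch job <id>" message rather than assuming the job id is the only thing printed:

```python
import re

def parse_slurm_job_id(output):
    """Extract a SLURM job id from sbatch's stdout, or return None.

    SLURM's default output is "Submitted batch job <id>", but clusters
    may print banner or informational lines around it, so we scan all
    lines instead of treating the whole output as the id.
    """
    for line in output.splitlines():
        match = re.search(r"Submitted batch job (\d+)", line)
        if match:
            return match.group(1)
    return None
```

Each scheduler class would need its own variant of this, since PBS/Torque and LSF format their submission output differently.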

The return value of the scheduler class (the part I linked above) would need to be forwarded through a series of calling functions to the return value of FlowProject.submit. I think it might be appropriate to return a list of job ids as strings, since FlowProject.submit can call sbatch (or a different scheduler command) multiple times.

To add this feature, here are the steps I would suggest:

  1. Make the internal function _call_submit return the captured output from its subprocess.check_output(...) call. This applies to all schedulers.
  2. Parse the output and extract the scheduler job id in each scheduler class. For the SLURM scheduler, the line to edit is:

     return _call_submit(submit_cmd, script, pretend)

     Ask for help if you need someone else to test schedulers for which you don't have access to a test cluster.
  3. Change the behavior of the ComputeEnvironment class to pass through the captured scheduler job id if submission occurs (instead of JobStatus.submitted, which could be inferred by the calling functions) and None if submission didn't run or failed. The current code:

     if cls.get_scheduler().submit(script, flags=flags, *args, **kwargs):
         return JobStatus.submitted
     return None
  4. Refactor FlowProject._submit_operations to pass through scheduler job ids, just like in the previous step (signac-flow/flow/project.py, lines 3691 to 3693 in 9d4f1b4):

     return self._environment.submit(
         _id=_id, script=script, flags=flags, **kwargs
     )
  5. Finally, change the behavior of FlowProject.submit to return job ids (and continue to update the job/operation status on success, as interpreted by the result of the above method calls). See signac-flow/flow/project.py, lines 3782 to 3791 in 9d4f1b4:

     status = self._submit_operations(
         operations=bundle,
         parallel=parallel,
         force=force,
         **kwargs,
     )
     if status is not None:
         # Operations were submitted, store status
         for operation in bundle:
             status_update[operation.id] = status
  6. Test on a system with a scheduler.
  7. Update docs.
  8. Decide whether the FlowProject CLI (python project.py submit) should print the ids returned by the FlowProject.submit method.
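To illustrate steps 3 through 5, here is a minimal sketch (with hypothetical names, not the actual signac-flow internals) of how job ids could be collected across multiple submission calls and returned as a list, with None signaling a submission that did not run or failed:

```python
def submit_bundles(bundles, submit_operations):
    """Collect scheduler job ids from multiple submission calls.

    ``submit_operations`` stands in for FlowProject._submit_operations
    and is assumed to return the scheduler job id as a string on
    success, or None if submission did not run or failed.
    """
    scheduler_job_ids = []
    for bundle in bundles:
        job_id = submit_operations(bundle)
        if job_id is not None:
            # Submission succeeded; record the id so the caller can
            # build dependencies (e.g. after=...) on this job.
            scheduler_job_ids.append(job_id)
    return scheduler_job_ids
```

Returning a list of strings matches the proposal above, since FlowProject.submit can invoke sbatch (or another scheduler command) once per bundle.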

Additional context

Another alternative would be to just return the raw captured stdout and leave it to the user to parse that information. In that case, FlowProject.submit would return a list of strings, each containing the raw output of one call to sbatch (instead of a list of strings of parsed job ids).
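Under that alternative, id extraction would happen on the user's side. A rough sketch of what that could look like, assuming SLURM's default "Submitted batch job <id>" output format (the variable names here are illustrative, not part of any proposed API):

```python
import re

def user_side_ids(raw_outputs):
    """Parse job ids out of raw sbatch stdout strings returned by
    a hypothetical FlowProject.submit that does no parsing itself."""
    ids = []
    for out in raw_outputs:
        match = re.search(r"Submitted batch job (\d+)", out)
        if match:
            ids.append(match.group(1))
    return ids
```

The downside is that every user would have to know their scheduler's output format, which is exactly the cluster-to-cluster variability noted above.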

@bdice bdice added cluster submission Enhancements to the submission process enhancement New feature or request labels Jun 15, 2021
@vyasr
Contributor

vyasr commented Jun 17, 2021

#3 is partially related

@kidrahahjo kidrahahjo added this to the v0.16.0 milestone Jun 24, 2021
@bdice bdice removed this from the v0.16.0 milestone Aug 16, 2021
@kidrahahjo kidrahahjo added the good first issue Good for newcomers label Feb 27, 2022