Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix job cleanup on agents #355

Merged
merged 2 commits into from
Jan 8, 2024

Conversation

Kuba314
Copy link
Contributor

@Kuba314 Kuba314 commented Jan 2, 2024

Description

See commit message bodies for more info.

Tests

Normal SimpleNetworkRecipe execution passes, interrupting with ^C during iperf test also finishes well and doesn't crash agents. Not sure if this is enough testing though...

Closes: #354

Previously, `JobContext.cleanup()` would kill all jobs and clear its
dict. However, the killing of the jobs generates "job_finished" events,
which were previously trying to get the finished job object from job
context's dict and trying to join the job, but the dict was empty at
this point.

By not clearing the job dict and popping the job from the dict in the
"job_finished" handler (instead of just getting it) we shouldn't run
into this problem anymore.
Since `cleanup()` is now just calling `_kill_all_jobs()`, there's no
reason to leave both of these methods.

I decided to rename the "public" API from `cleanup()` to
`kill_all_jobs()` because I think that this name better describes what
the method actually does.
Copy link
Collaborator

@olichtne olichtne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks ok to me, did you check if the cleanup method was used anywhere else ? this is a public RPC API method so it could be getting called from the Controller as well, haven't checked myself...

@Kuba314
Copy link
Contributor Author

Kuba314 commented Jan 5, 2024

@olichtne The cleanup() method was on JobContext, which is only used in RemoteMethods and cannot actually be called via RPC, or have I misunderstood?

Btw, https://github.com/LNST-project/lnst/blob/master/lnst/Agent/Agent.py#L1059 get_cmd and del_cmd don't actually exist anymore, so the whole "exception" handler can be removed? Which is weird because we are probably sometimes sending "exception" messages at:

@olichtne
Copy link
Collaborator

olichtne commented Jan 5, 2024

@olichtne The cleanup() method was on JobContext, which is only used in RemoteMethods and cannot actually be called via RPC, or have I misunderstood?

ah, you're right, i misread that...


Btw, https://github.com/LNST-project/lnst/blob/master/lnst/Agent/Agent.py#L1059 get_cmd and del_cmd don't actually exist anymore, so the whole "exception" handler can be removed? Which is weird because we are probably sometimes sending "exception" messages at:

so there are 2 exception type messages:

  1. what you've linked is exception messages that are provided to send_data_to_ctl method, this sends a message from an Agent process directly to the Controller process
  2. there are exception messages from Jobs: https://github.com/LNST-project/lnst/blob/master/lnst/Agent/Job.py#L114 this is sent from a Forked Agent process running a Job._run method to the main Agent process... there it should be handled by the exception message handler you linked.

So I think this should be kept as this has a different, still existing use case.

However, I'm not sure why the get_cmd and del_cmd method calls are still there and why we haven't seen any problems with them recently.

This should be testable by runing a test module linke this:

class MyTestModule(BaseTestModule):
    def run():
        sleep(2) # simulate runtime
        raise Exception("exception from Job process")

@Kuba314
Copy link
Contributor Author

Kuba314 commented Jan 5, 2024

@olichtne I'm seeing the following in host1's DEBUG output:

    Traceback (most recent call last):
      File "/lnst/lnst/Agent/Job.py", line 263, in run
        self._result["passed"] = self._what["module"].run()
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/root/.cache/pypoetry/virtualenvs/lnst-tHnIm-F_-py3.12/bin/cache/1e90595123b49a02d0c371edd1d0e9c9d778da4db94c6657ddc1c4484b1e255e", line 12, in run
        raise Exception("exception from Job process")
    Exception: exception from Job process

This is where the test module gets run: https://github.com/LNST-project/lnst/blob/master/lnst/Agent/Job.py#L262. Any exception gets caught right here and its traceback is printed to agent's debug. There's however another try-catch which I don't think will ever trigger at https://github.com/LNST-project/lnst/blob/master/lnst/Agent/Job.py#L110.

@olichtne
Copy link
Collaborator

olichtne commented Jan 8, 2024

got it. makes sense... i'd still keep these except branch blocks and the exception message handling in the Agent for now. It may look useless right now but i'm not 100% sure and i think it at least doesn't have to be part of this PR.

@Kuba314 anything else that you want to do for this PR since you're keeping it in Draft status? or can we proceed with merging this in?

@Kuba314
Copy link
Contributor Author

Kuba314 commented Jan 8, 2024

Oops, no, sorry. It was just a proposed fix at the time of creation but now it's tested and approved by you so I'm marking it ready.

@Kuba314 Kuba314 marked this pull request as ready for review January 8, 2024 14:11
@olichtne olichtne merged commit 18993c1 into LNST-project:master Jan 8, 2024
3 checks passed
@Kuba314 Kuba314 deleted the job-context-fix-cleanup branch January 17, 2024 13:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

lnst-agent crashes on controller KeyboardInterrupt during Iperf
2 participants