-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix job cleanup on agents #355
Conversation
Previously, `JobContext.cleanup()` would kill all jobs and clear its dict. However, the killing of the jobs generates "job_finished" events, which were previously trying to get the finished job object from job context's dict and trying to join the job, but the dict was empty at this point. By not clearing the job dict and popping the job from the dict in the "job_finished" handler (instead of just getting it) we shouldn't run into this problem anymore.
Since `cleanup()` is now just calling `_kill_all_jobs()`, there's no reason to leave both of these methods. I decided to rename the "public" API from `cleanup()` to `kill_all_jobs()` because I think that this name better describes what the method actually does.
b73abb7
to
7522b94
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks ok to me, did you check if the cleanup
method was used anywhere else ? this is a public RPC API method so it could be getting called from the Controller as well, haven't checked myself...
@olichtne The Btw, https://github.com/LNST-project/lnst/blob/master/lnst/Agent/Agent.py#L1059 |
ah, you're right, i misread that...
so there are 2 exception type messages:
So I think this should be kept as this has a different, still existing use case. However, I'm not sure why the This should be testable by runing a test module linke this:
|
@olichtne I'm seeing the following in host1's DEBUG output:
This is where the test module gets run: https://github.com/LNST-project/lnst/blob/master/lnst/Agent/Job.py#L262. Any exception gets caught right here and its traceback is printed to agent's debug. There's however another try-catch which I don't think will ever trigger at https://github.com/LNST-project/lnst/blob/master/lnst/Agent/Job.py#L110. |
got it. makes sense... i'd still keep these except branch blocks and the exception message handling in the Agent for now. It may look useless right now but i'm not 100% sure and i think it at least doesn't have to be part of this PR. @Kuba314 anything else that you want to do for this PR since you're keeping it in Draft status? or can we proceed with merging this in? |
Oops, no, sorry. It was just a proposed fix at the time of creation but now it's tested and approved by you so I'm marking it ready. |
Description
See commit message bodies for more info.
Tests
Normal SimpleNetworkRecipe execution passes, interrupting with ^C during iperf test also finishes well and doesn't crash agents. Not sure if this is enough testing though...
Closes: #354