Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

return success on graceful shutdown #6083

Open
scottlamb opened this issue Nov 7, 2024 · 0 comments
Open

return success on graceful shutdown #6083

scottlamb opened this issue Nov 7, 2024 · 0 comments
Labels
proposal Enhancement idea or proposal

Comments

@scottlamb
Copy link

Proposed change

I'd like the NATS server to return exit code 0 (success) after graceful shutdown, instead of exit code 1 (failure).

Use case

It should be possible to monitor the NATS server for failures, without false positives on every shutdown. Unfortunately a real failure and a successful graceful shutdown both indicate the same status to the OS.

In #3514, @kozlovic suggested a workaround of sending a SIGINT instead of SIGTERM to start graceful shutdown; on SIGINT nats will correctly return 0. Unfortunately this looks painful to achieve, at least on AWS Elastic Container Service:

  • A 2021 AWS blog post says "Today, ECS always sends a SIGTERM, but in the future you will be able to override this by adding the STOPSIGNAL directive in your Dockerfile and/or task definition." AFAICT, that future has not happened yet.
  • I could build my own Docker image on top of the official one which adds a shell wrapper that traps SIGTERM and sends SIGINT to the nats process, but this is another thing to maintain as new nats images come out.

So I'd like to re-examine the rationale for returning exit code 1 on SIGTERM. In #2103, @wallyqs wrote:

Before 2.2 series, the TERM signal used to not be handled by the server, so it would not have been a clean exit. In 2.2, it was changed to process TERM signal as a clean exit but this affects the behavior of some tools that were expecting TERM to be exit 1.

and in #3514, @kozlovic wrote:

Although I agree that SIGTERM is supposed to be a normal exit, there is history in why we need to return status 1, and it has to do with some tooling or k8s if I recall correctly. In some situations, if we were to exit(0), the process would not be restarted by manage services.

I suspect the issue is that some folks have misconfigured Kubernetes. It see it has a restartPolicy configuration attribute; I would recommend setting it to Always for nats. Then nats can return status 0 (allowing correct monitoring) and will still be restarted correctly.

As for how to proceed, I see a couple viable approaches:

  1. My preference: rip off the bandaid, make the change to correctly return 0. Call it out in the release notes, with a note to check that on Kubernetes restartPolicy is set to Always. Or, if nats maintainers deem that too risky (e.g. if you think folks will miss this note, or you fear there are other orchestrators affected that folks will not know how to configure correctly)...
  2. Alternatively: conservatively introduce a commandline flag to switch the behavior: e.g. --shutdown_exit_code with options legacy (current behavior: 1 on SIGTERM, 0 on SIGINT), 1, or 0, defaulting to legacy. Ability to control commandline flags is universal; ability to control the signal used for graceful termination is not.

Contribution

Sure!

@scottlamb scottlamb added the proposal Enhancement idea or proposal label Nov 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
proposal Enhancement idea or proposal
Projects
None yet
Development

No branches or pull requests

1 participant