return success on graceful shutdown #6083

scottlamb · 2024-11-07T01:18:24Z

Proposed change

I'd like the NATS server to return exit code 0 (success) after graceful shutdown, instead of exit code 1 (failure).

Use case

It should be possible to monitor the NATS server for failures, without false positives on every shutdown. Unfortunately a real failure and a successful graceful shutdown both indicate the same status to the OS.

In #3514, @kozlovic suggested a workaround of sending a SIGINT instead of SIGTERM to start graceful shutdown; on SIGINT nats will correctly return 0. Unfortunately this looks painful to achieve, at least on AWS Elastic Container Service:

A 2021 AWS blog post says "Today, ECS always sends a SIGTERM, but in the future you will be able to override this by adding the STOPSIGNAL directive in your Dockerfile and/or task definition." AFAICT, that future has not happened yet.
I could build my own Docker image on top of the official one which adds a shell wrapper that traps SIGTERM and sends SIGINT to the nats process, but this is another thing to maintain as new nats images come out.

So I'd like to re-examine the rationale for returning exit code 1 on SIGTERM. In #2103, @wallyqs wrote:

Before 2.2 series, the TERM signal used to not be handled by the server, so it would not have been a clean exit. In 2.2, it was changed to process TERM signal as a clean exit but this affects the behavior of some tools that were expecting TERM to be exit 1.

and in #3514, @kozlovic wrote:

Although I agree that SIGTERM is supposed to be a normal exit, there is history in why we need to return status 1, and it has to do with some tooling or k8s if I recall correctly. In some situations, if we were to exit(0), the process would not be restarted by manage services.

I suspect the issue is that some folks have misconfigured Kubernetes. It see it has a restartPolicy configuration attribute; I would recommend setting it to Always for nats. Then nats can return status 0 (allowing correct monitoring) and will still be restarted correctly.

As for how to proceed, I see a couple viable approaches:

My preference: rip off the bandaid, make the change to correctly return 0. Call it out in the release notes, with a note to check that on Kubernetes restartPolicy is set to Always. Or, if nats maintainers deem that too risky (e.g. if you think folks will miss this note, or you fear there are other orchestrators affected that folks will not know how to configure correctly)...
Alternatively: conservatively introduce a commandline flag to switch the behavior: e.g. --shutdown_exit_code with options legacy (current behavior: 1 on SIGTERM, 0 on SIGINT), 1, or 0, defaulting to legacy. Ability to control commandline flags is universal; ability to control the signal used for graceful termination is not.

Contribution

Sure!

The text was updated successfully, but these errors were encountered:

scottlamb added the proposal Enhancement idea or proposal label Nov 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

return success on graceful shutdown #6083

return success on graceful shutdown #6083

scottlamb commented Nov 7, 2024

return success on graceful shutdown #6083

return success on graceful shutdown #6083

Comments

scottlamb commented Nov 7, 2024

Proposed change

Use case

Contribution