Revisions stay in ContainerHealthy=False status forever #15487
Comments
The idea we have to fix this would be to set the ContainerHealthy condition back to true near the code where it is actually set to false. See SaschaSchwarze0@9aab510 for details. Then one could probably remove anything related to the ContainerHealthy condition from revision_lifecycle.go (not included in that commit). Let us know if that makes sense; we can make a PR out of it.
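Roughly, the direction would be something like the sketch below; the types and helper names are stand-ins for illustration, the actual change is in the linked commit:

```go
// Package sketch illustrates the proposed direction from SaschaSchwarze0@9aab510;
// types and method names are stand-ins, not the exact Knative code.
package sketch

import corev1 "k8s.io/api/core/v1"

// revisionStatus is a hypothetical stand-in for the RevisionStatus marker
// methods used in pkg/apis/serving/v1/revision_lifecycle.go.
type revisionStatus interface {
	MarkContainerHealthyFalse(reason, message string)
	MarkContainerHealthyTrue()
}

// reconcileContainerHealth decides the ContainerHealthy condition in both
// directions at the same place, based on the pod's current container statuses:
// a currently failing container marks it False, and once every container is
// ready again it is flipped back to True instead of staying False forever.
func reconcileContainerHealth(rs revisionStatus, pod *corev1.Pod) {
	allReady := true
	for _, cs := range pod.Status.ContainerStatuses {
		if t := cs.State.Terminated; t != nil && t.ExitCode != 0 {
			// Existing behavior: a failed container marks the revision unhealthy.
			rs.MarkContainerHealthyFalse(t.Reason, t.Message)
			return
		}
		if !cs.Ready {
			allReady = false
		}
	}
	// Proposed addition: reset the condition once nothing is failing anymore.
	if allReady {
		rs.MarkContainerHealthyTrue()
	}
}
```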
Hi @SaschaSchwarze0, would you be able to make a PR so we can test it? cc @dprotaso who reviewed the previous fix.
One thing that is confusing is whether availability should be enough, since a Deployment is available when a number of replicas have been ready for minReadySeconds (0 in Knative). In the past, the ContainerHealthy condition was added to signal readiness check completion.
I am wondering if we still need that condition, or if we could just propagate the message of a container that exited using the ResourcesAvailable condition only. We do propagate the Deployment status, so if availability resets to ready=true then it should reset the availability for the Revision. I think the status propagation model needs to be documented, as several users have asked how to detect failures.
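For illustration, a minimal sketch of the aggregation as I understand it (not the actual knative.dev/pkg ConditionSet code): Ready is effectively the conjunction of the dependent conditions, so a single stuck ContainerHealthy=False keeps the whole Revision not ready even after the Deployment recovers.

```go
package main

import "fmt"

// conditions is a simplified stand-in for the Revision's dependent conditions.
type conditions map[string]bool

// ready mirrors the idea that the top-level Ready condition is only true
// when every dependent condition is true.
func ready(c conditions) bool {
	for _, ok := range c {
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	c := conditions{
		"ResourcesAvailable": true,  // deployment recovered, status propagated
		"ContainerHealthy":   false, // stuck, never reset after the restart
	}
	fmt.Println(ready(c)) // false: the stale sub-condition wins
}
```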
Hi @skonto, thanks for checking that issue. I have opened the above commit as PR #15503.
The code change we are proposing is NOT based on the Deployment status. I understand that there are reasons to distinguish between ResourcesAvailable and ContainerHealthy from an end user's perspective. I think the complexity that exists is the scaling of Deployments through autoscaling: one would, imo, not want a Revision to become not ready just because it scales up and the newly added Pod is not (yet) scheduled/ready. Without that complexity, calculating the two conditions purely based on the Deployment status would be no big deal. And yeah, I also agree that clear documentation stating what the conditions mean and when they do (and do not) transition would be helpful.
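To make that last point concrete, a naive sketch (a hypothetical helper, not existing Knative code) of deriving both conditions purely from the Deployment's Available condition:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// conditionsFromDeployment derives both ResourcesAvailable and ContainerHealthy
// solely from the Deployment's Available condition. Transient states during
// scale-up (newly created, not-yet-ready Pods) are exactly what the real
// reconciler has to tolerate, which is why a mapping this naive is debatable.
func conditionsFromDeployment(d *appsv1.Deployment) (resourcesAvailable, containerHealthy bool) {
	for _, c := range d.Status.Conditions {
		if c.Type == appsv1.DeploymentAvailable {
			available := c.Status == corev1.ConditionTrue
			return available, available
		}
	}
	return false, false
}

func main() {
	d := &appsv1.Deployment{}
	d.Status.Conditions = []appsv1.DeploymentCondition{
		{Type: appsv1.DeploymentAvailable, Status: corev1.ConditionTrue},
	}
	fmt.Println(conditionsFromDeployment(d)) // true true
}
```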
In what area(s)?
/area API
We think that #14744 introduced a regression due to the change it made in revision_lifecycle.go. Let's go through it:
At any point in time, there might be a failure in a container in a Pod that is part of the KService. Errors like OOM are recoverable: Kubernetes will restart the container, and assuming the OOM does not immediately come back, the Pod will be in Running state and everything works.
When the OOM happens, the logic in https://github.com/knative/serving/blob/v0.42.2/pkg/reconciler/revision/reconcile_resources.go#L107 will call MarkContainerHealthyFalse and set the ContainerHealthy condition of the Revision to False. So far so good.

The only code location that resets this is https://github.com/knative/serving/blob/v0.42.2/pkg/apis/serving/v1/revision_lifecycle.go#L202, but this code never runs anymore since #14744. Basically, if ContainerHealthy is False, then resUnavailable is set to true in https://github.com/knative/serving/blob/v0.42.2/pkg/apis/serving/v1/revision_lifecycle.go#L173. If resUnavailable is true, the code in https://github.com/knative/serving/blob/v0.42.2/pkg/apis/serving/v1/revision_lifecycle.go#L198-L203 is never entered.
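In simplified form, the control flow we believe #14744 introduced looks like this (a runnable sketch with illustrative names, not the actual revision_lifecycle.go code):

```go
package main

import "fmt"

// Minimal stand-in for the Revision's condition state, for illustration only.
type status struct {
	containerHealthy   bool
	resourcesAvailable bool
}

// propagate mimics the control flow described above: a ContainerHealthy=False
// condition forces resUnavailable=true, which in turn skips the only block
// that could ever set ContainerHealthy back to true.
func propagate(s *status, deploymentAvailable bool) {
	resUnavailable := !deploymentAvailable

	// Change from #14744 (simplified): an unhealthy container also counts
	// as "resources unavailable".
	if !s.containerHealthy {
		resUnavailable = true
	}

	if resUnavailable {
		s.resourcesAvailable = false
		return // <- the reset below is never reached
	}

	// The only place ContainerHealthy is reset, unreachable once it is false.
	s.containerHealthy = true
	s.resourcesAvailable = true
}

func main() {
	s := &status{containerHealthy: false}
	propagate(s, true)                // deployment recovered after the restart
	fmt.Println(s.containerHealthy)   // prints false forever
}
```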
What version of Knative?
Latest release.
Expected Behavior
The ContainerHealthy condition is set back to True when all containers are healthy.
Actual Behavior
The ContainerHealthy condition is never set from False to True.
Steps to Reproduce the Problem
1. Deploy a Knative Service using the image ghcr.io/src2img/http-synthetics:latest (code is from https://github.com/src2img/http-synthetics), and specify a memory limit on the container, for example 100M. Use minScale and maxScale of 1.
2. Call curl -X PUT http://endpoint/claim-memory?amount=500000000
The call will fail because the container will go OOM. The revision switches to ContainerHealthy=False. Kubernetes restarts the container and it will be running again. But the revision status will never change back to healthy.
We are actually uncertain whether it is even conceptually correct that the Revision changes the ContainerHealthy condition after the Revision was fully ready. But we are not sure how Knative specifies the behavior there.