Revisions stay in ContainerHealthy=False status forever #15487

Open
SaschaSchwarze0 opened this issue Aug 30, 2024 · 3 comments · May be fixed by #15503
Labels: area/API, kind/bug

Comments


SaschaSchwarze0 commented Aug 30, 2024

In what area(s)?

/area API

We think that #14744 introduced a regression due to the change it made in revision_lifecycle.go. Let's go through it:

At any point in time, a container in a pod that is part of the KService may fail. Errors like OOM are recoverable: Kubernetes restarts the container and, assuming the OOM does not immediately come back, the pod is in Running state again and everything works.

When the OOM happens, the logic in https://github.com/knative/serving/blob/v0.42.2/pkg/reconciler/revision/reconcile_resources.go#L107 calls MarkContainerHealthyFalse and sets the ContainerHealthy condition of the revision to False. So far so good.
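
As a minimal Go sketch of the kind of check described here (the function name is made up, and the exact MarkContainerHealthyFalse signature is assumed from the helper referenced above):

```go
package revision

import (
	corev1 "k8s.io/api/core/v1"
	v1 "knative.dev/serving/pkg/apis/serving/v1"
)

// markUnhealthyOnContainerExit is an illustrative sketch, not the upstream
// function: when a container in one of the revision's pods terminated
// abnormally (e.g. OOMKilled), mark the revision's ContainerHealthy
// condition False with the terminated state's reason and message.
func markUnhealthyOnContainerExit(rev *v1.Revision, pod *corev1.Pod) {
	for _, cs := range pod.Status.ContainerStatuses {
		if t := cs.LastTerminationState.Terminated; t != nil && t.ExitCode != 0 {
			// Assumed signature of the status helper mentioned in this issue.
			rev.Status.MarkContainerHealthyFalse(t.Reason, t.Message)
			return
		}
	}
}
```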

The only code location that resets this is https://github.com/knative/serving/blob/v0.42.2/pkg/apis/serving/v1/revision_lifecycle.go#L202, but since #14744 that code never runs anymore: if ContainerHealthy is False, then resUnavailable is set to true in https://github.com/knative/serving/blob/v0.42.2/pkg/apis/serving/v1/revision_lifecycle.go#L173, and if resUnavailable is true, the code in https://github.com/knative/serving/blob/v0.42.2/pkg/apis/serving/v1/revision_lifecycle.go#L198-L203 is never entered.
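
To make the short circuit easier to see, here is a condensed, self-contained paraphrase of that flow with simplified stand-in types (not the upstream API):

```go
package main

import "fmt"

// revisionStatus is a deliberately simplified stand-in for the real
// RevisionStatus type; only the control flow matters here.
type revisionStatus struct {
	containerHealthy bool // current value of the ContainerHealthy condition
}

// propagate paraphrases the flow in revision_lifecycle.go described above.
func propagate(rs *revisionStatus, deploymentAvailable bool) {
	resUnavailable := !deploymentAvailable

	// Paraphrase of revision_lifecycle.go#L173: an already-unhealthy container
	// also counts as "resources unavailable".
	if !rs.containerHealthy {
		resUnavailable = true
	}

	// Paraphrase of revision_lifecycle.go#L198-L203: the only place that sets
	// ContainerHealthy back to True. The guard is always false once
	// containerHealthy is false, because the check above forces
	// resUnavailable to true.
	if !resUnavailable {
		rs.containerHealthy = true
	}
}

func main() {
	rs := &revisionStatus{containerHealthy: false} // state right after the OOM kill
	propagate(rs, true)                            // the deployment is available again
	fmt.Println(rs.containerHealthy)               // prints "false": the condition never recovers
}
```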

What version of Knative?

Latest release.

Expected Behavior

The ContainerHealthy condition is set back to True when all containers are healthy.

Actual Behavior

The ContainerHealthy condition is never set from False to True.

Steps to Reproduce the Problem

  1. Create a KSvc using the image ghcr.io/src2img/http-synthetics:latest (code is from https://github.com/src2img/http-synthetics) and specify a memory limit on the container, for example 100M. Use minScale and maxScale of 1.
  2. Wait for the KSvc to become ready.
  3. Call its endpoint with curl -X PUT http://endpoint/claim-memory?amount=500000000

The call will fail because the container will go OOM. The revision switches to ContainerHealthy=False. Kubernetes restarts the container and it will be running again. But the revision status will never change back to healthy.

We are actually uncertain whether it is even conceptually correct for the revision to change the ContainerHealthy condition after the revision was fully ready, but we are not sure how Knative specifies the behavior there.

SaschaSchwarze0 added the kind/bug label Aug 30, 2024
knative-prow bot added the area/API label Aug 30, 2024

SaschaSchwarze0 commented Aug 30, 2024

The idea we have to fix this would be to set the ContainerHealthy condition back to True near the code where it is actually set to False. See SaschaSchwarze0@9aab510 for details. Then one could probably remove anything related to the ContainerHealthy condition from revision_lifecycle.go (not included in that commit).
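
For illustration only, a minimal sketch of the shape such a fix could take (the linked commit is the authoritative version; the function name is made up and MarkContainerHealthyTrue is assumed from the existing reset code in revision_lifecycle.go):

```go
package revision

import (
	corev1 "k8s.io/api/core/v1"
	v1 "knative.dev/serving/pkg/apis/serving/v1"
)

// markHealthyWhenRecovered sketches the proposed addition: once every
// container in the revision's pods reports Ready again, set ContainerHealthy
// back to True next to the place where the False transition happens today.
func markHealthyWhenRecovered(rev *v1.Revision, pods []*corev1.Pod) {
	for _, pod := range pods {
		for _, cs := range pod.Status.ContainerStatuses {
			if !cs.Ready {
				return // at least one container is still not ready
			}
		}
	}
	// Assumed counterpart of the helper used for the False transition.
	rev.Status.MarkContainerHealthyTrue()
}
```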

Let us know if that makes sense, we can make a PR out of it.


skonto commented Sep 9, 2024

Hi @SaschaSchwarze0, would you be able to make a PR so we can test it? cc @dprotaso who reviewed the previous fix.

We are actually uncertain if it is even conceptually correct that the revision changes the ContainerHealthy condition after the revision was fully ready

One thing that is confusing is whether availability should be enough, since a deployment is available when a number of replicas have been ready for minReadySeconds (0 in Knative). In the past, the ContainerHealthy condition was added to signal readiness check completion.

// RevisionConditionContainerHealthy is set when the revision readiness check completes.
RevisionConditionContainerHealthy apis.ConditionType = "ContainerHealthy"

I am wondering if we still need that condition at all, or whether we could just propagate the message of a container that exited using the ResourcesAvailable condition only. We do propagate the deployment status, so if availability resets to ready=true it should reset the availability for the revision.
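
As a rough illustration of that alternative (not existing Knative code; the helper is made up), the signal could be derived directly from the deployment's Available condition:

```go
package revision

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// resourcesAvailableFromDeployment is an illustrative sketch: derive a single
// ResourcesAvailable signal, including reason and message, from the
// deployment's own Available condition instead of maintaining a separate
// ContainerHealthy condition.
func resourcesAvailableFromDeployment(d *appsv1.Deployment) (ok bool, reason, message string) {
	for _, c := range d.Status.Conditions {
		if c.Type == appsv1.DeploymentAvailable {
			return c.Status == corev1.ConditionTrue, c.Reason, c.Message
		}
	}
	return false, "Deploying", "the deployment has not reported an Available condition yet"
}
```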

I think the status propagation model needs to be documented as several users have asked how to detect failures.

SaschaSchwarze0 commented

Hi @skonto, thanks for checking this issue. I have opened the above commit as a PR: #15503

One thing that is confusing is whether availability should be enough, since a deployment is available when a number of replicas have been ready for minReadySeconds (0 in Knative). In the past, the ContainerHealthy condition was added to signal readiness check completion.

The code change we are proposing is NOT based on .status.availableReplicas of the Deployment, which indeed could match spec.replicas even when some pods are not (anymore) ready. Instead we are using .status.readyReplicas, which afaik means that all pods must be Ready at that moment. (I actually wonder whether one should check updatedReplicas as well.)
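
For illustration, a hedged sketch of that distinction using the standard appsv1 Deployment fields (the helper is made up, not the code in the PR):

```go
package revision

import appsv1 "k8s.io/api/apps/v1"

// allReplicasReady is an illustrative helper: Status.ReadyReplicas counts pods
// that are Ready at this moment, and Status.UpdatedReplicas can additionally
// be required so that those pods run the latest pod template.
func allReplicasReady(d *appsv1.Deployment) bool {
	want := int32(1) // Kubernetes defaults spec.replicas to 1 when unset
	if d.Spec.Replicas != nil {
		want = *d.Spec.Replicas
	}
	return d.Status.ReadyReplicas == want && d.Status.UpdatedReplicas == want
}
```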

I understand that there are reasons to distinguish between resources being available and containers being healthy from an end user's perspective. I think the complexity comes from deployments being scaled up and down by the autoscaler: one would imo not want a Revision to become not ready just because it scales up and the newly added pod is not (yet) scheduled/ready. Without that complexity, calculating the two conditions purely from the Deployment status would be no big deal.

And yeah, I also agree that clear documentation stating the meaning of the conditions and when they do and do not transition would be helpful.
