Improve the wait-for-apiserver ready check #713

mikkeloscar · 2023-11-30T11:13:41Z

With the introduction of Karpenter in #585 we changed the order of the steps the CLM performs. Before it was:

1. Create master and wait for it to be ready (meaning: check that it returns any HTTP response with a status below 500).
2. Create and wait for worker node stacks.
3. Apply manifests.

With Karpenter we changed it to:

1. Create master and wait for it to be ready (meaning: check that it returns any HTTP response with a status below 500).
2. Apply manifests.
3. Create and wait for worker node stacks.

Step 2 regularly fails during cluster creation in e2e with the error:

time="2023-11-28T10:32:13Z" level=debug msg="Waiting for API Server to be reachable"
time="2023-11-28T10:32:33Z" level=warning msg="New cluster (requested), skipping node pool update"
time="2023-11-28T10:32:33Z" level=debug msg="Running PreApply deletions (1)"
time="2023-11-28T10:32:33Z" level=fatal msg="Fail to provision: unable to delete: unable to resolve kind Deployment (use either name or name.version.group)"

I believe this is a symptom of the apiserver not being quite ready yet. Because we run the apply step right after the apiserver first returns a non-5xx response, the apply fails whenever the apiserver is not fully ready. We didn't see this before because, after checking apiserver availability, we waited another 5-10 minutes for the worker node stacks to be created before doing the apply.

This PR aims to fix the issue by not just checking the availability of the apiserver, but ensuring that it responds 200 on the /readyz endpoint. The logic is that if /readyz is returning 200, then it must be fully ready and the apply calls should not fail like they sometimes do right now.

Since the failure is intermittent, it's hard to prove that this fixes it completely, but I have tested that the new check correctly detects when the apiserver is available.

mikkeloscar · 2023-11-30T12:07:57Z

👍

provisioner/clusterpy.go

szuecs · 2023-11-30T15:48:06Z

👍

Signed-off-by: Mikkel Oscar Lyderik Larsen <[email protected]>

mikkeloscar · 2023-12-04T13:21:30Z

👍

szuecs · 2023-12-04T19:54:38Z

👍

mikkeloscar force-pushed the better-wait-for-apiserver-check branch from d0462fe to 22995ce on November 30, 2023 11:36

szuecs reviewed Nov 30, 2023

provisioner/clusterpy.go (outdated, resolved)

Improve the wait-for-apiserver ready check

dec714a

Signed-off-by: Mikkel Oscar Lyderik Larsen <[email protected]>

mikkeloscar force-pushed the better-wait-for-apiserver-check branch from 22995ce to dec714a on December 4, 2023 13:21

szuecs merged commit 08ad0c2 into master Dec 4, 2023
9 checks passed

szuecs deleted the better-wait-for-apiserver-check branch December 4, 2023 19:54
