Improve the wait-for-apiserver ready check #713

Merged
merged 1 commit into from
Dec 4, 2023

Conversation

mikkeloscar
Contributor

With the introduction of Karpenter in #585, we changed the order of the steps the CLM performs. Previously it was:

  1. Create master and wait for it to be ready (that is, until it returns any HTTP response with a status code below 500)
  2. Create and wait for worker node stacks
  3. Apply manifests

With Karpenter we changed it to:

  1. Create master and wait for it to be ready (that is, until it returns any HTTP response with a status code below 500)
  2. Apply manifests
  3. Create and wait for worker node stacks

Step 2 regularly fails during cluster creation in e2e with the error:

time="2023-11-28T10:32:13Z" level=debug msg="Waiting for API Server to be reachable"
time="2023-11-28T10:32:33Z" level=warning msg="New cluster (requested), skipping node pool update"
time="2023-11-28T10:32:33Z" level=debug msg="Running PreApply deletions (1)"
time="2023-11-28T10:32:33Z" level=fatal msg="Fail to provision: unable to delete: unable to resolve kind Deployment (use either name or name.version.group)"

I believe this is a symptom of the apiserver not being fully ready yet. Because we now run the apply step right after the apiserver returns its first response below 500, the apply fails whenever the apiserver is not fully ready. We didn't see this before because, after checking apiserver availability, we waited another 5-10 minutes for worker node stack creation before doing the apply.

This PR aims to fix the issue by not just checking that the apiserver is available, but ensuring that it responds with 200 on the /readyz endpoint. The reasoning is that if /readyz returns 200, the apiserver should be fully ready, and the apply calls should no longer fail intermittently as they do now.

Since the failure only happens intermittently, it's hard to prove that this fixes it in every case, but I have at least tested that the new check correctly detects when the apiserver is available.

@mikkeloscar mikkeloscar force-pushed the better-wait-for-apiserver-check branch from d0462fe to 22995ce Compare November 30, 2023 11:36
@mikkeloscar
Contributor Author

👍

provisioner/clusterpy.go (outdated, resolved)
@szuecs
Member

szuecs commented Nov 30, 2023

👍

Signed-off-by: Mikkel Oscar Lyderik Larsen <[email protected]>
@mikkeloscar mikkeloscar force-pushed the better-wait-for-apiserver-check branch from 22995ce to dec714a Compare December 4, 2023 13:21
@mikkeloscar
Contributor Author

👍

@szuecs
Member

szuecs commented Dec 4, 2023

👍

@szuecs szuecs merged commit 08ad0c2 into master Dec 4, 2023
9 checks passed
@szuecs szuecs deleted the better-wait-for-apiserver-check branch December 4, 2023 19:54