online upgrade with package installation and pending pods #961
Conversation
mainPFacts := r.PFacts[vapi.MainCluster]
// All nodes in the main cluster must be up before running online upgrade
if mainPFacts.getUpNodeCount() != len(mainPFacts.Detail) {
We want all the pods to be running before starting online upgrade. They don't all have to be up.
It doesn't seem to work after changing to countRunningAndInstalled(). When the pending nodes had just become running (but were not yet up), the upgrade process continued on to duplicate the main cluster but returned a VAddnode error. That running node and the newly added nodes never came up.
In that case, don't check the number of up nodes at all. Check whether any pod is not running and requeue if there is one. Then just call the restart reconciler: if any node is down it will start it, and if no node is down it will do nothing.
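A minimal sketch of that approach, matching the func signature used in the reconcile list later in this conversation; the isPodRunning and name podfact fields and the Log field are assumptions, not the actual podfacts API:

// requeuePodsNotRunning requeues the online upgrade until every pod in the
// main cluster is running. Pods do not need to be up; the restart reconciler
// that runs later starts any down node and does nothing otherwise.
func (r *OnlineUpgradeReconciler) requeuePodsNotRunning(ctx context.Context) (ctrl.Result, error) {
	mainPFacts := r.PFacts[vapi.MainCluster]
	for _, pf := range mainPFacts.Detail {
		if !pf.isPodRunning { // assumed podfacts field
			r.Log.Info("Requeue online upgrade: pod is not running", "pod", pf.name)
			return ctrl.Result{Requeue: true}, nil
		}
	}
	return ctrl.Result{}, nil
}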
Thanks for the instruction. It is updated accordingly.
mainPFacts := r.PFacts[vapi.MainCluster]
// All nodes in the main cluster must be up before running online upgrade
if mainPFacts.getUpNodeCount() != len(mainPFacts.Detail) {
	r.VRec.Eventf(r.VDB, corev1.EventTypeWarning, events.NotAllNodesUp,
You don't need an event. Just log something.
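A one-line sketch of that change, assuming the reconciler carries a logr-style Log field (the field name and message wording are assumptions):

r.Log.Info("Not all nodes in the main cluster are up; requeueing online upgrade",
	"upNodes", mainPFacts.getUpNodeCount(), "totalNodes", len(mainPFacts.Detail))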
pkg/vadmin/install_packages_at.go (outdated)

@@ -34,7 +35,8 @@ func (a *Admintools) InstallPackages(ctx context.Context, opts ...installpackage
status := genInstallPackageStatus(stdout)
if err != nil && len(status.Packages) == 0 {
	_, logErr := a.logFailure("install_package", events.InstallPackagesFailed, stdout, err)
	pkgErr := errors.New(err.Error() + "This may due to lack of memory resources.")
Do we have a way to distinguish the cause of the error? Simply adding a string here can be misleading for future debugging.
Thanks. Now it only logs a message if zero packages were installed.
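A rough sketch of that behavior, reusing the lines from the diff above; the a.Log field and the log wording are assumptions, not the merged code:

status := genInstallPackageStatus(stdout)
if err != nil && len(status.Packages) == 0 {
	// Hint at a possible cause only when nothing was installed at all; the
	// original error is left unchanged so the real cause stays visible.
	a.Log.Info("no packages were installed; this may be due to a lack of memory resources")
	_, logErr := a.logFailure("install_package", events.InstallPackagesFailed, stdout, err)
	_ = logErr // downstream handling omitted in this sketch
}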
i.Rec.Eventf(i.Vdb, corev1.EventTypeNormal, events.IncompatibleUpgradeRequested,
	"Requested upgrade is incompatible with the Vertica deployment. Falling back to %s upgrade.", actualUpgradeAsText)

if i.Vdb.Spec.UpgradePolicy == "Online" {
Let's move this before the event recorder.
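A rough sketch of the reordering, based on the lines in the diff above; the body of the Online branch is not shown in the diff and is elided here:

// Handle the Online policy before recording the fallback event so the event
// reflects the final decision.
if i.Vdb.Spec.UpgradePolicy == "Online" {
	// ... handling from the original code, not shown in this diff ...
}
i.Rec.Eventf(i.Vdb, corev1.EventTypeNormal, events.IncompatibleUpgradeRequested,
	"Requested upgrade is incompatible with the Vertica deployment. Falling back to %s upgrade.", actualUpgradeAsText)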
The main branch now has the lint fix.
@@ -165,6 +165,8 @@ func (r *OnlineUpgradeReconciler) Reconcile(ctx context.Context, _ *ctrl.Request
// Functions to perform when the image changes. Order matters.
funcs := []func(context.Context) (ctrl.Result, error){
	// Requeue if not all nodes are running
	r.requeuePodsNotRunning,
Add a status msg in onlineUpgradeStatusMsgs for this step.
Done.
Let's put them after postStartOnlineUpgradeMsg
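A minimal sketch of that placement; only postStartOnlineUpgradeMsg and the "Requeue as pods not running" text (used in the assert later in this thread) come from this review, and declaring the list as a plain string slice is an assumption:

// Status messages surfaced through status.upgradeStatus. Order must match the
// order of the functions run by Reconcile.
var onlineUpgradeStatusMsgs = []string{
	postStartOnlineUpgradeMsg,
	"Requeue as pods not running", // new entry for requeuePodsNotRunning
	// ... remaining step messages unchanged ...
}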
This new test does too many things that are already validated by other tests. I think what you want to check is that we do not get past the first function in online upgrade while some pods are pending. So: make some pods pending, start the online upgrade, wait a little, and check upgradeStatus to verify that we are still stuck in that function; then make the pending pods run and wait until we create the new subclusters. The test can end there.
Good suggestion. I removed the duplicated checks and renamed the test to online-upgrade-pods-pending.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: v-base-upgrade-sec2
status:
  availableReplicas: 0
  replicas: 1
It does not necessarily mean the pods are pending. You need to validate the pod itself:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: v-base-upgrade-sec2
status:
  availableReplicas: 0
  replicas: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: v-base-upgrade-sec2-0
status:
  phase: Pending
Thanks. This is helpful.
Add an assert file where you check if the pod is running
Add an assert that checks the upgradeStatus:
apiVersion: v1
kind: Pod
metadata:
  name: v-base-upgrade-sec2-0
status:
  phase: Pending
---
apiVersion: vertica.com/v1beta1
kind: VerticaDB
metadata:
  name: v-base-upgrade
status:
  upgradeStatus: "Requeue as pods not running"
Updated according to the comments. Please review again.
@@ -24,7 +24,7 @@ metadata:
 spec:
   image: kustomize-vertica-image
   dbName: repUP
-  initPolicy: Create
+  initPolicy: CreateSkipPackageInstall
Why?
Because the test online-upgrade-pods-pending stops at the wait-for-scs-mimic step, package installation (which happens at the step where a new Vertica version is installed) is not covered by the test. So we can either:
- run the test with CreateSkipPackageInstall, or
- add the upgrade steps back to cover the package installation.
Fixes two issues during the online upgrade process. An e2e test was added to cover package installation together with a pending pod.