online upgrade with package installation and pending pods #961

Merged: 32 commits merged into main from qguan/online-upgrade-packages on Oct 30, 2024

Conversation

qindotguan (Collaborator):
Fixes two issues during the online upgrade process:

  1. Try to restart the pending nodes and report a warning event reminding the user to check the DB configuration.
  2. Print a hint that a package installation failure may be due to a lack of memory resources.

An e2e test was added to cover package installation with a pending pod.


mainPFacts := r.PFacts[vapi.MainCluster]
// All nodes in the main cluster must be up before running online upgrade
if mainPFacts.getUpNodeCount() != len(mainPFacts.Detail) {
Collaborator:

We want all the pods to be running before starting online upgrade. They don't all have to be up.
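
For clarity, a rough sketch of the distinction (the accessor name below is an assumption, not the actual podfacts API):

// Illustrative only: "running" means the kubelet has started the pod's containers,
// while "up" means the Vertica node inside it is accepting connections.
running := 0
for _, pf := range mainPFacts.Detail {
    if pf.GetIsPodRunning() { // assumed accessor name
        running++
    }
}
allPodsRunning := running == len(mainPFacts.Detail)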

Collaborator (Author):

It does not seem to work after changing to countRunningAndInstalled(). When the pending nodes had just become running (not up yet), the upgrade process continued to duplicate the main cluster but returned a VAddnode error. That running node and the newly added nodes never came up.

roypaulin (Collaborator), Oct 18, 2024:

In that case, don't even check the number of up nodes. Check whether any pod is not running and requeue if there is. Then just call the restart reconciler: if any node is down it will start it; if no node is down it will do nothing.
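
A rough sketch of that flow (helper and field names are illustrative, not the actual implementation; only requeuePodsNotRunning and PFacts appear elsewhere in this PR):

// requeuePodsNotRunning requeues the online upgrade while any main-cluster pod is
// not running yet; the restart reconciler that runs afterwards starts any down node.
func (r *OnlineUpgradeReconciler) requeuePodsNotRunning(ctx context.Context) (ctrl.Result, error) {
    for _, pf := range r.PFacts[vapi.MainCluster].Detail {
        if !pf.GetIsPodRunning() { // assumed accessor name
            r.Log.Info("Pod not running yet; requeueing online upgrade") // assumed logger field
            return ctrl.Result{Requeue: true}, nil
        }
    }
    return ctrl.Result{}, nil
}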

Collaborator (Author):

Thanks for the instruction. It is updated accordingly.

mainPFacts := r.PFacts[vapi.MainCluster]
// All nodes in the main cluster must be up before running online upgrade
if mainPFacts.getUpNodeCount() != len(mainPFacts.Detail) {
    r.VRec.Eventf(r.VDB, corev1.EventTypeWarning, events.NotAllNodesUp,
Collaborator:

You don't need an event. Just log something.
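
Illustratively, something along these lines in place of the Eventf call (the logger field name is an assumption):

r.Log.Info("Not all pods in the main cluster are running; requeueing online upgrade") // assumed logger field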

@@ -34,7 +35,8 @@ func (a *Admintools) InstallPackages(ctx context.Context, opts ...installpackage

status := genInstallPackageStatus(stdout)
if err != nil && len(status.Packages) == 0 {
    _, logErr := a.logFailure("install_package", events.InstallPackagesFailed, stdout, err)
    pkgErr := errors.New(err.Error() + "This may due to lack of memory resources.")
Collaborator:

Do we have a way to distinguish the cause of the error? Simply adding a string here can be misleading for future debugging.

Collaborator (Author):

Thanks. Now it only logs a message if zero packages were installed.
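
Roughly like this (a sketch; the logger field and wording are assumptions, only the zero-packages condition comes from the diff above):

if err != nil && len(status.Packages) == 0 {
    // Nothing was installed at all; hint at possible memory pressure in the log
    // instead of appending text to the returned error.
    a.Log.Info("No packages were installed; this can be caused by a lack of memory resources",
        "error", err.Error())
}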

i.Rec.Eventf(i.Vdb, corev1.EventTypeNormal, events.IncompatibleUpgradeRequested,
"Requested upgrade is incompatible with the Vertica deployment. Falling back to %s upgrade.", actualUpgradeAsText)

if i.Vdb.Spec.UpgradePolicy == "Online" {
Collaborator:

Let's move this before the event recorder.
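
That is, roughly this ordering (a sketch; the body of the if block is elided):

if i.Vdb.Spec.UpgradePolicy == "Online" {
    // handle the Online-policy case first
}
i.Rec.Eventf(i.Vdb, corev1.EventTypeNormal, events.IncompatibleUpgradeRequested,
    "Requested upgrade is incompatible with the Vertica deployment. Falling back to %s upgrade.", actualUpgradeAsText)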

roypaulin (Collaborator):

The main branch now has the lint fix.

@@ -165,6 +165,8 @@ func (r *OnlineUpgradeReconciler) Reconcile(ctx context.Context, _ *ctrl.Request

// Functions to perform when the image changes. Order matters.
funcs := []func(context.Context) (ctrl.Result, error){
    // Requeue if not all nodes are running
    r.requeuePodsNotRunning,
Collaborator:

Add a status msg in onlineUpgradeStatusMsgs for this step.
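
For instance (a sketch; the surrounding list contents are illustrative):

var onlineUpgradeStatusMsgs = []string{
    // ... messages for the earlier steps
    "Requeue as pods not running",
    // ... messages for the remaining online upgrade steps
}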

Collaborator (Author):

Done.

Collaborator:

Let's put them after postStartOnlineUpgradeMsg

Collaborator:

This new test does too many things already validated by other tests. I think what you want to check is that we do not get past the first function in online upgrade while some pods are pending. So make some pods pending, start the online upgrade, wait a little, and check upgradeStatus to verify that we are still stuck in that function; then make the pending pods run and wait until the new subclusters are created. The test can end there.

Collaborator (Author):

Good suggestion. I removed the duplicated checks and renamed the test to online-upgrade-pods-pending.

Comment on lines 14 to 20
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: v-base-upgrade-sec2
status:
  availableReplicas: 0
  replicas: 1
Collaborator:

It does not necessarily mean the pods are pending. You need to validate the pod itself:

Suggested change

-apiVersion: apps/v1
-kind: StatefulSet
-metadata:
-  name: v-base-upgrade-sec2
-status:
-  availableReplicas: 0
-  replicas: 1
+apiVersion: v1
+kind: Pod
+metadata:
+  name: v-base-upgrade-sec2-0
+status:
+  phase: Pending

Collaborator (Author):

Thanks. This is helpful.

Collaborator:

Add an assert file where you check if the pod is running

Collaborator:

Add an assert that checks the upgradeStatus:

apiVersion: v1
kind: Pod
metadata:
  name: v-base-upgrade-sec2-0
status:
  phase: Pending
---
apiVersion: vertica.com/v1beta1
kind: VerticaDB
metadata:
  name: v-base-upgrade
status:
  upgradeStatus: "Requeue as pods not running"

Collaborator (Author):

Updated according to the comments. Please review again.

@@ -24,7 +24,7 @@ metadata:
 spec:
   image: kustomize-vertica-image
   dbName: repUP
-  initPolicy: Create
+  initPolicy: CreateSkipPackageInstall
Collaborator:

Why?

qindotguan (Collaborator, Author), Oct 29, 2024:

As the test online-upgrade-pods-pending stops at the wait-for-scs-mimic step, package installation (which happens at the step where the new Vertica version is installed) is not covered by the test. So we can either

  1. run the test with CreateSkipPackageInstall, or
  2. add the upgrade steps back to cover the package installation.

qindotguan merged commit a3abb8e into main on Oct 30, 2024
39 checks passed
qindotguan deleted the qguan/online-upgrade-packages branch on October 30, 2024 12:08