Upgrade resty v1 to v2. #424

A-Kamaee · 2024-06-20T14:42:31Z

One-line summary

This PR upgrades resty from v1 to v2 to fix a retry-related bug in es-operator.

Description

After implementing context-aware retries in es-operator, we noticed that the operator could get stuck, preventing cluster scaling. To diagnose the issue, we created an Elasticsearch cluster in our playground environment and successfully reproduced the scenario.

What is the problem?

When a scale-in operation is halted due to a new scale-out request, es-operator can get stuck, posing a significant risk to the system.

Why does es-operator get stuck?

We were using resty/v1.12.0, released on Feb 28, 2019. The project has since released a new version (resty/v2), which is not backward compatible, and we had not upgraded to it. There is a discrepancy between the documentation and the implementation of retry behavior in resty/v1.12.0.

The resty/v1.12.0 documentation for retry condition functions states (link to documentation):

The request will retry if any of the functions return true and error is nil.

Read literally, this means retrying should stop when the retry condition function returns false or when it returns an error. The code does not behave that way: if a retry condition function returns an error, the retry continues (link to implementation):

resp, err = operation()

var needsRetry bool
var conditionErr error
for _, condition := range opts.retryConditions {
    needsRetry, conditionErr = condition(resp)
    if needsRetry || conditionErr != nil {
        break
    }
}

// If the operation returned no error, there was no condition satisfied, and
// there was no error caused by the conditional functions.
if err == nil && !needsRetry && conditionErr == nil {
    return nil
}

It's surprising that such a significant discrepancy exists in an open-source project, but after thorough verification, we are confident this is the case.

How did we solve the problem?

We upgraded resty to the latest version and adjusted our implementation accordingly. Testing the same scenario with the new version resolved the issue.

Types of Changes

Bug fix (non-breaking change that fixes an issue)

A-Kamaee · 2024-06-20T15:10:37Z

I noticed that upgrading resty requires a code refactor before we can use httpmock in our tests. I will add a new commit to resolve the problem in our tests.

operator/es_client.go

A-Kamaee · 2024-06-20T16:35:33Z

Explanation for Commit `Resolve Tests Errors: Use Default HTTP Client with Default HTTP Transport so HTTP Mock Can Intercept Requests`

In resty/v2, creating a new Resty client initializes a new transport, as shown in the createClient function:

if hc.Transport == nil {
    hc.Transport = createTransport(nil)
}

For comparison, you can see the difference in the createClient function between resty/v1 and resty/v2.

In resty/v2, a new transport is created for each client instance instead of using the default transport. This is inefficient, because it prevents the same transport (and its connection pool) from being reused across client instances. It also causes issues with httpmock, which by default only intercepts requests sent through the default HTTP client.

To address these issues, I modified the implementation to use the default HTTP client with the default HTTP transport when creating a new Resty client. This ensures efficient transport reuse and allows httpmock to properly intercept the requests.

girishc13 · 2024-06-21T07:03:38Z

operator/es_client.go

+	// Counter to track the number of retries
+	retryCount := 0
+
+	_, err := resty.NewWithClient(&http.Client{Transport: http.DefaultTransport}).
 		SetRetryCount(c.DrainingConfig.MaxRetries).


The default max retries is 999, which is a bit too much. Let's reduce it and move to a more fail-fast strategy to avoid a prolonged retry loop. Even if the current loop fails, the operator should ideally pick up where it left off.

It is configurable, right? We can lower this for our own deployment.

otrosien · 2024-06-21T10:22:19Z

👍

hooseins · 2024-06-21T10:57:38Z

I would suggest adding a unit test that sets expectations for the output of the Drain() function when the context is canceled. #424 (comment) I know this change is not really about the output of the function (it is about fixing the stuck operator), but we didn't have unit tests in #405, and Drain() might return something different in this PR than expected. Even if it doesn't, I think it would be good to have one for completeness.

A-Kamaee · 2024-06-25T11:31:23Z

👍

A-Kamaee · 2024-06-25T11:53:23Z

👍

A-Kamaee requested review from mikkeloscar and otrosien as code owners June 20, 2024 14:42

BugFix: "github.com/go-resty/resty/v1" to "github.com/go-resty/resty/…

2097007

…v2". Signed-off-by: Abouzar Kamaee <[email protected]>

A-Kamaee force-pushed the resolve-context-canceled-retry-error branch from cbeccc4 to 2097007 Compare June 20, 2024 14:46

A-Kamaee marked this pull request as draft June 20, 2024 15:11

hooseins reviewed Jun 20, 2024

operator/es_client.go (conversation resolved)

Resolve Tests Errors: Use default HTTP Client with a default HTTP Tra…

7b3a86e

…nsport so HTTP Mock can intercept requests. Signed-off-by: Abouzar Kamaee <[email protected]>

A-Kamaee force-pushed the resolve-context-canceled-retry-error branch from 219444c to 7b3a86e Compare June 20, 2024 16:23

Resolve error in tests: 5 retries in resty/v2 correctly mean the func…

bbf2ead

…tion should be called 6 times in total, while in resty/v1 it meant the function should have been called 5 times. Signed-off-by: Abouzar Kamaee <[email protected]>

A-Kamaee force-pushed the resolve-context-canceled-retry-error branch from 709a57d to bbf2ead Compare June 20, 2024 16:41

A-Kamaee marked this pull request as ready for review June 20, 2024 16:42

Run goimport.

599c9ab

Signed-off-by: Abouzar Kamaee <[email protected]>

girishc13 reviewed Jun 21, 2024

A-Kamaee merged commit f095fbc into zalando-incubator:master Jun 25, 2024
4 checks passed
