Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade resty v1 to v2. #424

Conversation

A-Kamaee
Copy link
Contributor

One-line summary

This PR upgrades resty from v1 to v2 to fix a retry-related bug in es-operator.

Description

After implementing context-aware retries in es-operator, we noticed that the operator could get stuck, preventing cluster scaling. To diagnose the issue, we created an Elasticsearch cluster in our playground environment and successfully reproduced the scenario.

What is the problem?

When a scale-in operation is halted due to a new scale-out request, es-operator can get stuck, posing a significant risk to the system.

Why does es-operator get stuck?

We were using resty/v1.12.0, released on Feb 28, 2019. The project has since released a new version (resty/v2), which is not backward compatible, and we had not upgraded to it. There is a discrepancy between the documentation and the implementation of retry behavior in resty/v1.12.0.

According to the resty/v1.12.0 documentation on the retry function condition: Link to documentation

The request will retry if any of the functions return true and error is nil.

This means the retry should stop if the retry condition function returns false or if there is an error. However, upon reviewing the code, it turns out the documentation is incorrect. If a retry condition function returns an error, the retry continues: Link to Implementation

resp, err = operation()

var needsRetry bool
var conditionErr error
for _, condition := range opts.retryConditions {
    needsRetry, conditionErr = condition(resp)
    if needsRetry || conditionErr != nil {
        break
    }
}

// If the operation returned no error, there was no condition satisfied, and
// there was no error caused by the conditional functions.
if err == nil && !needsRetry && conditionErr == nil {
    return nil
}

It's surprising that such a significant discrepancy exists in an open-source project, but after thorough verification, we are confident this is the case.

How did we solve the problem?

We upgraded resty to the latest version and adjusted our implementation accordingly. Testing the same scenario with the new version resolved the issue.

Types of Changes

  • Bug fix (non-breaking change that fixes an issue)

@A-Kamaee A-Kamaee force-pushed the resolve-context-canceled-retry-error branch from cbeccc4 to 2097007 Compare June 20, 2024 14:46
@A-Kamaee
Copy link
Contributor Author

I noticed upgrade resty needs a refactor in code to be able to use httpmock in our test. I will add a new commit to resolve problem in our tests.

@A-Kamaee A-Kamaee marked this pull request as draft June 20, 2024 15:11
…nsport so HTTP Mock can intercept requests.

Signed-off-by: Abouzar Kamaee <[email protected]>
@A-Kamaee A-Kamaee force-pushed the resolve-context-canceled-retry-error branch from 219444c to 7b3a86e Compare June 20, 2024 16:23
@A-Kamaee
Copy link
Contributor Author

A-Kamaee commented Jun 20, 2024

Explanation for Commit Resolve Tests Errors: Use Default HTTP Client with Default HTTP Transport so HTTP Mock Can Intercept Requests

In resty/v2, creating a new Resty client initializes a new transport, as shown in the createClient function:

if hc.Transport == nil {
    hc.Transport = createTransport(nil)
}

For comparison, you can see the difference in the createClient function between resty/v1 and resty/v2.

In resty/v2, instead of using the default transport, a new transport is created for each client instance. This change is not efficient, as it prevents the reuse of the same transport across multiple requests. Additionally, it causes issues with httpmock, which can only intercept requests sent through the default HTTP client.

To address these issues, I modified the implementation to use the default HTTP client with the default HTTP transport when creating a new Resty client. This ensures efficient transport reuse and allows httpmock to properly intercept the requests.

…tion should be called 6 times in total while in resty/v1 it meant the function should has been called 5 times.

Signed-off-by: Abouzar Kamaee <[email protected]>
@A-Kamaee A-Kamaee force-pushed the resolve-context-canceled-retry-error branch from 709a57d to bbf2ead Compare June 20, 2024 16:41
@A-Kamaee A-Kamaee marked this pull request as ready for review June 20, 2024 16:42
Signed-off-by: Abouzar Kamaee <[email protected]>
// Counter to track the number of retries
retryCount := 0

_, err := resty.NewWithClient(&http.Client{Transport: http.DefaultTransport}).
SetRetryCount(c.DrainingConfig.MaxRetries).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default max retries is 999 which is a bit too much. Let's reduce it to a more fail fast strategy to avoid elongated loop of retries. Even if the current loop fails, the operator must ideally pick up where it left off.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is configurable right? We can lower this for our own deployment.

@otrosien
Copy link
Member

👍

@hooseins
Copy link

hooseins commented Jun 21, 2024

I would suggest having a unit test to set the expectations around the output of the Drain() function when the context is canceled. #424 (comment) I know this change is not actually about the output of the function (as this is about fixing the stuck operator) but we didn't have unit tests in #405 and Drain() might have a different output in this PR than expected. Even if not I think It would be good to have one for completeness

@A-Kamaee
Copy link
Contributor Author

👍

1 similar comment
@A-Kamaee
Copy link
Contributor Author

👍

@A-Kamaee A-Kamaee merged commit f095fbc into zalando-incubator:master Jun 25, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants