Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug fix: service will never be deregisterd if the status is critical … #101

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

varnson
Copy link
Contributor

@varnson varnson commented Dec 24, 2020

Step::
1, A service is in critical status.
2,start up the esm.
3, the critical will never be deregistered.

This bug is since 0350cc6

varnson added 2 commits December 24, 2020 16:07
failureCounter and successCounter should count when HealthCritical or HealthPassing occured continuously.
@findkim findkim added the bug label Jan 5, 2021
@findkim findkim requested a review from a team January 5, 2021 21:38
// Do nothing if update is idempotent
if check.Status == status && check.Output == output {
check.failureCounter = decrementCounter(check.failureCounter)
check.successCounter = decrementCounter(check.successCounter)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the only 2 uses of decrementCounter(). Should probably nuke it as well.

@@ -319,6 +314,10 @@ func (c *CheckRunner) UpdateCheck(checkID types.CheckID, status, output string)
delete(c.checksCritical, checkID)
}

// Do nothing if update is idempotent
if check.Status == status && check.Output == output {
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be sure I'm understanding things... moving this return is the 'fix'. So it doesn't skip things between the old return and this one? Wanted to be sure as much of this just seems like a refactor/cleanup but it is fixing something and how this fixes it wasn't called out anywhere (or I missed it).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1, the bug is , if check.Status is cirtical at first(which esm got from consul-agent), and esm checked it again and found it's cirtical, then this process will return. But in fact we need to do something with it.
2, I had checked that this bug is since 0350cc6. I just revert the code to before commit 0350cc6.

@lornasong
Copy link
Member

@varnson thanks for creating this PR. Would you be able to elaborate more on the steps you used to reproduce this bug? I was not able to reproduce it. Here is what I did:

  1. Run consul agent -dev
  2. Register an HTTP health check with critical status, 30s de-registration time, for a URL that will not return OK using PUT http://localhost:8500/v1/catalog/register:
{
    "Node": "http_node",
    "Address": "http://localhost:8400",
    "Service": {
        "ID": "http_service",
        "Service": "http_service"
    },
    "NodeMeta": {
        "external-node": "true",
        "external-probe": "false"
    },
    "Check": {
        "Node": "http_node",
        "Name": "http_esm_check",
        "Status": "critical",
        "ServiceID": "http_service",
        "Definition": {
            "HTTP": "http://localhost:8400/health",
            "Interval": "10s",
            "Timeout": "10s",
            "Method": "GET",
            "DeregisterCriticalServiceAfter": "30s"
        }
    }
}
  1. Observe check + service in Consul UI: http://localhost:8500/ui/dc1/services
  2. Run consul-esm
  3. Wait and see in logs that critical health check is deregistered:
2021/01/06 12:48:58 [INFO] agent: Check "http_node/http_service/http_esm_check" for service "http_service" has been critical for too long (31.916567057s | timeout: 30s); deregistered service
  1. Observe check + service are removed from Consul UI: http://localhost:8500/ui/dc1/services

Am I misunderstanding the issue or do you different steps/examples to reproduce this?

@varnson
Copy link
Contributor Author

varnson commented Jan 7, 2021

@lornasong Sorry for My comment is not clear enough.
the condition is (if check.Status == status && check.Output == output )
At your case ,status is critical, but output may not been equal.

At my case, I register a service, start esm, esm update the status and output to consul-agent, and I restart the esm.
Then the status and output are all the same, and will never be deregistered.

@lornasong
Copy link
Member

lornasong commented Jan 7, 2021

@varnson I was able to reproduce your issue by setting the Output field on the 'Check Definition' when registering a health check. Before, I was not able to reproduce because it looks like there is a separate issue with service health checks where the latest output is not getting stored in Consul. I’ll create an issue for this. UPDATE: after looking into the issue of latest output not getting stored in Consul, this looks to be intentional behavior due to an optimization to delay updates if the status is unchanged.

I understand what you're saying: The status is critical but the latest status and output are the same as what is stored in Consul. Then the check cannot reach c.checksCritical[checkID] = time.Now()and therefore cannot be deregistered. Thanks!

I noticed you changed TestCheck_NoFlapping with a different behavior. This test is for the anti-flapping feature. If possible, we would like a fix that does not change this feature's behavior. I have not looked into this yet, but let me know if you have ideas to fix this!

@varnson
Copy link
Contributor Author

varnson commented Jan 9, 2021

@lornasong About anti-flapping feature, I hadn't read the consul source very carefully, but I found the source like this:
https://github.com/hashicorp/consul/blob/master/agent/checks/check.go#L828
This source is very like my source for this fix.
I think that the description of https://github.com/hashicorp/consul-esm#threshold-for-updating-check-status maybe not correct.
PASS and FAIL should be count continually, if PASS -> PASS ->FAIL, then count of PASS should be 0, FAIL count should be 1.
I changed the source and test in this way, but as your said, the behavior was changed.
But I worried the behavior of consul-esm maybe different with consul, and my change will fix it.
If I misunderstanding something, please let me know. Thanks.

@lornasong
Copy link
Member

@varnson, yes, you're correct that the consul-esm anti-flapping behavior is different from consul. This is actually intentional.

You can read about the decision to implement a new and different behavior: #78 (comment). You can also read about the benefits of this different behavior: #78 (comment).

The documentation of consul-esm's anti-flapping behavior https://github.com/hashicorp/consul-esm#threshold-for-updating-check-status should be accurate for the different implementation. If this is not the case, please feel free to open another issue for it!

Let me know if you have any questions. Thanks.

@varnson
Copy link
Contributor Author

varnson commented Jan 15, 2021

@lornasong I have understand what you means, but it seems difficult to fix it.

@hashicorp-cla
Copy link

hashicorp-cla commented Mar 12, 2022

CLA assistant check
All committers have signed the CLA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants