bug fix: service will never be deregisterd if the status is critical … #101

varnson · 2020-12-24T08:11:23Z

Step::
1, A service is in critical status.
2,start up the esm.
3, the critical will never be deregistered.

This bug is since 0350cc6

…when esm startup.

failureCounter and successCounter should count when HealthCritical or HealthPassing occured continuously.

eikenb · 2021-01-05T22:22:09Z

check.go

-	// Do nothing if update is idempotent
-	if check.Status == status && check.Output == output {
-		check.failureCounter = decrementCounter(check.failureCounter)
-		check.successCounter = decrementCounter(check.successCounter)


These are the only 2 uses of decrementCounter(). Should probably nuke it as well.

eikenb · 2021-01-05T22:26:55Z

check.go

@@ -319,6 +314,10 @@ func (c *CheckRunner) UpdateCheck(checkID types.CheckID, status, output string)
 		delete(c.checksCritical, checkID)
 	}

+	// Do nothing if update is idempotent
+	if check.Status == status && check.Output == output {
+		return


Just to be sure I'm understanding things... moving this return is the 'fix'. So it doesn't skip things between the old return and this one? Wanted to be sure as much of this just seems like a refactor/cleanup but it is fixing something and how this fixes it wasn't called out anywhere (or I missed it).

1, the bug is , if check.Status is cirtical at first(which esm got from consul-agent), and esm checked it again and found it's cirtical, then this process will return. But in fact we need to do something with it.
2, I had checked that this bug is since 0350cc6. I just revert the code to before commit 0350cc6.

lornasong · 2021-01-06T20:07:37Z

@varnson thanks for creating this PR. Would you be able to elaborate more on the steps you used to reproduce this bug? I was not able to reproduce it. Here is what I did:

Run consul agent -dev
Register an HTTP health check with critical status, 30s de-registration time, for a URL that will not return OK using PUT http://localhost:8500/v1/catalog/register:

{
    "Node": "http_node",
    "Address": "http://localhost:8400",
    "Service": {
        "ID": "http_service",
        "Service": "http_service"
    },
    "NodeMeta": {
        "external-node": "true",
        "external-probe": "false"
    },
    "Check": {
        "Node": "http_node",
        "Name": "http_esm_check",
        "Status": "critical",
        "ServiceID": "http_service",
        "Definition": {
            "HTTP": "http://localhost:8400/health",
            "Interval": "10s",
            "Timeout": "10s",
            "Method": "GET",
            "DeregisterCriticalServiceAfter": "30s"
        }
    }
}

Observe check + service in Consul UI: http://localhost:8500/ui/dc1/services
Run consul-esm
Wait and see in logs that critical health check is deregistered:

2021/01/06 12:48:58 [INFO] agent: Check "http_node/http_service/http_esm_check" for service "http_service" has been critical for too long (31.916567057s | timeout: 30s); deregistered service

Observe check + service are removed from Consul UI: http://localhost:8500/ui/dc1/services

Am I misunderstanding the issue or do you different steps/examples to reproduce this?

varnson · 2021-01-07T02:01:44Z

@lornasong Sorry for My comment is not clear enough.
the condition is (if check.Status == status && check.Output == output )
At your case ,status is critical, but output may not been equal.

At my case, I register a service, start esm, esm update the status and output to consul-agent, and I restart the esm.
Then the status and output are all the same, and will never be deregistered.

lornasong · 2021-01-07T18:18:33Z

@varnson I was able to reproduce your issue by setting the Output field on the 'Check Definition' when registering a health check. Before, I was not able to reproduce because it looks like there is a separate issue with service health checks where the latest output is not getting stored in Consul. ~~I’ll create an issue for this.~~ UPDATE: after looking into the issue of latest output not getting stored in Consul, this looks to be intentional behavior due to an optimization to delay updates if the status is unchanged.

I understand what you're saying: The status is critical but the latest status and output are the same as what is stored in Consul. Then the check cannot reach c.checksCritical[checkID] = time.Now()and therefore cannot be deregistered. Thanks!

I noticed you changed TestCheck_NoFlapping with a different behavior. This test is for the anti-flapping feature. If possible, we would like a fix that does not change this feature's behavior. I have not looked into this yet, but let me know if you have ideas to fix this!

varnson · 2021-01-09T08:12:05Z

@lornasong About anti-flapping feature, I hadn't read the consul source very carefully, but I found the source like this:
https://github.com/hashicorp/consul/blob/master/agent/checks/check.go#L828
This source is very like my source for this fix.
I think that the description of https://github.com/hashicorp/consul-esm#threshold-for-updating-check-status maybe not correct.
PASS and FAIL should be count continually, if PASS -> PASS ->FAIL, then count of PASS should be 0, FAIL count should be 1.
I changed the source and test in this way, but as your said, the behavior was changed.
But I worried the behavior of consul-esm maybe different with consul, and my change will fix it.
If I misunderstanding something, please let me know. Thanks.

lornasong · 2021-01-13T15:50:35Z

@varnson, yes, you're correct that the consul-esm anti-flapping behavior is different from consul. This is actually intentional.

You can read about the decision to implement a new and different behavior: #78 (comment). You can also read about the benefits of this different behavior: #78 (comment).

The documentation of consul-esm's anti-flapping behavior https://github.com/hashicorp/consul-esm#threshold-for-updating-check-status should be accurate for the different implementation. If this is not the case, please feel free to open another issue for it!

Let me know if you have any questions. Thanks.

varnson · 2021-01-15T13:30:45Z

@lornasong I have understand what you means, but it seems difficult to fix it.

hashicorp-cla · 2022-03-12T17:20:46Z

All committers have signed the CLA.

varnson added 2 commits December 24, 2020 16:07

bug fix: service will never be deregisterd if the status is critical …

7c4ba96

…when esm startup.

fix check_test.go

9ee9dfb

failureCounter and successCounter should count when HealthCritical or HealthPassing occured continuously.

findkim added the bug label Jan 5, 2021

findkim requested a review from a team January 5, 2021 21:38

eikenb reviewed Jan 5, 2021

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug fix: service will never be deregisterd if the status is critical … #101

bug fix: service will never be deregisterd if the status is critical … #101

varnson commented Dec 24, 2020

eikenb Jan 5, 2021

eikenb Jan 5, 2021

varnson Jan 6, 2021

lornasong commented Jan 6, 2021

varnson commented Jan 7, 2021

lornasong commented Jan 7, 2021 •

edited

Loading

varnson commented Jan 9, 2021 •

edited

Loading

lornasong commented Jan 13, 2021

varnson commented Jan 15, 2021

hashicorp-cla commented Mar 12, 2022 •

edited

Loading

bug fix: service will never be deregisterd if the status is critical … #101

Are you sure you want to change the base?

bug fix: service will never be deregisterd if the status is critical … #101

Conversation

varnson commented Dec 24, 2020

eikenb Jan 5, 2021

Choose a reason for hiding this comment

eikenb Jan 5, 2021

Choose a reason for hiding this comment

varnson Jan 6, 2021

Choose a reason for hiding this comment

lornasong commented Jan 6, 2021

varnson commented Jan 7, 2021

lornasong commented Jan 7, 2021 • edited Loading

varnson commented Jan 9, 2021 • edited Loading

lornasong commented Jan 13, 2021

varnson commented Jan 15, 2021

hashicorp-cla commented Mar 12, 2022 • edited Loading

lornasong commented Jan 7, 2021 •

edited

Loading

varnson commented Jan 9, 2021 •

edited

Loading

hashicorp-cla commented Mar 12, 2022 •

edited

Loading