Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make PostingsForMatchersCache.PostingsForMatchers respect context cancellation #551

Closed

Conversation

aknuds1
Copy link
Contributor

@aknuds1 aknuds1 commented Oct 24, 2023

Make PostingsForMatchersCache.PostingsForMatchers respect context cancellation when executing queries (to populate cache), by executing them in a background goroutine taking a dedicated context, which gets canceled once no PostingsForMatchers caller needs it any more.

Fixes #548.

TODO:

  • Add test exercising concurrent access, where most accesses should use the cache

@aknuds1 aknuds1 force-pushed the arve/postings-for-matchers-cache-handle-cancellation branch from 3d8194b to 750587f Compare October 24, 2023 14:44
@aknuds1 aknuds1 added the bug Something isn't working label Oct 24, 2023
Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the work. I'm not super sure about the linked list traversal. I think if we change the linked list traversal, then we'll also take care of the race with the producer.

tsdb/postings_for_matchers_cache.go Show resolved Hide resolved
tsdb/postings_for_matchers_cache.go Outdated Show resolved Hide resolved
c.cachedMtx.Unlock()

// Wait for query execution goroutine to finish
<-promise.done
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the reason for waiting here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's not necessary any longer, have to review it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it shouldn't be necessary any longer (it was in a previous revision).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, after removing this line I get a data race in tests. Not sure why yet.

Copy link
Contributor Author

@aknuds1 aknuds1 Oct 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One reason to read from promise.done here is that otherwise the background goroutine will block on sending to the channel.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Then we're holding the whole request until the PostingsForMatchers finishes. Can we have a select with default or with case <-ctx.Done() when sending to promise.done? Then we can immediately return the cancelled request and let the background goroutine clean up when it's ready

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can solve the blocking on promise.done by moving the call to c.created to the goroutine, and promise.done is used only to indicate completion (see comment below) - then promise.done is only ever closed, and never sent to.

return nil, ctx.Err()
case <-p.done:
case <-promise.done:
// Checking context error is necessary for deterministic tests,
// as channel selection order is random
if ctx.Err() != nil {
return nil, ctx.Err()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick but let's warp the error with something. Chasing plain context canceled isn't fun.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment applies above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, PTAL. I also added a cause to the cancellation, to ease diagnosis.

Comment on lines 172 to 179
for elem := c.cached.Front(); elem != nil; elem = elem.Next() {
cc := elem.Value.(*postingsForMatchersCachedCall)
if cc.key == key {
c.cached.Remove(elem)
c.cachedBytes -= cc.sizeBytes
break
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid this can be slow. Is there a way to make sure we don't insert it into the linked list if all the contexts are cancelled?

perhaps have a channel to which the producer writes an empty struct when it's done with producing the result. Any consumer will take this empty struct and commit to consuming the result. If there is no consumer waiting on the channel, then the producer just discards the result.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually more or less the same idea came to when reviewing the code this morning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried implementing this idea @dimitarvdimitrov, by sending the cache entry size over promise.done for a consumer to pick up and commit, could you have a look?

Copy link
Contributor

@charleskorn charleskorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite a complex change - how necessary is it that we cancel postings for matchers calls if the caller cancels? (How expensive are these calls?) And how often does the caller cancel?

I'm wondering if we can just accept that occasionally we'll waste some cycles completing a postings for matchers call for a caller that has already cancelled.

@aknuds1
Copy link
Contributor Author

aknuds1 commented Oct 25, 2023

I'm wondering if we can just accept that occasionally we'll waste some cycles completing a postings for matchers call for a caller that has already cancelled.

@charleskorn as discussed offline, I see your concern, and I think we'll need to analyze the necessity of being able to cancel executing queries in the light of past issues at Grafana Labs related to this.

Copy link
Contributor

@charleskorn charleskorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is WIP, but I'd suggest adding some unit tests for the different cases as a next step, to help flush out some of the subtle bugs that might be introduced before we refactor and address other issues we've noticed.

Comment on lines +187 to +199
if sizeBytes > 0 {
// Got the promise cache entry's size, store it
c.created(key, c.timeNow(), sizeBytes)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't the call to c.created be made in the goroutine spawned by PostingsForMatchers? Then done can simply be closed to signal that the promise has completed

c.calls[key] = promise
c.cachedMtx.Unlock()

go func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move this to its own method on postingsForMatchersPromise perhaps? This method is starting to get quite big.


// There are no more waiting goroutines, cancel the promise and remove it from the cache
promise.cancel(fmt.Errorf("no remaining callers interested in query"))
delete(c.calls, key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is quite right - calling delete here will remove the promise from the cache if a subsequent call's context is cancelled, even if the promise succeeded. Imagine this sequence of events:

  1. Initial call for a set of matchers arrives, promise succeeds and value is cached and returned.
  2. Subsequent call arrives with a context that is already cancelled. Cached promise is retrieved from cache, waiting is incremented to 1 and then waitOnPromise is called.
  3. waitOnPromise observes cancelled context, decrements waiting to 0 and removes cached promise from cache.

Instead, we only want to cancel and remove the call from c.calls if there's nothing waiting and the call is still in progress.

c.cachedMtx.Unlock()

// Wait for query execution goroutine to finish
<-promise.done
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can solve the blocking on promise.done by moving the call to c.created to the goroutine, and promise.done is used only to indicate completion (see comment below) - then promise.done is only ever closed, and never sent to.

@aknuds1 aknuds1 force-pushed the arve/postings-for-matchers-cache-handle-cancellation branch from f1040da to 9ac9d68 Compare November 14, 2023 09:26
@pracucci
Copy link
Collaborator

pracucci commented Jul 4, 2024

Superseded by #644

@pracucci pracucci closed this Jul 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Follow up on known PostingsForMatchersCache issues
4 participants