-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make PostingsForMatchersCache.PostingsForMatchers
respect context cancellation
#551
Make PostingsForMatchersCache.PostingsForMatchers
respect context cancellation
#551
Conversation
3d8194b
to
750587f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the work. I'm not super sure about the linked list traversal. I think if we change the linked list traversal, then we'll also take care of the race with the producer.
c.cachedMtx.Unlock() | ||
|
||
// Wait for query execution goroutine to finish | ||
<-promise.done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the reason for waiting here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's not necessary any longer, have to review it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it shouldn't be necessary any longer (it was in a previous revision).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, after removing this line I get a data race in tests. Not sure why yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One reason to read from promise.done
here is that otherwise the background goroutine will block on sending to the channel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Then we're holding the whole request until the PostingsForMatchers finishes. Can we have a select
with default
or with case <-ctx.Done()
when sending to promise.done
? Then we can immediately return the cancelled request and let the background goroutine clean up when it's ready
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can solve the blocking on promise.done
by moving the call to c.created
to the goroutine, and promise.done
is used only to indicate completion (see comment below) - then promise.done
is only ever closed, and never sent to.
tsdb/postings_for_matchers_cache.go
Outdated
return nil, ctx.Err() | ||
case <-p.done: | ||
case <-promise.done: | ||
// Checking context error is necessary for deterministic tests, | ||
// as channel selection order is random | ||
if ctx.Err() != nil { | ||
return nil, ctx.Err() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick but let's warp the error with something. Chasing plain context canceled
isn't fun.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment applies above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, PTAL. I also added a cause to the cancellation, to ease diagnosis.
tsdb/postings_for_matchers_cache.go
Outdated
for elem := c.cached.Front(); elem != nil; elem = elem.Next() { | ||
cc := elem.Value.(*postingsForMatchersCachedCall) | ||
if cc.key == key { | ||
c.cached.Remove(elem) | ||
c.cachedBytes -= cc.sizeBytes | ||
break | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid this can be slow. Is there a way to make sure we don't insert it into the linked list if all the contexts are cancelled?
perhaps have a channel to which the producer writes an empty struct when it's done with producing the result. Any consumer will take this empty struct and commit to consuming the result. If there is no consumer waiting on the channel, then the producer just discards the result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually more or less the same idea came to when reviewing the code this morning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried implementing this idea @dimitarvdimitrov, by sending the cache entry size over promise.done
for a consumer to pick up and commit, could you have a look?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quite a complex change - how necessary is it that we cancel postings for matchers calls if the caller cancels? (How expensive are these calls?) And how often does the caller cancel?
I'm wondering if we can just accept that occasionally we'll waste some cycles completing a postings for matchers call for a caller that has already cancelled.
@charleskorn as discussed offline, I see your concern, and I think we'll need to analyze the necessity of being able to cancel executing queries in the light of past issues at Grafana Labs related to this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is WIP, but I'd suggest adding some unit tests for the different cases as a next step, to help flush out some of the subtle bugs that might be introduced before we refactor and address other issues we've noticed.
if sizeBytes > 0 { | ||
// Got the promise cache entry's size, store it | ||
c.created(key, c.timeNow(), sizeBytes) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't the call to c.created
be made in the goroutine spawned by PostingsForMatchers
? Then done
can simply be closed to signal that the promise has completed
c.calls[key] = promise | ||
c.cachedMtx.Unlock() | ||
|
||
go func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we move this to its own method on postingsForMatchersPromise
perhaps? This method is starting to get quite big.
|
||
// There are no more waiting goroutines, cancel the promise and remove it from the cache | ||
promise.cancel(fmt.Errorf("no remaining callers interested in query")) | ||
delete(c.calls, key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is quite right - calling delete
here will remove the promise from the cache if a subsequent call's context is cancelled, even if the promise succeeded. Imagine this sequence of events:
- Initial call for a set of matchers arrives, promise succeeds and value is cached and returned.
- Subsequent call arrives with a context that is already cancelled. Cached promise is retrieved from cache,
waiting
is incremented to 1 and thenwaitOnPromise
is called. waitOnPromise
observes cancelled context, decrementswaiting
to 0 and removes cached promise from cache.
Instead, we only want to cancel and remove the call from c.calls
if there's nothing waiting and the call is still in progress.
c.cachedMtx.Unlock() | ||
|
||
// Wait for query execution goroutine to finish | ||
<-promise.done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can solve the blocking on promise.done
by moving the call to c.created
to the goroutine, and promise.done
is used only to indicate completion (see comment below) - then promise.done
is only ever closed, and never sent to.
…ation Signed-off-by: Arve Knudsen <[email protected]>
Signed-off-by: Arve Knudsen <[email protected]>
Signed-off-by: Arve Knudsen <[email protected]>
Signed-off-by: Arve Knudsen <[email protected]>
f1040da
to
9ac9d68
Compare
Superseded by #644 |
Make
PostingsForMatchersCache.PostingsForMatchers
respect context cancellation when executing queries (to populate cache), by executing them in a background goroutine taking a dedicated context, which gets canceled once noPostingsForMatchers
caller needs it any more.Fixes #548.
TODO: