Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ingester.QueryStream: Add support for ignoring context cancellation for chunk queriers #6408

Merged
merged 15 commits into from
Oct 19, 2023

Conversation

aknuds1
Copy link
Contributor

@aknuds1 aknuds1 commented Oct 17, 2023

What this PR does

In Ingester.QueryStream, if experimental flag -ingester.chunks-query-ignore-cancellation is enabled, ignore context cancellation wrt. chunk queriers, the way it worked prior to #6085. If the context is canceled though (or the error returned from Prometheus stems from a context error), emit a debug log message/trace span with full diagnostics for troubleshooting.

This is meant as a temporary "feature", while we troubleshoot unexpected Ingester.QueryStream cancellations.

I attempted adding a test, but realized that if I call Ingester.QueryStream with a canceled stream context, the method fails independently of the querier since responding via the stream will fail. Therefore writing a test is really tricky, and would take some time (would need some sort of stubbing most likely).

Which issue(s) this PR fixes or relates to

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@aknuds1 aknuds1 added bug Something isn't working component/ingester labels Oct 17, 2023
@aknuds1 aknuds1 changed the title Ingester.QueryStream: Ignore context cancellation for chunk queriers WIP: Ingester.QueryStream: Ignore context cancellation for chunk queriers Oct 17, 2023
@aknuds1 aknuds1 changed the title WIP: Ingester.QueryStream: Ignore context cancellation for chunk queriers Ingester.QueryStream: Ignore context cancellation for chunk queriers Oct 17, 2023
@aknuds1 aknuds1 force-pushed the arve/chunk-query-ignore-ctx branch 2 times, most recently from fb959b4 to 734d491 Compare October 17, 2023 17:23
pkg/ingester/ingester.go Outdated Show resolved Hide resolved
Copy link
Contributor

@treid314 treid314 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine as a test just to allow us to see that original error, but wouldn't the underlying query still fail?

@aknuds1
Copy link
Contributor Author

aknuds1 commented Oct 18, 2023

I think this is fine as a test just to allow us to see that original error, but wouldn't the underlying query still fail?

@treid314 you're right, I actually realized the same thing last night. The drawback is that we don't get at the original error, unless it's from the stream context, but we don't want to fail the query either. I will revert to passing an independent context (context.Background()) instead. I will also put a flag around this behaviour, and make sure there are tests.

In Ingester.QueryStream, ignore context cancellation wrt. chunk
queriers, the way it worked prior to
#6085. If the context is canceled
though, log it with full diagnostics.

Signed-off-by: Arve Knudsen <[email protected]>
@@ -1625,6 +1627,8 @@ func (i *Ingester) QueryStream(req *client.QueryRequest, stream client.Ingester_
spanlog, ctx := spanlogger.NewWithLogger(stream.Context(), i.logger, "Ingester.QueryStream")
defer spanlog.Finish()

deadline, deadlineSet := ctx.Deadline()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] I would move it before the call to dumpContextError()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved it into dumpContextError since I don't see any downside to doing so (and the code gets simpler).

} else {
timeout = "not set"
}
level.Warn(spanlog).Log("msg", "query context error", "cause", context.Cause(ctx), "timeout", timeout,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The querier legitimately cancel the context once it reaches the quorum (e.g. send request to all ingester zones, cancel outstanding requests once reached successful answers from 2 out of 3 zones). I think this will generate a lot of noise in the logs. I suggest you to to put this debug info only in the tracing span (and not log it as well), if your purpose will be to investigate it through tracing anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit difficult to say before we try in practice (I'm more used to logs than traces), but obtaining this info from traces is probably enough.

Copy link
Contributor Author

@aknuds1 aknuds1 Oct 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pracucci I've changed to level.Debug instead, since that what happens elsewhere in Ingester.QueryStream, and it should cause only a trace to be emitted unless debug logs are enabled. That should be sufficient, or would it be too noisy even for debug logs?

@@ -2006,7 +2030,7 @@ func (i *Ingester) sendStreamingQuerySeries(ctx context.Context, q storage.Chunk

// Ensure no error occurred while iterating the series set.
if err := ss.Err(); err != nil {
return nil, 0, err
return nil, 0, errors.Wrap(err, "iterating ChunkSeriesSet")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] I think calling it "iterating" is a bit misleading. I understand this is coming from the comment above, but in practice what's happening here is that we're "sending" not "iterating".

Same comment applies above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't actually looking at the comment, I was just relating to the fact that ss.Err() returns any error having occurred during iteration of the ss iterator. I don't understand how "sending" would be more descriptive, since any error from ss.Err() would be from failed ChunkSeries iteration, not sending. A failed send should result in an error returned from client.SendQueryStream, no?

pkg/ingester/ingester.go Outdated Show resolved Hide resolved
pkg/ingester/ingester.go Outdated Show resolved Hide resolved
@aknuds1 aknuds1 marked this pull request as ready for review October 19, 2023 09:43
@aknuds1 aknuds1 requested review from a team as code owners October 19, 2023 09:43
CHANGELOG.md Outdated Show resolved Hide resolved
pkg/ingester/ingester.go Outdated Show resolved Hide resolved
pkg/ingester/ingester.go Outdated Show resolved Hide resolved
@aknuds1 aknuds1 added enhancement New feature or request and removed bug Something isn't working labels Oct 19, 2023
Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with the understanding that this will be removed as soon as we understand the context cancelations better

@aknuds1 aknuds1 changed the title Ingester.QueryStream: Ignore context cancellation for chunk queriers Ingester.QueryStream: Add support for ignoring context cancellation for chunk queriers Oct 19, 2023
@aknuds1 aknuds1 merged commit f737a96 into main Oct 19, 2023
30 checks passed
@aknuds1 aknuds1 deleted the arve/chunk-query-ignore-ctx branch October 19, 2023 12:09
grafanabot pushed a commit that referenced this pull request Oct 19, 2023
…or chunk queriers (#6408)

* Ingester.QueryStream: Add support for ignoring context cancellation for chunk queriers

In Ingester.QueryStream, support ignoring of context cancellation wrt. chunk
queriers, the way it worked prior to
#6085. If the context is canceled
though, (span) log it with full diagnostics.

Also wrap some chunk querying errors.

---------

Signed-off-by: Arve Knudsen <[email protected]>
(cherry picked from commit f737a96)
aknuds1 added a commit that referenced this pull request Oct 19, 2023
…or chunk queriers (#6408) (#6440)

* Ingester.QueryStream: Add support for ignoring context cancellation for chunk queriers

In Ingester.QueryStream, support ignoring of context cancellation wrt. chunk
queriers, the way it worked prior to
#6085. If the context is canceled
though, (span) log it with full diagnostics.

Also wrap some chunk querying errors.

---------

Signed-off-by: Arve Knudsen <[email protected]>
(cherry picked from commit f737a96)

Co-authored-by: Arve Knudsen <[email protected]>
@dimitarvdimitrov
Copy link
Contributor

since we found the root cause to be what grafana/mimir-prometheus#546 fixed, can we revert this PR @aknuds1 ?

@aknuds1
Copy link
Contributor Author

aknuds1 commented Oct 27, 2023

@dimitarvdimitrov I was planning on reverting this next week (have a reminder set), after we've tested that fix for a bit. Sound good?

aknuds1 added a commit that referenced this pull request Nov 3, 2023
…lation for chunk queriers (#6408)"

This reverts commit f737a96.

Signed-off-by: Arve Knudsen <[email protected]>
aknuds1 added a commit that referenced this pull request Nov 3, 2023
…lation for chunk queriers (#6408)" (#6555)

This reverts commit f737a96.

Signed-off-by: Arve Knudsen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants