Max concurrent GrpcStore streams #656
Conversation
Reviewed 6 of 8 files at r1, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained
a discussion (no related file):
nit: It doesn't fix the issue, it is towards that issue.
nativelink-config/src/schedulers.rs
line 139 at r1 (raw file):
/// Limit the number of simultaneous upstream requests to this many. A
/// value of zero is treated as unlimited.
nit: It might be good to document what happens if we exceed the limit (i.e. do we backpressure/queue or drop).
nativelink-config/src/stores.rs
line 551 at r1 (raw file):
/// Limit the number of simultaneous upstream requests to this many. A
/// value of zero is treated as unlimited.
ditto.
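(For context on the backpressure question: assuming the permit-based limiting mentioned later in this review, the "zero is unlimited" semantics would look roughly like the sketch below. make_limiter and with_stream_permit are invented names, not nativelink APIs; callers that exceed the limit wait for a permit rather than being dropped.)

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

/// Illustrative only: a value of zero disables limiting entirely.
fn make_limiter(max_concurrent_streams: usize) -> Option<Arc<Semaphore>> {
    (max_concurrent_streams != 0).then(|| Arc::new(Semaphore::new(max_concurrent_streams)))
}

/// Illustrative only: requests beyond the limit wait for a permit
/// (backpressure) instead of being dropped.
async fn with_stream_permit<F, T>(limiter: Option<&Arc<Semaphore>>, fut: F) -> T
where
    F: std::future::Future<Output = T>,
{
    let _permit = match limiter {
        Some(sem) => Some(sem.acquire().await.expect("semaphore never closed")),
        None => None,
    };
    // The permit is held until the upstream call completes.
    fut.await
}
```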
nativelink-scheduler/src/grpc_scheduler.rs
line 76 at r1 (raw file):
fn on_error(&self, err: &Error) {
    if err.code != Code::Internal {
nit: Maybe warn!() the error?
nativelink-scheduler/src/grpc_scheduler.rs
line 76 at r1 (raw file):
fn on_error(&self, err: &Error) {
    if err.code != Code::Internal {
Should we use some kind of map of codes to know if we should reconnect?
nativelink-store/src/grpc_store.rs
line 223 at r1 (raw file):
/// This handles the permit for limiting concurrency, and also re-connecting the
/// underlying channel on error. It depends on users reporting all errors.
struct Connection<'a> {
nit: This looks nearly identical to the one above in grpc_scheduler. Can we combine them w/ templates and put them into grpc_utils or something?
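(A rough sketch of the shared helper being suggested; the names and fields below are guesses rather than the actual grpc_utils API: one Connection type, usable by both the store and the scheduler, that owns the concurrency permit and records which channel generation it came from.)

```rust
use tokio::sync::OwnedSemaphorePermit;
use tonic::transport::Channel;

/// Illustrative only: a single wrapper shared by GrpcStore and GrpcScheduler.
struct Connection {
    /// Released when the request finishes, freeing a concurrency slot.
    _permit: OwnedSemaphorePermit,
    /// Identifies the channel generation so an error can be attributed to it.
    channel_id: usize,
    /// Cloning a tonic Channel is cheap; each request gets its own handle.
    channel: Channel,
}
```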
Force-pushed from 02dd17b to 3f341d3.
Reviewable status: 0 of 1 LGTMs obtained
nativelink-config/src/stores.rs
line 551 at r1 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
ditto.
Done.
nativelink-scheduler/src/grpc_scheduler.rs
line 76 at r1 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: Maybe warn!() the error?
That's the purpose of the upstream really.
nativelink-scheduler/src/grpc_scheduler.rs
line 76 at r1 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
Should we use some kind of map of codes to know if we should reconnect?
I don't think we need to; all of the transport errors are mapped to Internal within Tonic: https://github.com/hyperium/tonic/blob/master/tonic/src/status.rs#L376
Notably, it states that these indicate an issue with the library: https://github.com/hyperium/tonic/blob/master/tonic/src/status.rs#L282
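(In other words, a single-code check is enough for the reconnect decision. A hedged paraphrase of the condition quoted above, not the exact nativelink code:)

```rust
use tonic::Code;

/// Tonic maps transport-level failures to Code::Internal, so that one code
/// is the signal that the underlying channel needs to be re-established.
fn should_reconnect(code: Code) -> bool {
    code == Code::Internal
}
```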
Force-pushed from 3f341d3 to ef4ba39.
Reviewed 4 of 7 files at r2, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained
nativelink-util/src/grpc_utils.rs
line 91 at r2 (raw file):
// new connection now.
let mut channel_lock = self.parent.channel.lock();
// Channel already changed, don't do it again.
nit: Can we elaborate this a bit... it took me a while to understand this...
// This ensures only the first error reconnects and the other reconnect
// attempts use the Channel that the first one recreated.
nativelink-util/src/grpc_utils.rs
line 92 at r2 (raw file):
let mut channel_lock = self.parent.channel.lock();
// Channel already changed, don't do it again.
if channel_lock.0 != self.channel_id {
I'm pretty sure you need to do this on all channels after they are updated:
self.channel_id = channel_lock.0
(sadly you'll need another mutex :-( )
Failing to do so will cause the second reconnect attempt to result in get_channel always returning a bad channel.
Reviewable status: 0 of 1 LGTMs obtained
nativelink-util/src/grpc_utils.rs
line 92 at r2 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
I'm pretty sure you need to do this on all channels after they are updated:
self.channel_id = channel_lock.0
(sadly you'll need another mutex :-( )
Failing to do so will cause the second reconnect attempt to result in get_channel always returning a bad channel.
No, a Connection should only be used once. I really wanted to make this easier by impl tower::Service<http::HttpBody> for Connection, but that would involve adding a dependency to tower.
Essentially, we create a new Channel in the parent ConnectionManager on failure, but any existing Connection that's being used can complete (and perhaps error) by itself. This is done because in the case of GoAway the other established streams on the Channel are able to complete, but we need to terminate after that graceful disconnect has finished.
Reviewable status: 0 of 1 LGTMs obtained
nativelink-util/src/grpc_utils.rs
line 92 at r2 (raw file):
Previously, chrisstaite-menlo (Chris Staite) wrote…
No, a Connection should only be used once. I really wanted to make this easier by impl tower::Service<http::HttpBody> for Connection, but that would involve adding a dependency to tower.
Essentially, we create a new Channel in the parent ConnectionManager on failure, but any existing Connection that's being used can complete (and perhaps error) by itself. This is done because in the case of GoAway the other established streams on the Channel are able to complete, but we need to terminate after that graceful disconnect has finished.
Can we add some of this to a comment? This seems like something that we will foot-gun ourselves on in the future if we don't document it.
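(A hedged sketch of the behaviour described in this thread; ConnectionManager, get_channel, and on_error are illustrative names rather than the actual contents of nativelink-util/src/grpc_utils.rs. The manager pairs the current Channel with a generation id; only the first error against the current generation swaps in a new lazily-connected channel, and later errors from the old generation see that the id has moved on and leave the fresh channel untouched.)

```rust
use parking_lot::Mutex;
use tonic::transport::{Channel, Endpoint};

/// Illustrative sketch only; field and method names are guesses.
struct ConnectionManager {
    endpoint: Endpoint,
    /// (generation id, current channel).
    channel: Mutex<(usize, Channel)>,
}

impl ConnectionManager {
    /// Hand out the current channel along with its generation id so a later
    /// error can be attributed to the channel it actually happened on.
    fn get_channel(&self) -> (usize, Channel) {
        let guard = self.channel.lock();
        (guard.0, guard.1.clone())
    }

    /// Called when a Connection observes a transport error.
    fn on_error(&self, failed_id: usize) {
        let mut guard = self.channel.lock();
        // Channel already changed by an earlier error; don't reconnect again.
        // This ensures only the first error reconnects and later reports keep
        // using the channel the first one recreated.
        if guard.0 != failed_id {
            return;
        }
        // connect_lazy defers the actual connection to first use, so holding
        // the lock here is cheap.
        guard.1 = self.endpoint.connect_lazy();
        guard.0 += 1;
    }
}
```

Any in-flight Connection keeps its clone of the old Channel, which matches the GoAway behaviour described above: established streams are allowed to finish gracefully while new requests pick up the replacement channel.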
Add a configuration option to limit the number of concurrent streams that are active with the upstream GrpcStore. With this, we also introduce reconnection on error since Tonic doesn't do that.
Force-pushed from ef4ba39 to 935aac5.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, Remote / large-ubuntu-22.04, Vercel, asan / ubuntu-22.04, pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04
nativelink-util/src/grpc_utils.rs
line 92 at r2 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
Can we add some of this to a comment? This seems like something that we will foot-gun ourselves on in the future if we don't document it.
Done.
Reviewed 1 of 7 files at r2, 3 of 3 files at r3, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, publish-image, ubuntu-20.04 / stable, ubuntu-22.04 / stable, windows-2022 / stable
Reviewed all commit messages.
Reviewable status: complete! 1 of 1 LGTMs obtained
Description
Add a configuration option to limit the number of concurrent streams that are active with the upstream GrpcStore.
With this, we also introduce reconnection on error since Tonic doesn't do that.
Towards #602
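(For illustration, the new option might be declared roughly as below; the struct and field names, including max_concurrent_streams, are assumed from the PR title and the doc comment quoted in the review, not confirmed against the actual nativelink-config source.)

```rust
/// Sketch of the configuration field; the real struct in nativelink-config
/// may differ in name and surrounding fields.
#[derive(serde::Deserialize)]
pub struct GrpcEndpointConfig {
    /// Address of the upstream gRPC service.
    pub endpoint: String,
    /// Limit the number of simultaneous upstream requests to this many. A
    /// value of zero is treated as unlimited.
    #[serde(default)]
    pub max_concurrent_streams: usize,
}
```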
Type of change
How Has This Been Tested?
Checklist
bazel test //... passes locally
git amend (see some docs)