Evaluating the nativelink by bazel test doesn't work on local Ubuntu machine #664

Closed
steedmicro opened this issue Feb 15, 2024 · 8 comments
Labels: bug (Something isn't working)

Comments

@steedmicro
Contributor

steedmicro commented Feb 15, 2024

Currently, evaluating NativeLink with bazel test doesn't work on a local Ubuntu machine.
I've confirmed this issue on my Linux VPS. At first I thought the VPS's performance was to blame, but I soon realized it was caused by a bug in our project.

FYI, the Bazel version was upgraded from 6.4.0 to 7.0.0 by this PR: Migrate to Bzlmod (#626).

With Bazel 7.0.0, evaluating NativeLink with bazel test fails like this:

[Screenshot: 7 0 0_not_working]

When I downgrade to Bazel 6.4.0, it fails like this. It seems Bazel 6.4.0 no longer works with our codebase.

[Screenshot: 6 4 0_not_working]

But when I reset the code to the commit just before the Migrate to Bzlmod PR and downgrade Bazel to 6.4.0, it works well.

[Screenshot: previous_working]

It seems that the migration to Bzlmod introduced an issue that makes bazel test fail locally.

Also, the documentation lists the Bazel requirement as Bazel 6.4.0+ even though the .bazelversion file says 7.0.0. The documentation should be updated to say Bazel 7.0.0+.

* Bazel 6.4.0+
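
A mismatch like the one described above, where the documented minimum and the pinned version disagree, could in principle be caught by a small check. The following is only a minimal sketch in Python; the helper names `parse_version` and `docs_match_pin` are hypothetical and not part of NativeLink, and the two version strings are simply the ones quoted in this issue.

```python
def parse_version(text):
    """Turn a version string such as '7.0.0' or '6.4.0+' into a tuple of ints."""
    # Drop surrounding whitespace and a trailing '+' ("or newer" marker).
    cleaned = text.strip().rstrip("+")
    return tuple(int(part) for part in cleaned.split("."))


def docs_match_pin(documented_minimum, pinned):
    """Return True when the documented minimum equals the pinned version."""
    return parse_version(documented_minimum) == parse_version(pinned)


if __name__ == "__main__":
    # Values quoted in this issue: the docs say 6.4.0+, .bazelversion says 7.0.0.
    documented = "6.4.0+"
    pinned = "7.0.0"
    if docs_match_pin(documented, pinned):
        print("documentation and .bazelversion agree")
    else:
        print(f"documentation says {documented}, but .bazelversion pins {pinned}")
```

Comparing parsed tuples rather than raw strings keeps the check insensitive to the trailing "+" and stray whitespace.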

@steedmicro
Contributor Author

I'd appreciate your opinion here, @aaronmondal. Thanks.

@aaronmondal
Member

It's somewhat expected that a downgrade to 6.4.0 doesn't work. A lot of changes were necessary to get bzlmod working, and the Bazel 7 changes and bzlmod are more or less dependent on each other. So I don't think the Bzlmod setup will work with Bazel 6.4.

The error you're getting is new to me. We still run integration tests in CI against Ubuntu runners, even more than before the bzlmod changes. My initial guess would be that you're using an Ubuntu older than 22.04. Backwards compatibility with older Ubuntu versions is now handled via --config=linux_zig:

nativelink/.bazelrc

Lines 75 to 90 in 59d3d28

# Option to test the zig toolchain on Linux. Prefer the default `linux`
# toolchain which builds cc targets roughly twice as fast.
#
# WARNING:
#
# We're using an incredibly old target glibc here. Builds created with this
# toolchain have maximum compatibility (theoretically down to Ubuntu 18), but
# miss out on half a decade of optimizations. Don't use this for production
# builds if you're running a non-ancient OS and care about performance.
#
# TODO(aaronmondal): Migrate to a statically linked musl as soon as rules_rust
# supports it. This way we get to keep (or even improve)
# backwards compatibility without sacrificing performance.
build:linux_zig --host_platform=@zig_sdk//platform:linux_amd64
build:linux_zig --extra_toolchains=@zig_sdk//libc_aware/toolchain:linux_amd64_gnu.2.28
build:linux_zig --repo_env=BAZEL_DO_NOT_DETECT_CPP_TOOLCHAIN=1

That config pins the cc toolchain to a toolchain that works with older versions of Ubuntu.

Example of running the unit tests locally (non-remote):

bazel test --config=linux_zig //... --verbose_failures

Building NativeLink against itself using an Ubuntu 20 host:

- name: Compile NativeLink with NativeLink

Something I'm noticing, though, is that we don't seem to run the test suite against that remote configuration on an older Ubuntu. We might want to add a bazel test invocation to that run as well.

If the issues persist even with --config=linux_zig, then this looks like an issue with rules_rust.

@steedmicro
Contributor Author

Thanks for your answers, @aaronmondal.
But I've just checked, and my Ubuntu machine is on 22.04.
And even with --config=linux_zig, it didn't work.

[Screenshot: Capture]

@aaronmondal
Member

Hmm, OK. I suspect some kind of cache mismatch or wrong cache reuse triggered by the external dependencies built by rules_rust. The "discarding cache" warning seems to suggest that some other artifacts are already present. Maybe those are from a previous build and somehow interfere with each other. IIRC rules_rust had irreproducibility issues not too long ago. Maybe this is an instance of that.

Could you run this build again with --verbose_failures, and then once more after removing the entire Bazel directory ~/.cache/bazel/_bazel_xxx/?

I'll also try to reproduce over the weekend.
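
The two requested runs could be scripted roughly as follows. This is only a sketch in Python under stated assumptions: the helper names `bazel_test_command` and `clean_rebuild_commands` are hypothetical, and `bazel clean --expunge` is swapped in for manually deleting the cache directory, which is not exactly the same thing because it only removes the output base of the current workspace. The script prints the commands rather than running them.

```python
import shlex


def bazel_test_command(config=None, verbose_failures=True, targets=("//...",)):
    """Build the argv list for a bazel test run; nothing is executed here."""
    argv = ["bazel", "test"]
    if config:
        # For example "linux_zig", the config discussed in this thread.
        argv.append(f"--config={config}")
    argv.extend(targets)
    if verbose_failures:
        argv.append("--verbose_failures")
    return argv


def clean_rebuild_commands(config=None):
    """Return the commands for a from-scratch run: expunge first, then test."""
    return [["bazel", "clean", "--expunge"], bazel_test_command(config)]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run by hand.
    for argv in clean_rebuild_commands("linux_zig"):
        print(shlex.join(argv))
```

Building the argument list separately from executing it makes it easy to paste the exact command into a report like this one.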

@aaronmondal
Member

Ah, and one other thing that could be related: what happens if this build is run as a non-root user? This could affect some temporary directories or file reads during implicit cargo invocations.

@aaronmondal added the bug (Something isn't working) label Feb 16, 2024
@allada
Member

allada commented Feb 16, 2024

Try bazel clean --expunge

@steedmicro
Contributor Author

Thanks for your opinions.

FYI, I tried as a non-root user that I created, but the result was the same. cc: @aaronmondal.

[Screenshot: Capture]

Also, I had already tried bazel clean --expunge and retried, but that didn't work either. cc: @allada.

Below is a screenshot of the command output with the --verbose_failures flag.

[Screenshot: Capture]

@steedmicro
Contributor Author

steedmicro commented Feb 23, 2024

This is fixed in this PR: #669.

cc: @aaronmondal, @allada, @MarcusSorealheis.
