Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCT/CUDA-IPC: Fix MNNVL reachability check #10198

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

brminich
Copy link
Contributor

@brminich brminich commented Oct 1, 2024

What

Fix MNNVL reachability check

@brminich brminich force-pushed the uct/fix_cuda_ipc_mnnvl_reach branch 2 times, most recently from 4f0b8a1 to 2dbaf37 Compare October 3, 2024 13:19
@brminich brminich force-pushed the uct/fix_cuda_ipc_mnnvl_reach branch from 2dbaf37 to d670af0 Compare October 9, 2024 14:46
@brminich brminich marked this pull request as ready for review October 10, 2024 13:17
src/uct/api/v2/uct_v2.h Outdated Show resolved Hide resolved
src/uct/cuda/base/cuda_md.c Show resolved Hide resolved
src/uct/cuda/cuda_ipc/cuda_ipc_md.c Outdated Show resolved Hide resolved
src/uct/cuda/cuda_ipc/cuda_ipc_md.c Show resolved Hide resolved
src/uct/cuda/cuda_ipc/cuda_ipc_md.c Outdated Show resolved Hide resolved
src/uct/cuda/cuda_ipc/cuda_ipc_iface.c Outdated Show resolved Hide resolved
src/uct/cuda/cuda_ipc/cuda_ipc_iface.c Outdated Show resolved Hide resolved
src/uct/cuda/cuda_ipc/cuda_ipc_iface.c Outdated Show resolved Hide resolved
src/uct/cuda/cuda_ipc/cuda_ipc_iface.c Outdated Show resolved Hide resolved
src/uct/cuda/cuda_ipc/cuda_ipc_iface.c Outdated Show resolved Hide resolved
config/m4/cuda.m4 Show resolved Hide resolved
src/uct/api/v2/uct_v2.h Outdated Show resolved Hide resolved
src/uct/cuda/base/cuda_md.c Outdated Show resolved Hide resolved
src/uct/cuda/cuda_ipc/cuda_ipc_iface.c Outdated Show resolved Hide resolved
src/uct/cuda/cuda_ipc/cuda_ipc_iface.c Outdated Show resolved Hide resolved
src/uct/cuda/cuda_ipc/cuda_ipc_iface.c Outdated Show resolved Hide resolved
src/uct/cuda/cuda_ipc/cuda_ipc_iface.c Outdated Show resolved Hide resolved
src/uct/cuda/cuda_ipc/cuda_ipc_iface.c Outdated Show resolved Hide resolved
src/uct/cuda/cuda_ipc/cuda_ipc_iface.c Outdated Show resolved Hide resolved
yosefe
yosefe previously approved these changes Oct 31, 2024
yosefe
yosefe previously approved these changes Nov 6, 2024
out:
#endif
if ((mnnvl_enable == UCS_YES) && !mnnvl_supported) {
ucs_error("multi-node NVLINK support is requested but not supported");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: adapt message to show when it's build that does not support it?

tvegas1
tvegas1 previously approved these changes Nov 6, 2024
cu_device));
if (status != UCS_OK) {
if (dev_addr->mnnvl_addr.clique_id != md->fabric_info.cliqueId){
uct_iface_fill_info_str_buf(params, "clique id doesn't match");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brminich The main concern with reachability check I have here is that device 0 determines reachability for the eventual device that this thread will use during transfer/progress time. Devices on the same node aren't required to be part of the same cluster or clique even though that is most likely true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is your suggestion?

Copy link
Contributor

@Akshay-Venkatesh Akshay-Venkatesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are implications for multi-device support if we go with the current logic.

CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES,
cu_device));
if (status != UCS_OK) {
if (dev_addr->mnnvl_addr.clique_id != md->fabric_info.cliqueId){
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (dev_addr->mnnvl_addr.clique_id != md->fabric_info.cliqueId){
if (dev_addr->mnnvl_addr.clique_id != md->fabric_info.cliqueId) {


/**
* Device address length. If not provided, the transport will assume a
* default minimal length according to the address buffer contents.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to use minimum.

/**
* Check whether the platform is coherent.
*
* @return 1 if coherent, or 0 otherwise.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be just , otherwise 0.

* Device address length. If not provided, the transport will assume a
* default minimal length according to the address buffer contents.
*/
size_t device_addr_length;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check format code stage output. There are relevant issues.

@brminich brminich dismissed stale reviews from tvegas1 and yosefe via 6e9f239 November 15, 2024 13:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants