-
Notifications
You must be signed in to change notification settings - Fork 427
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCT/CUDA-IPC: Fix MNNVL reachability check #10198
base: master
Are you sure you want to change the base?
Conversation
4f0b8a1
to
2dbaf37
Compare
2dbaf37
to
d670af0
Compare
d670af0
to
7e2b0ec
Compare
out: | ||
#endif | ||
if ((mnnvl_enable == UCS_YES) && !mnnvl_supported) { | ||
ucs_error("multi-node NVLINK support is requested but not supported"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: adapt message to show when it's build that does not support it?
cu_device)); | ||
if (status != UCS_OK) { | ||
if (dev_addr->mnnvl_addr.clique_id != md->fabric_info.cliqueId){ | ||
uct_iface_fill_info_str_buf(params, "clique id doesn't match"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brminich The main concern with reachability check I have here is that device 0 determines reachability for the eventual device that this thread will use during transfer/progress time. Devices on the same node aren't required to be part of the same cluster or clique even though that is most likely true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is your suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are implications for multi-device support if we go with the current logic.
CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES, | ||
cu_device)); | ||
if (status != UCS_OK) { | ||
if (dev_addr->mnnvl_addr.clique_id != md->fabric_info.cliqueId){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (dev_addr->mnnvl_addr.clique_id != md->fabric_info.cliqueId){ | |
if (dev_addr->mnnvl_addr.clique_id != md->fabric_info.cliqueId) { |
|
||
/** | ||
* Device address length. If not provided, the transport will assume a | ||
* default minimal length according to the address buffer contents. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is better to use minimum
.
/** | ||
* Check whether the platform is coherent. | ||
* | ||
* @return 1 if coherent, or 0 otherwise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be just , otherwise 0.
* Device address length. If not provided, the transport will assume a | ||
* default minimal length according to the address buffer contents. | ||
*/ | ||
size_t device_addr_length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check format code stage output. There are relevant issues.
What
Fix MNNVL reachability check