Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCP/PROTO/MULTI: Allow header be greater than max_frag #10263

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

iyastreb
Copy link
Contributor

@iyastreb iyastreb commented Oct 28, 2024

What

This is a fix for RM#4134043, which is a regression introduced in #9920

This issue happens when MAX_EAGER_LANES > 1

Why ?

In multi-fragment protocols we set max_frag per lane proportionally to it's bandwidth. However there is a single AM lane that could get max_frag assigned less than a maximum header size. This provokes an assertion

How ?

  1. Remove an assertion checking that header size must be smaller than max_frag
  2. If header is greater than max_frag - that's ok, we allow it, but we send only the header data and keep the payload empty

@iyastreb iyastreb force-pushed the ucp-proto-multi-force-header-send branch from 5b5b1e3 to ce3f490 Compare October 28, 2024 15:33
@iyastreb
Copy link
Contributor Author

/azp run

@brminich
Copy link
Contributor

/azp run UCX PR

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

@brminich brminich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo, it seems 2 separate PRs are worth creating, because you are addressing two separate issues.

@@ -59,8 +73,7 @@ ucs_status_t ucp_proto_multi_init(const ucp_proto_multi_init_params_t *params,
&params->super, params->first.lane_type, params->first.tl_cap_flags,
1, 0, lanes);
if (num_lanes == 0) {
ucs_trace("no lanes for %s",
ucp_proto_id_field(params->super.super.proto_id, name));
ucs_trace("no lanes for %s", perf_name);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not the same I'd keep proto_name here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

/* If type of the first lane is not the same as in the middle, then
* exclude this lane from sorting */
lane = (params->first.lane_type == params->middle.lane_type) ? 0 : 1;
/* Sort lanes array by lanes_perf bandwidth */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is already similar comment on line 101

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@iyastreb
Copy link
Contributor Author

imo, it seems 2 separate PRs are worth creating, because you are addressing two separate issues.

Ok, will split them

@iyastreb iyastreb changed the title UCP/PROTO/MULTI: Allow header be greater than max_frag, sort lanes by BW UCP/PROTO/MULTI: Allow header be greater than max_frag Oct 30, 2024
Copy link
Contributor

@yosefe yosefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this for v1.18.x?

lpriv->max_frag, hdr_size);
/* If header is greater than max_frag - that's ok, we allow it, but we send
* only the header data and keep the payload empty */
if (lpriv->max_frag <= hdr_size) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe smth like

ssize_t max_frag;
...
max_frag = lpriv->max_frag - hdr_size;
if (max_frag < 0) {
     return 0;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, good idea

Copy link
Contributor Author

@iyastreb iyastreb Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently I'm reverting this change, because signed type ssize_t has only half width of size_t and therefore cannot be reliably used to hold the difference between two size_t variables. The failure is exposed when running test_ucp_rma.put_blocking tests, where lpriv->max_frag is almost equal to SIZE_MAX and hdr_size=0. In that case the difference = SIZE_MAX, but it becomes negative when casted to ssize_t that leads to test failures.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIR lpriv->max_frag can be set to lane_perf->max_frag which is a transport limitation, so may there be a situation when some lanes don't really support such big header size so we cannot send even header?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I see in ucp_worker_set_max_am_header the worker->max_am_header is set so that header can fit in any lane. If user sends something larger than that, then we reject this input:

ucp_am_send_nbx_check_header_length(ucp_worker_h worker, size_t header_length)
{
    if (ENABLE_PARAMS_CHECK && (header_length > worker->max_am_header)) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this case, better not calculate max_frag = lpriv->max_frag - hdr_size during variable declaration, but do it after checking (lpriv->max_frag <= hdr_size)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@iyastreb
Copy link
Contributor Author

do we need this for v1.18.x?

@yosefe I'm not sure if we need to backport this to 1.18, because this issue happens only with MAX_EAGER_LANES > 1, which is not default config

@brminich
Copy link
Contributor

can you laso remove UD restrictions from
https://github.com/openucx/ucx/blob/master/test/gtest/ucp/test_ucp_am.cc#L1588
and https://github.com/openucx/ucx/blob/master/test/gtest/ucp/test_ucp_am.cc#L1599
as these tests are supposed to pass with this fix

@brminich
Copy link
Contributor

do we need this for v1.18.x?

@yosefe I'm not sure if we need to backport this to 1.18, because this issue happens only with MAX_EAGER_LANES > 1, which is not default config

imo, no need to backport

@iyastreb iyastreb closed this Oct 31, 2024
@iyastreb iyastreb reopened this Oct 31, 2024
brminich
brminich previously approved these changes Nov 1, 2024
Copy link
Contributor

@brminich brminich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, if no issues are found with re-enabled tests

Comment on lines 69 to 74
/* If header is greater than max_frag - that's ok, we allow it, but we send
* only the header data and 1 byte of payload.
* 1 byte is needed to distinguish between first and middle fragments that
* we do by comparing req->send.state.dt_iter.offset with 0.
* @see ucp_worker_set_max_am_header
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we decided that no need to backport this fix to v1.18.x, how hard it will be to fix according to

         * TODO: fix generic AM based multi-fragment protocols, so that this
         * trick is not needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@iyastreb
Copy link
Contributor Author

iyastreb commented Nov 4, 2024

/azp run UCX PR

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Comment on lines 2364 to +2366
max_uct_fragment = ucs_max(if_attr->cap.am.max_bcopy,
max_ucp_header - 1) -
max_ucp_header - 1;
max_ucp_header) -
max_ucp_header;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, seems just a little more readable for me:

            max_uct_fragment = ucs_max(if_attr->cap.am.max_bcopy - 
                                       max_ucp_header, 0);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's more readable, but not always correct. What if max_ucp_header > if_attr->cap.am.max_bcopy?
We have to be careful when deducting one size_t from the other, because we can easily get a "negative" result which translates into some huge number. I guess it exists in this form for exactly this reason - to always guarantee the correctness

@@ -265,11 +265,6 @@ size_t ucp_datatype_iter_iov_next_iov(const ucp_datatype_iter_t *dt_iter,
ucs_assertv(length <= max_length, "length=%zu max_length=%zu", length,
max_length);

/* Check that if not reached the end, packed data is not empty */
ucs_assertv((dt_iter->offset == dt_iter->length) || (length > 0),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that true that all middle fragments still should always send some payload? Maybe somehow save this check for middle segments only (e.g. check offset)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's true
To have this assert I would need to propagate request (or at least request flags/boolean) to this function and to all the functions calling it (ucp_datatype_iter_next_iov refs in 7 other files). If it's really important to have this check, we can do that.
@yosefe - since you added this assertion, do we need to keep it even at the cost of adding one more param to this function?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that it is necessary to add request argument here. We can pass only information that is needed and not to this function but to ucp_datatype_iter_next_iov. Also maybe we can do that even without passing request at all by relaxing this assertion (E.g. check that only if offset > 0) WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I add relaxed check, still maybe useful

lpriv->max_frag, hdr_size);
/* If header is greater than max_frag - that's ok, we allow it, but we send
* only the header data and keep the payload empty */
if (lpriv->max_frag <= hdr_size) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIR lpriv->max_frag can be set to lane_perf->max_frag which is a transport limitation, so may there be a situation when some lanes don't really support such big header size so we cannot send even header?

@@ -1584,9 +1584,8 @@ class test_ucp_am_nbx_dts : public test_ucp_am_nbx_reply {

/* Skip tests for ud_v and ud_x because of unstable reproducible failures during
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, thanks

@@ -266,7 +266,8 @@ size_t ucp_datatype_iter_iov_next_iov(const ucp_datatype_iter_t *dt_iter,
max_length);

/* Check that if not reached the end, packed data is not empty */
ucs_assertv((dt_iter->offset == dt_iter->length) || (length > 0),
ucs_assertv((dt_iter->offset == 0) ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you pls update the comment above and mention new check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why needed to update the assert?
maybe we should skip this function altogether if there is no room for data, or exit this function early if max_length==0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok
We can't easily skip this function, because it updates the next iteration state
For now I just exit from the function earlier

@@ -452,7 +452,8 @@ class test_proto_reset : public ucp_test {
{
const ucp_datatype_iter_t *dt_iter = &req->send.state.dt_iter;

return !ucp_datatype_iter_is_begin(dt_iter) &&
return ((req->flags & UCP_REQUEST_FLAG_MIDDLE_FRAGMENT) ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use ucp_proto_multi_is_middle_fragment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, looks logical, it was my initial impl, but then I removed it, because including "proto_multi.inl" into C++ translation unit requires quite some unrelated changes in that file: see my reverted commit which fixes some of them: e74c74a

So I thought it's not worth it.
Else we can move ucp_proto_multi_is_middle_fragment function to proto_multi.h

@@ -78,6 +78,7 @@ static size_t ucp_am_eager_multi_bcopy_pack_args_first(void *dest, void *arg)

ucs_assertv(req->send.state.dt_iter.offset == 0, "offset %zu",
req->send.state.dt_iter.offset);
ucs_assert(ucp_proto_multi_is_first_fragment(req));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we still need the assertion at line 79?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this is complementary check, which is still eligible IMO

@@ -44,6 +44,7 @@ static const char *ucp_request_flag_names[] = {
[ucs_ilog2(UCP_REQUEST_FLAG_RECV_TAG)] = "rcv_tag",
[ucs_ilog2(UCP_REQUEST_FLAG_RKEY_INUSE)] = "rk_use",
[ucs_ilog2(UCP_REQUEST_FLAG_USER_HEADER_COPIED)] = "hdr_copy",
[ucs_ilog2(UCP_REQUEST_FLAG_MIDDLE_FRAGMENT)] = "mid_frag",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO a global request flag is misleading because it's only for AM send protocol.
Maybe add something in req->send.msg_proto.am?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -254,7 +254,7 @@ static UCS_F_ALWAYS_INLINE ucs_status_t ucp_am_eager_multi_zcopy_send_func(
UCS_STATIC_ASSERT(sizeof(hdr.first) == sizeof(ucp_am_hdr_t));
UCS_STATIC_ASSERT(sizeof(hdr.middle) == sizeof(ucp_am_hdr_t));

if (req->send.state.dt_iter.offset == 0) {
if (ucp_proto_multi_is_first_fragment(req)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems this function should be active-message specific (as we see, it's not used in other mult-frag protocols such as tag/eager). so move it to proto_am.inl and rename to ucp_proto_am_is_first_fragment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -266,7 +266,8 @@ size_t ucp_datatype_iter_iov_next_iov(const ucp_datatype_iter_t *dt_iter,
max_length);

/* Check that if not reached the end, packed data is not empty */
ucs_assertv((dt_iter->offset == dt_iter->length) || (length > 0),
ucs_assertv((dt_iter->offset == 0) ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why needed to update the assert?
maybe we should skip this function altogether if there is no room for data, or exit this function early if max_length==0?

@@ -806,7 +807,8 @@ void ucp_proto_request_restart(ucp_request_t *req)
}

/* Select a protocol with resume request support */
if (!ucp_datatype_iter_is_begin(&req->send.state.dt_iter)) {
if (ucp_proto_multi_is_middle_fragment(req) ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is active-message specific, so maybe the protocol "reset" callback should provide the indication of whether a "RESUME" flags is going to be required or not, instead of just deciding it according to ucp_datatype_iter_is_begin

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reduced the scope of the changes to AM eager multi, and I don't see test failures in this function

Comment on lines 44 to 50
ucp_proto_multi_is_middle_fragment(const ucp_request_t *req)
{
return req->flags & UCP_REQUEST_FLAG_MIDDLE_FRAGMENT;
}

static UCS_F_ALWAYS_INLINE int
ucp_proto_multi_is_first_fragment(const ucp_request_t *req)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need both wrappers? just for the inverse check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, removing the middle one

lpriv->max_frag, hdr_size);
/* If header is greater than max_frag - that's ok, we allow it, but we send
* only the header data and keep the payload empty */
if (lpriv->max_frag <= hdr_size) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this case, better not calculate max_frag = lpriv->max_frag - hdr_size during variable declaration, but do it after checking (lpriv->max_frag <= hdr_size)

/**
* AM specific internal request flags
*/
enum ucp_request_am_impl_flags {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe define it in proto_am.inl?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

brminich
brminich previously approved these changes Nov 14, 2024
* AM specific internal request flags
*/
enum ucp_request_am_impl_flags {
UCP_REQUEST_AM_FLAG_MIDDLE_FRAGMENT = UCS_BIT(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor:

Suggested change
UCP_REQUEST_AM_FLAG_MIDDLE_FRAGMENT = UCS_BIT(0)
UCP_REQUEST_FLAG_AM_MIDDLE_FRAGMENT = UCS_BIT(0)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants