RoCE LAG and bond device #9849
Replies: 3 comments
-
UCX can get full BW (2x25Gb/s = 50Gb/s) when using either UCX_NET_DEVICES=mlx5_bond_0:1 or UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1. Creating a RoCE LAG bond device is not required, and both modes can be used.
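For reference, a quick way to verify this is to run the same bandwidth test under both settings. This is a minimal sketch using the ucx_perftest tool that ships with UCX; the device names match the ones above, and <server_host> is a placeholder for the peer's hostname:

```
# Server side (start first; either device setting works here):
UCX_NET_DEVICES=mlx5_bond_0:1 ucx_perftest -t tag_bw -s 1048576

# Client side, first through the RoCE LAG bond device:
UCX_NET_DEVICES=mlx5_bond_0:1 ucx_perftest <server_host> -t tag_bw -s 1048576

# Then through the two physical ports listed explicitly (multi-rail, no bond):
UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 ucx_perftest <server_host> -t tag_bw -s 1048576
```

If both ports are actually being used, both runs should report roughly the aggregate line rate.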
-
Thanks for the information @yosefe! For the RoCE LAG case (where UCX_NET_DEVICES=mlx5_bond_0:1 is configured), could you please point me to where this takes place? I searched the code, and it seems related to the num_paths attribute of uct_ib_iface_t, but I only found it used by the dc transport type, not rc. Does that mean full BW can only be obtained with the dc transport? Please correct me if I'm wrong, thanks!
Is there any chance to let the hardware take care of automatic failover after connection establishment (i.e., during normal network traffic)? I am using an 802.3ad bond for LAG now; maybe changing the bond mode to active-backup (mode 1) would give hardware-level failover?
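If it helps while digging through the code, one way to inspect the knob that controls the path count (hedged: variable name as found in recent UCX releases) is:

```
# UCX_IB_NUM_PATHS controls how many paths UCX opens per IB/RoCE device;
# per the docs quoted below, the default "auto" is expected to resolve to 2
# on a 2-port RoCE LAG device.
ucx_info -c | grep NUM_PATHS
```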
-
According to my log, it seems UCX found the LAG paths (printed as mlx5_bond_1:1.0 and mlx5_bond_1:1.1) and utilized both of them to send rndv traffic:
[1714293786.896837] [host-192-168-20-64:1929614:0] wireup.c:1087 UCX DEBUG ep 0x7f7757023000: lane[0]: 1:rc_mlx5/mlx5_bond_1:1.0 md[0] -> addr[1].md[0]/ib/sysdev[255] rma_bw#0 am am_bw#0
@yosefe could you please help take a look: does this mean that if one port goes down while sending a message, all paths will fail, so we lose the bond's automatic failover ability? AFAIK, RoCE bonding such as 802.3ad can migrate QPs automatically from a failed port to the new active port in hardware, and it should be transparent to the application (UCX), right? Did I miss anything? Thanks!
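For anyone trying to reproduce this: the lane printout above comes from UCX debug logging. A hedged sketch, where ./my_ucx_app is a placeholder for any UCX-based application:

```
UCX_LOG_LEVEL=debug UCX_NET_DEVICES=mlx5_bond_1:1 ./my_ucx_app 2>&1 | grep 'wireup.c.*lane'
```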
-
According to https://docs.nvidia.com/networking/display/hpcxv218/unified+communication+-+x+framework+library#src-2568379055_UnifiedCommunicationXFrameworkLibrary-rocelag, which says "UCX is now able to detect a RoCE LAG device and automatically create two RDMA connections to utilize the full bandwidth of LAG interface." Does that mean we do not need to create a bond device, i.e., we can pass mlx5_0:1 and mlx5_1:1 to UCX instead of mlx5_bond_0:1, and still get full bandwidth, for example two 25Gb ports giving 50Gb of bandwidth?
If that is true, then what about the failover mechanisms provided by the bond device, such as mode 4 (802.3ad)? Will we lose failover ability?
We generally do not want to lose failover ability. Does this mean we have to create a bond device, pass mlx5_bond_0:1 to UCX, and disable the out-of-the-box RoCE LAG detection in some way? (Bond setup sketched below for context.)
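For context, this is roughly what creating such a bond looks like. A hedged sketch with assumed interface names eth2/eth3 (both ports of the same HCA; the switch side must be configured for LACP in 802.3ad mode):

```
ip link add bond0 type bond mode 802.3ad
ip link set eth2 down; ip link set eth2 master bond0
ip link set eth3 down; ip link set eth3 master bond0
ip link set bond0 up
# With both ports of one adapter enslaved, the mlx5 driver merges them into a
# single RDMA device (e.g. mlx5_bond_0), which UCX_NET_DEVICES=mlx5_bond_0:1 selects.
ibv_devices   # should now list mlx5_bond_0 instead of mlx5_0/mlx5_1
```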