-
AWS is interested in keeping triggered ops. HPE does use them. Believe they are useful and worth keeping. Ian will check on GNI use.
EFA is interested in FI_MORE and FI_SOURCE_ERR. CXI will support both. FI_SOURCE_ERR has a use case with Mercury.
fi_log_subsys - no known use; could drop. Or add a getinfo subsys to filter/select getinfo output.
FI_LOG_TRACE - currently only used by the gni provider. OPX does use FI_LOG_TRACE via FI_DBG_TRACE. Hooking providers use FI_TRACE. Will keep.
fi_param_get_bool() - would like to change, but need to check the C standard. C99 supports _Bool; stdbool.h must be included for bool.
EFA can support rma_iov_count > 1. OPX references the field, but need to check whether it is used with a value > 1.
Seems okay not to hold fi_ext.h definitions to a backwards-compatible ABI/API?
EFA isn't using wait sets. Wait sets are used internally. No known uses of wait sets by attendees. Can we check?
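The fi_param_get_bool() note above can be made concrete with a small sketch. It assumes the current getter reports a boolean through an `int *` and shows what a `bool`-typed wrapper could look like under C99. Both `legacy_get_bool` and `param_get_bool` are hypothetical stand-ins (they parse a string instead of reading a real parameter) and are not libfabric functions.

```c
#include <assert.h>
#include <stdbool.h>   /* C99: provides bool, true, false on top of _Bool */
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an int-based getter; not a libfabric call.
 * Returns 0 and stores 0 or 1 in *value, or -1 if text is missing or
 * not a recognized boolean spelling. */
static int legacy_get_bool(const char *text, int *value)
{
    assert(value != NULL);
    if (text == NULL)
        return -1;
    if (!strcmp(text, "1") || !strcmp(text, "true") || !strcmp(text, "yes")) {
        *value = 1;
        return 0;
    }
    if (!strcmp(text, "0") || !strcmp(text, "false") || !strcmp(text, "no")) {
        *value = 0;
        return 0;
    }
    return -1;
}

/* Hypothetical bool-typed wrapper of the kind discussed in the notes. */
static int param_get_bool(const char *text, bool *value)
{
    int tmp = 0;
    int ret;

    assert(value != NULL);
    ret = legacy_get_bool(text, &tmp);
    if (ret == 0)
        *value = (tmp != 0);   /* only written on success */
    return ret;
}
```

A wrapper like this keeps the int-based entry point intact, so the type change would not by itself force an ABI break.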
-
Another option is to provide more significant changes. E.g. can we combine the fabric, domain, eq, av, cq into a single object?
No plan to merge EPs; support for multiple EPs is required.
Can create a layering API that gives this abstraction for ease of development. This would not provide benefits to the provider implementation, unless the provider can assume apps are using the 2.0 API.
Apps that use separate CQs would need separate domains, which would require registering memory multiple times.
Do we have apps that use multiple CQs? Amir believes there's at least 1 app (DDN?) that allocates 1 CQ per EP (connected EP). Check with Sylvain. Sandia openSHMEM may have an app with multiple EPs and CQs - 1 CQ per EP.
Is a single CQ per EP sufficient?
-
Updates from Nov 1 ofiwg:
No strong use case for having separate FI_SEND / FI_RECV CQs per EP. This impacts standard EPs only. Scalable EPs should be unaffected.
Should be able to clean up the AV calls. HPE will check if FI_SYMMETRIC is used. They may have an application use of it.
No comments on removing atomic valid calls. Can check whether to move the implementation into header files, with inline functions calling the query functions.
Keep FI_AFFINITY.
FI_AV_USER_ID - added to shm provider. Is useful for peer providers. Want to keep and implement.
Thread model - keeping FI_THREAD_DOMAIN. Cornelis (thread safe only?) and HPE will check on their models. In many cases, the EP and CQ must be used together to provide lockless operation. Goal is to guide apps toward a model they can implement for lockless operation. Assumption is that apps may code for one model, but not multiple. Proposal assumes thread_domain gives the broadest implementation option.
-
Would it be possible to add the libfabric event type and backend (like verbs) event types in struct fi_eq_err_entry? At a minimum, it would help with log/debug messages.
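To illustrate what the request above might look like, here is a minimal sketch. `struct demo_eq_err_entry` is a hypothetical stand-in and not the libfabric definition: it keeps only `err` and `prov_errno` from the real entry and adds two assumed fields, `event` and `prov_event`, to show how a log line could use them. The field names, types, and the `demo_format_eq_err` helper are all invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in; NOT the libfabric struct fi_eq_err_entry. */
struct demo_eq_err_entry {
    int      err;         /* generic error code */
    int      prov_errno;  /* provider-specific error code */
    uint32_t event;       /* assumed new field: libfabric event type */
    int      prov_event;  /* assumed new field: backend (e.g. verbs) event type */
};

/* Formats a one-line debug message; returns the snprintf result. */
static int demo_format_eq_err(const struct demo_eq_err_entry *e,
                              char *buf, size_t len)
{
    assert(e != NULL && buf != NULL && len > 0);
    return snprintf(buf, len,
                    "eq error: err=%d prov_errno=%d event=%u prov_event=%d",
                    e->err, e->prov_errno, (unsigned)e->event, e->prov_event);
}
```

With both event types in the entry, a single log line can tie a failure to the libfabric-level event and to the backend event that triggered it.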
-
From OFIWG on Feb 21:
Most notes are captured in the slide set.
There was extended discussion on whether the mr_mode could be simpler for apps. That was the original intent of the BASIC and SCALABLE options, which did not work and had to be replaced.
Although some providers may be able to handle a mr_mode bit being set or not (e.g. PROV_KEY), not all will be able to, and they may not be able to support having a mr_mode bit re-enabled. PROV_KEY is an exception, in that a provider could always just increment a value. Of course, that forces the app to exchange the key. Other bits, such as VIRT_ADDR, become part of the wire protocol and cannot be supported by some providers.
Apps will need to continue handling mr_mode bits being set/cleared if they want to support multiple providers. Otherwise, they can fail at fi_getinfo.
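As a rough illustration of what "handling mr_mode bits being set/cleared" means for an app, the sketch below branches on two of the bits named above. The `DEMO_MR_*` constants are placeholders with made-up values so the example compiles on its own; real code would include `<rdma/fi_domain.h>` and test `FI_MR_VIRT_ADDR` and `FI_MR_PROV_KEY`. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder bit values for illustration only; real code would use
 * FI_MR_VIRT_ADDR and FI_MR_PROV_KEY from <rdma/fi_domain.h>. */
#define DEMO_MR_VIRT_ADDR (1u << 0)
#define DEMO_MR_PROV_KEY  (1u << 1)

/* Address an initiator passes as the RMA target: the target's virtual
 * address when VIRT_ADDR is set, otherwise an offset into the region. */
static uint64_t demo_rma_target_addr(uint32_t mr_mode,
                                     uint64_t region_base, uint64_t offset)
{
    if (mr_mode & DEMO_MR_VIRT_ADDR) {
        assert(offset <= UINT64_MAX - region_base); /* guard against wrap */
        return region_base + offset;
    }
    return offset;
}

/* With PROV_KEY the provider picks the key, so peers must exchange it at
 * run time; without it the app picks the key and may agree on it ahead
 * of time. Returns 1 if an exchange is required, 0 otherwise. */
static int demo_must_exchange_key(uint32_t mr_mode)
{
    return (mr_mode & DEMO_MR_PROV_KEY) != 0;
}
```

An app that supports multiple providers would carry both branches, as the notes say; one that targets a single behavior can instead request only the bits it handles and let fi_getinfo fail elsewhere.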
-
Attaching latest copy of slides presented at ofiwg.
-
2024 Roadmap including 2.0 schedule (from 2/6/2024 OFIWG meeting):
-
Cisco asked to keep usnic.
-
For bookkeeping, here is the deck of slides presented at the OFIWG meeting on 6/11/2024 for the updated 2.0 strategy and timeline:
The schedule is mostly the same as the above roadmap. The changes are:
2.0alpha is behind schedule, now targeting late August.
-
Discussion on whether we should move toward a 2.0 release, which allows breaking API/ABI compatibility, with a proposed target date in the second half of 2023, after a v1.19 release.
The general proposed goal is: 2.0 should be a drop-in replacement for most existing binaries that were compiled and work against version 1.X (with X to be decided).
The purpose of the 2.0 release is to simplify the API by removing little-used features and re-examining how narrow-use features are exposed. The first objective is to list potential features to remove, identify whether there are users, and update the list accordingly. This is a first pass at features to examine for removal or re-work, grouped by header file. Listing as a task list with the expectation that marking an item as 'done' means it has been discussed, not that it has been selected for removal. (If there's a different markdown option here
fi_log.h
fi_prov.h
fi_trigger.h
fi_tagged.h
fi_rma.h
fi_ext.h
fi_errno.h
fi_eq.h
fi_endpoint.h
BUFFERED_MIN, BUFFERED_LIMIT, SEND_BUF_SIZE, RECV_BUF_SIZE,
TX_SIZE, RX_SIZE
fi_domain.h
fi_collective.h
fi_cm.h
fi_atomic.h
fabric.h
field removal has potential to impact most apps (want compatibility)
E.g. app must use fi_allocinfo or fi_dupinfo
Providers