fabtests: make source addressing usage more user friendly and universal #9807
Comments
The error returns from fi_getinfo need significant improvement. In general, an error like ret=-61 (No data available) means libfabric attempted to enumerate all the NICs but did not find a provider that offered an acceptable NIC. This error occurs often at customer sites and carries no actionable information. Usually the cause is that the desired provider was not available on the system, or that the desired provider could not find an acceptable NIC to offer.

The next step is often to repeat the test with FI_LOG_LEVEL=info. However, some patches in these code paths a couple of years ago (commit f4715e8) turned FI_WARN and FI_INFO calls into FI_DBG, so typical non-debug builds lack the key messages about device and provider discovery that are needed to debug this. End users and in-distro libfabric users are therefore typically stuck at this point. They must either resort to provider-specific mechanisms to debug what is happening, or locate the libfabric source and rebuild it with debug enabled, making sure not to change any other options. That task is beyond a typical sysadmin using an in-distro libfabric, or an ISV-provided MPI or application stack that includes libfabric.

The ideal customer-facing answer would be for provider enumeration to accumulate a set of text messages from each provider. When a provider fails to find an acceptable device, it could supply a more detailed string explaining why (probably a list of strings naming the NICs it looked at and why it rejected each one). Then, if fi_getinfo fails to find any provider, it could output (or return) a detailed message showing which providers it attempted and why each one indicated it could not find a device. Such strings may be long.
I've implemented logging mechanisms like this in past products. It amounted to retaining a tree of error messages, with a list per provider, then outputting the tree only at the higher-level routine where the issue was "realized" and discarding the tree if at least one provider successfully found NICs.
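To make the proposal concrete, here is a minimal sketch in C of the deferred diagnostic tree described above. It is not libfabric code and not the mechanism from those past products: every name in it (`diag_tree`, `prov_diag`, `diag_reject`, `diag_accept`, `diag_report`) and the fixed-size limits are hypothetical, chosen only for illustration. Each provider gets its own list of rejection reasons, and the report routine prints the tree only when no provider found a NIC.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical limits, chosen only to keep the sketch allocation-free. */
#define MAX_PROVS 8
#define MAX_MSGS  16
#define MSG_LEN   128

/* Hypothetical per-provider node: rejection reasons gathered during enumeration. */
struct prov_diag {
    char name[32];
    int  found_nic;                  /* set once the provider offers a usable NIC */
    int  nmsgs;
    char msgs[MAX_MSGS][MSG_LEN];
};

/* Hypothetical root of the tree: one node per provider that was tried. */
struct diag_tree {
    int nprovs;
    struct prov_diag provs[MAX_PROVS];
};

/* Find the node for a provider, creating it on first use. */
static struct prov_diag *diag_provider(struct diag_tree *t, const char *name)
{
    assert(t && name);
    for (int i = 0; i < t->nprovs; i++)
        if (strcmp(t->provs[i].name, name) == 0)
            return &t->provs[i];
    if (t->nprovs == MAX_PROVS)
        return NULL;                 /* tree is full: drop the message */
    struct prov_diag *p = &t->provs[t->nprovs++];
    memset(p, 0, sizeof(*p));
    snprintf(p->name, sizeof(p->name), "%s", name);
    return p;
}

/* Record why a provider rejected one NIC. Nothing is printed here. */
static void diag_reject(struct diag_tree *t, const char *prov,
                        const char *nic, const char *why)
{
    struct prov_diag *p = diag_provider(t, prov);
    if (!p || p->nmsgs == MAX_MSGS)
        return;
    snprintf(p->msgs[p->nmsgs++], MSG_LEN, "%s: %s", nic, why);
}

/* Record that a provider found at least one usable NIC. */
static void diag_accept(struct diag_tree *t, const char *prov)
{
    struct prov_diag *p = diag_provider(t, prov);
    if (p)
        p->found_nic = 1;
}

/*
 * Called at the top level, where the failure is "realized".
 * If any provider succeeded, the tree is discarded silently and 0 is
 * returned; otherwise every provider's reasons are printed and the
 * number of lines written is returned.
 */
static int diag_report(const struct diag_tree *t, FILE *out)
{
    int lines = 0;
    for (int i = 0; i < t->nprovs; i++)
        if (t->provs[i].found_nic)
            return 0;
    for (int i = 0; i < t->nprovs; i++) {
        fprintf(out, "provider %s: no usable NIC\n", t->provs[i].name);
        lines++;
        for (int j = 0; j < t->provs[i].nmsgs; j++) {
            fprintf(out, "  %s\n", t->provs[i].msgs[j]);
            lines++;
        }
    }
    return lines;
}
```

In this sketch a caller would zero-initialize a `struct diag_tree`, call `diag_reject` or `diag_accept` while walking providers, and call `diag_report(&tree, stderr)` once at the point where the lookup would otherwise return only a bare error code. Deferring all output to that single call is what keeps the successful path quiet.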
Running the same fabtests test with FI_LOG_LEVEL=info
Hi there @danielap1996 and thanks for opening the issue!
That response was super fast!
Could you please change the test to be a bit more "friendly" to users?
@danielap1996 Yeah, there are definitely some issues with how fabtests handles source addressing. Providers handle it differently, so it's difficult to make a universal solution that is also correct with respect to the API without forcing something that works. I'm going to change your issue title to reflect the request for clarification so we can track it and make sure we address it in the future.
Hi, I was trying to run some of the fabtests tests, but they were failing with fi_getinfo(): unit/av_test.c:1148, ret=-61 (No data available)
This is how I installed libfabric:
This is how I installed fabtests:
test run example:
fi_info -l output: