diffusion-nodes can fail to load the checkpoint on some ranks #281
Comments
Another run
Note that the failing file is different in each of the two errors.
Note that unet is loaded AFTER vae, so the error seems to follow the load order.
Files are all local to the machine on
Note that it works after a few retries, so the issue seems to be timing-related. It appeared on the H100 nodes, and I don't recall seeing it on A100.
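Since retrying eventually succeeds, a possible stopgap while the root cause is investigated is to wrap the load in a retry with backoff. This is only a sketch: `load_with_retry` and `load_fn` are hypothetical names, not part of this project, and it assumes the failure surfaces as an `OSError` (for example `FileNotFoundError`), which has not been confirmed from the logs.

```python
import time


def load_with_retry(load_fn, path, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call load_fn(path), retrying on OSError with exponential backoff.

    load_fn is whatever callable actually loads the checkpoint file
    (hypothetical here). The OSError assumption may not match the real error.
    """
    last_err = None
    for attempt in range(attempts):
        try:
            return load_fn(path)
        except OSError as err:
            last_err = err
            # Do not sleep after the final failed attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_err
```

This would hide the symptom rather than fix it, so logging each failed attempt with the rank and path would still be useful for narrowing down the cause.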
It's in /tmp, so would it be an issue with the node's FS? 🤔 Does it make a difference if you swap which of the 2 nodes you use as the master?
Olexa suggests it might be an open-file ulimit issue.
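One way to check the ulimit hypothesis would be to log the open-file limit and the current descriptor count from inside each rank just before the checkpoint is loaded. A minimal sketch, assuming Linux (it reads `/proc/self/fd`); `fd_usage` is a hypothetical helper name:

```python
import os
import resource


def fd_usage():
    """Return the soft/hard open-file limits and an approximate open-fd count."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        # Approximate: the listing itself briefly holds one descriptor.
        open_fds = len(os.listdir("/proc/self/fd"))
    except FileNotFoundError:
        # /proc is not available (non-Linux system).
        open_fds = None
    return {"soft": soft, "hard": hard, "open": open_fds}


if __name__ == "__main__":
    print(fd_usage())
```

If the open count on the failing rank is close to the soft limit, that would support the ulimit explanation. If it is far below, the limit is probably not the cause.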
Note that the file does exist and was rsynced to the local node beforehand.
Additionally, the error only appears on rank14. If the file were missing, we would expect all of rank[8-15] to print an error.