Use fault tolerant symlink during training to handle temporary file system inconsistencies. #1093
Distributed file systems are not always synchronous. For example, a deleted file may still appear to be present until the file system synchronizes. This can cause FileExistsError exceptions during training when updating symlinks to the best/current parameter files.

This pull request adds and applies a fault tolerant symlink wrapper that makes multiple attempts before raising an OSError. By default, 6 attempts are made over the course of 63 seconds. This should prevent brief file system inconsistencies from causing errors during parameter file symlink updates.

Unrelated: The latest version of deepspeed is not compatible with our current usage. The requirement is updated to deepspeed==0.6.5.
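
For readers who want a concrete picture of the retry behavior described above, here is a minimal sketch. It is not the code from this pull request: the function name `symlink_with_retries`, its parameters, and the injectable `sleep` argument are hypothetical. The doubling wait schedule is also an assumption, chosen only because 1 + 2 + 4 + 8 + 16 + 32 = 63 is consistent with the stated default of 6 attempts over 63 seconds.

```python
import os
import time


def symlink_with_retries(src, dst, max_attempts=6, base_wait=1.0, sleep=time.sleep):
    """Create a symlink at dst pointing to src, retrying on FileExistsError.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    for attempt in range(max_attempts):
        try:
            os.symlink(src, dst)
            return
        except FileExistsError:
            # Assume a previously deleted dst is still visible because the
            # file system has not synchronized yet. Wait, doubling each time
            # (assumed schedule: 1, 2, 4, 8, 16, 32 seconds by default).
            sleep(base_wait * 2 ** attempt)
    # Every attempt failed, so report the problem to the caller.
    raise OSError(
        f"Could not create symlink {dst} -> {src} after {max_attempts} attempts"
    )
```

A caller would remove the old link first and then call the helper in place of `os.symlink(src, dst)`. Passing `sleep` as a parameter is purely a convenience for testing the schedule without actually waiting.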

Pull Request Checklist
until you can check this box.
(pytest)
(pytest test/system)
(./style-check.sh)
sockeye/__init__.py. Major version bump if this is a backwards incompatible change.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.