Use fault tolerant symlink during training to handle temporary file system inconsistencies. #1093

Merged (6 commits) on Aug 10, 2023

mjdenkowski (Contributor) commented Aug 10, 2023

Distributed file systems are not always synchronous. For example, a deleted file may still appear to be present until the file system synchronizes. This can cause FileExistsErrors during training when updating symlinks to the best/current parameter files.

This pull request adds a fault-tolerant symlink wrapper and applies it wherever training updates parameter file symlinks. The wrapper makes multiple attempts before raising an OSError; by default, it makes 6 attempts over the course of 63 seconds. This should prevent brief file system inconsistencies from causing errors during symlink updates.
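For readers who want a concrete picture of the retry behavior described above, here is a minimal sketch. It is not the code merged in this pull request: the function name, signature, and `base_delay` parameter are hypothetical, and the exponential backoff schedule (1, 2, 4, 8, 16, 32 seconds) is an assumption chosen because it is one schedule that adds up to the stated 63 seconds over 6 attempts.

```python
import os
import time


def fault_tolerant_symlink(src: str, dst: str,
                           max_attempts: int = 6,
                           base_delay: float = 1.0) -> None:
    """
    Create a symlink at dst pointing to src, retrying when dst still
    appears to exist (for example, a deletion that has not yet propagated
    on a distributed file system).

    Hypothetical sketch: names, defaults, and the backoff schedule are
    assumptions, not the merged implementation.
    """
    for attempt in range(max_attempts):
        try:
            os.symlink(src, dst)
            return
        except FileExistsError:
            # Assumed schedule: wait base_delay * 2**attempt seconds after
            # each failed attempt (1 + 2 + 4 + 8 + 16 + 32 = 63 seconds
            # with the defaults).
            time.sleep(base_delay * 2 ** attempt)
    raise OSError(
        f"Unable to create symlink {dst} -> {src} after {max_attempts} attempts"
    )
```

In this sketch only FileExistsError triggers a retry, so unrelated failures such as a missing parent directory or a permissions problem surface immediately instead of being delayed by the backoff.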

Unrelated: The latest version of deepspeed is not compatible with our current usage. The requirement is updated to deepspeed==0.6.5.

Pull Request Checklist

  • Changes are complete (if posting work-in-progress code, prefix your pull request title with '[WIP]' until you can check this box).
  • Unit tests pass (pytest)
  • System tests pass (pytest test/system)
  • Passed code style checking (./style-check.sh)
  • You have considered writing a test
  • Updated major/minor version in sockeye/__init__.py. Major version bump if this is a backwards incompatible change.
  • Updated CHANGELOG.md

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

mjdenkowski merged commit 8753d95 into main on Aug 10, 2023
4 checks passed
mjdenkowski deleted the fault_tolerant_symlink branch on August 10, 2023 at 16:45