Use one param coordinator for both train/inference scenarios #6662

tohtana · 2024-10-23T18:05:57Z

The parameter coordinator in ZeRO-3 raises a "backward pass is invalid for module in evaluation mode" error whenever a backward pass runs on a module that is not in training mode, because it expects all modules to be in training mode during the backward pass. This restriction is unnecessarily strict.
This PR relaxes the restriction by using a single parameter coordinator (instead of separate coordinators for training and evaluation modes) and resetting the prefetch state before each forward pass.
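
The idea can be illustrated with a minimal, self-contained sketch. This is not DeepSpeed's implementation: `ToyParamCoordinator`, `ToyEngine`, `reset_prefetch_state`, and `fetch` are hypothetical names invented for illustration, and parameter fetching is reduced to appending to a list. It only shows the shape of the change: one coordinator shared by both modes, a reset at the start of every forward pass, and no training-mode check in backward.

```python
class ToyParamCoordinator:
    """Hypothetical stand-in for a parameter coordinator (not DeepSpeed's class)."""

    def __init__(self):
        self.step_id = 0      # position within the current forward/backward pass
        self.fetched = []     # (step_id, module_name) pairs seen in this pass

    def reset_prefetch_state(self):
        # Assumed behavior: forget per-pass bookkeeping so the next pass starts clean.
        self.step_id = 0
        self.fetched.clear()

    def fetch(self, module_name):
        # Stand-in for gathering a module's partitioned parameters.
        self.fetched.append((self.step_id, module_name))
        self.step_id += 1


class ToyEngine:
    """Hypothetical engine that owns exactly one coordinator for train and eval."""

    def __init__(self, module_names):
        self.module_names = list(module_names)
        self.training = True
        self.coordinator = ToyParamCoordinator()  # single instance, regardless of mode

    def train(self, mode=True):
        self.training = mode
        return self

    def eval(self):
        return self.train(False)

    def forward(self):
        # Reset before every forward pass, whichever mode is active.
        self.coordinator.reset_prefetch_state()
        for name in self.module_names:
            self.coordinator.fetch(name)
        return list(self.coordinator.fetched)

    def backward(self):
        # No check on self.training here: backward is allowed in eval mode too.
        for name in reversed(self.module_names):
            self.coordinator.fetch(name)
        return list(self.coordinator.fetched)


engine = ToyEngine(["embed", "block", "head"])
engine.eval()
engine.forward()
engine.backward()   # does not raise even though the engine is in evaluation mode
engine.train()
engine.forward()    # same coordinator, state reset at the start of the pass
```

Because the state is reset at the start of each forward pass, switching between `train()` and `eval()` in this sketch never leaves stale bookkeeping behind from the previous mode.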

Use of is_compiling needs to be fixed after #6663 is merged.

…soft/DeepSpeed into tohtana/simplify_param_coordinator

use one param coordinator for both train/inference modes

723289f

tohtana mentioned this pull request Oct 23, 2024

Clean all param coordinators #6661

Closed

tohtana and others added 7 commits October 23, 2024 21:20

fix initial trace mode

b3b7bc9

stop tracing for compiler from modifying prefetch info

c4ac674

remove arg to choose param coordinator

cb2b25d

Merge branch 'master' into tohtana/simplify_param_coordinator

ba3c00c

Merge branch 'master' into tohtana/simplify_param_coordinator

02d044e

test prefetching while switching modes

04d9dda

Merge branch 'tohtana/simplify_param_coordinator' of github.com:micro…

54525e0

…soft/DeepSpeed into tohtana/simplify_param_coordinator

tohtana marked this pull request as ready for review October 25, 2024 21:31

tohtana requested review from tjruwase and loadams as code owners October 25, 2024 21:31

tohtana and others added 8 commits October 25, 2024 14:32

Merge branch 'master' into tohtana/simplify_param_coordinator

92a0dd0

change test parameter

acb9efa

Merge branch 'tohtana/simplify_param_coordinator' of github.com:micro…

d24eb64

…soft/DeepSpeed into tohtana/simplify_param_coordinator

Merge branch 'master' into tohtana/simplify_param_coordinator

8d67ddd

Merge branch 'master' into tohtana/simplify_param_coordinator

2728865

Merge branch 'master' into tohtana/simplify_param_coordinator

ba9bd3c

Merge branch 'master' into tohtana/simplify_param_coordinator

85d7818

replace is_compiling calls with wrapper

db8f2cf

tohtana enabled auto-merge November 5, 2024 19:14

tjruwase approved these changes Nov 5, 2024

tohtana added this pull request to the merge queue Nov 5, 2024

Merged via the queue into master with commit 351569d Nov 6, 2024
14 checks passed
