Fix deadlock in PipeEngine._exec_recv_grads #5518

Open

i4never wants to merge 1 commit into master from fix/pipeengine_communication
Conversation

@i4never commented on May 10, 2024

I'm using Megatron-DeepSpeed with TP/PP/DP. In my case there are three tensors that need to be communicated between pipeline stages (a minimal illustration follows this list):

  • `hidden_state` (floating point, needs grad)
  • `attention_mask` (int32, no grad)
  • `cached_rotray_embedding` (floating point, no grad)
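
For concreteness, here is a minimal, hypothetical stand-in for such an output tuple (shapes and values are illustrative only, not taken from Megatron-DeepSpeed):

```python
import torch

# Illustrative stand-ins for the three tensors described above.
hidden_state = torch.randn(4, 16, requires_grad=True)   # floating point, needs grad
attention_mask = torch.ones(4, 4, dtype=torch.int32)    # int32, no grad
cached_rotray_embedding = torch.randn(4, 8)              # floating point, no grad

outputs = (hidden_state, attention_mask, cached_rotray_embedding)
```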

Only the first tensor has a grad, which satisfies the restriction in PipelineEngine here:

```python
# TODO: Improve pipe partitioning to pass multiple tensors that require grads
assert all([torch.is_tensor(elt) and elt.requires_grad is False for elt in outputs[1:]])
outputs_tail = outputs[1:]
```

On the sending side, only the gradient of the first tensor is sent (as two messages, because the gradient is partitioned):

```python
if self.is_grad_partitioned:
    # First two sends are partitioned gradient
    p2p.send(inputs[0], self.prev_stage)
    p2p.send(inputs[1], self.prev_stage)
```

But the next stage (in `_exec_recv_grads`) tries to receive more than one gradient, because `tensor.is_floating_point()` is used to filter the outputs. In my case, `cached_rotray_embedding` is a floating-point tensor with no grad, so it is caught by the filter. The receiving stage then expects more data than was sent, which makes training hang:

```python
if self.is_grad_partitioned:
    sizes_and_dtypes = [(list(t.size()), t.dtype)
                        for t in outputs[:2]] + [(list(t.size()), t.dtype)
                                                 for t in outputs[2:] if t.is_floating_point()]
```
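
With the illustrative `outputs` tuple from above, a rough count (not the engine's exact bookkeeping) shows the mismatch:

```python
# Receive side: two entries for the partitioned gradient of the first tensor,
# plus one entry per floating-point tensor in outputs[2:] --
# cached_rotray_embedding is caught by that filter even though it has no grad.
expected = ([(list(t.size()), t.dtype) for t in outputs[:2]]
            + [(list(t.size()), t.dtype) for t in outputs[2:] if t.is_floating_point()])

num_sent = 2  # send side: only the partitioned gradient of the first tensor

print(len(expected), num_sent)  # 3 vs 2 -> the receiver waits forever
```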

Since only one gradient is sent anyway, we don't need the `is_floating_point` filter here.
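
A minimal sketch of the idea (assuming the send-side code quoted above; not necessarily the exact diff in this PR) would be to expect only what is actually sent:

```python
if self.is_grad_partitioned:
    # The previous stage sends exactly two messages -- the partitioned gradient
    # of the first output -- so expect only those two and drop the
    # is_floating_point() filter over outputs[2:].
    sizes_and_dtypes = [(list(t.size()), t.dtype) for t in outputs[:2]]
```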

i4never requested a review from duli2012 as a code owner on May 10, 2024
i4never force-pushed the fix/pipeengine_communication branch from 3ccae4b to 890cccc on May 14, 2024
@i4never (Author) commented on May 14, 2024

Rebased onto master.

i4never force-pushed the fix/pipeengine_communication branch 2 times, most recently from 0d5f2f5 to 5d10817 on May 15, 2024
@i4never (Author) commented on May 15, 2024

Hi @tjruwase, could you please review this?

@tjruwase (Contributor) commented

> Hi @tjruwase, could you please review this?

@i4never, thanks for this PR. This is very old code, which is documented as hacky, and unfortunately the author is no longer available. Although I think your change is correct, I think it is best to be more cautious. So, can you please add some unit tests?

i4never force-pushed the fix/pipeengine_communication branch 2 times, most recently from 156a2d7 to ad4fe8a on May 21, 2024
i4never force-pushed the fix/pipeengine_communication branch from ad4fe8a to 46eb620 on May 30, 2024
i4never force-pushed the fix/pipeengine_communication branch from 46eb620 to 50ec241 on June 14, 2024
i4never force-pushed the fix/pipeengine_communication branch from 34b1fd1 to 3a58893 on July 16, 2024
@loadams (Contributor) commented on Jul 16, 2024

> Hi @tjruwase, could you please review this?
>
> @i4never, thanks for this PR. This is very old code, which is documented as hacky, and unfortunately the author is no longer available. Although I think your change is correct, I think it is best to be more cautious. So, can you please add some unit tests?

@i4never - would you be able to add any unit tests?

@i4never (Author) commented on Jul 18, 2024

Sure, I'm working on this.

@loadams (Contributor) commented on Jul 24, 2024

> Sure, I'm working on this.

Thanks @i4never - ping me whenever this needs review/tests/etc.

i4never force-pushed the fix/pipeengine_communication branch from 9ae7461 to 29419dc on August 28, 2024