I'm facing an issue with gradient overflow when training a model using specific combinations of GPUs in a multi-GPU setup with 4 identical NVIDIA RTX 3090 GPUs on a single machine.
The issue occurs only when using GPU 2 and GPU 3 simultaneously. When I use GPUs 0, 1, and 2, or GPUs 0, 1, and 3, the training works fine. However, when using GPUs 2 and 3 together, I consistently encounter "Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to..." at the start of training.
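For reference, that message is printed by the dynamic loss scaler (Apex-style AMP): inf/NaN gradients were detected on at least one rank, so the step is skipped and the loss scale is reduced. A quick way to narrow down which GPU actually produces the bad gradients is to inspect them just before the optimizer step. This is only a sketch and assumes a standard PyTorch DDP training loop; the function name `check_grad_overflow` and the `model`/`step` arguments are placeholders for whatever your loop uses.

```python
import torch
import torch.distributed as dist

def check_grad_overflow(model, step):
    """Log which rank (GPU) holds inf/NaN gradients before optimizer.step()."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    bad = []
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            bad.append(name)
    if bad:
        print(f"[step {step}] rank {rank}: non-finite grads in {len(bad)} tensors, "
              f"e.g. {bad[:3]}", flush=True)

# In the training loop, call after backward() and before optimizer.step():
#   check_grad_overflow(model, step)
```

If the overflow consistently comes from the rank pinned to GPU 2 or GPU 3, that points at the hardware/interconnect rather than the model. For reference, here is the GPU topology on this machine: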
!nvidia-smi topo -m
       GPU0  GPU1  GPU2  GPU3  CPU Affinity  NUMA Affinity
GPU0    X    PXB   PXB   PXB   0-25,52-77    0
GPU1   PXB    X    PXB   PXB   0-25,52-77    0
GPU2   PXB   PXB    X    PIX   0-25,52-77    0
GPU3   PXB   PXB   PIX    X    0-25,52-77    0
Changing the GPU combinations to avoid using GPUs 2 and 3 together solves the issue.
The topology shows that GPU 2 and GPU 3 are the only pair connected through a single PCIe switch (PIX), while every other pair goes through multiple bridges (PXB), so the problem seems tied to that specific PCIe link, but I'm not sure how to investigate it further.
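One way to test that link in isolation, outside of any training run, is to push a known tensor back and forth between cuda:2 and cuda:3 and verify it survives the round trip bit-for-bit. The sketch below assumes plain PyTorch; `test_p2p_link` and its parameters are made up for illustration:

```python
import torch

def test_p2p_link(src=2, dst=3, size_mb=256, iters=20):
    """Copy a known pattern between two GPUs and verify it comes back intact."""
    n = size_mb * 1024 * 1024 // 4  # number of float32 elements
    a = torch.arange(n, dtype=torch.float32, device=f"cuda:{src}")
    for i in range(iters):
        b = a.to(f"cuda:{dst}")   # src -> dst over PCIe (P2P if enabled)
        c = b.to(f"cuda:{src}")   # dst -> src back again
        torch.cuda.synchronize(src)
        torch.cuda.synchronize(dst)
        if not torch.equal(a, c):
            print(f"iteration {i}: round-trip mismatch between cuda:{src} and cuda:{dst}")
            return False
    print(f"{iters} round trips of {size_mb} MiB OK between cuda:{src} and cuda:{dst}")
    return True

if __name__ == "__main__":
    test_p2p_link()
```

If that check fails or is much slower for the 2↔3 pair than for other pairs, the CUDA samples' p2pBandwidthLatencyTest gives a similar lower-level check, and running the training with NCCL_P2P_DISABLE=1 (which makes NCCL stage transfers through host memory instead of using peer-to-peer PCIe) is a common way to confirm whether the P2P path itself is at fault. Both are suggestions to try rather than a known fix.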