I'm facing an issue with gradient overflow when training a model using specific combinations of GPUs in a multi-GPU setup with 4 identical NVIDIA RTX 3090 GPUs on a single machine.
The issue occurs only when using GPU 2 and GPU 3 simultaneously. When I use GPUs 0, 1, and 2, or GPUs 0, 1, and 3, the training works fine. However, when using GPUs 2 and 3 together, I consistently encounter "Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to..." at the start of training.
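For reference, that message is printed by the dynamic loss scaler (Apex-style AMP): inf/NaN gradients were detected on at least one rank, so the step is skipped and the loss scale is reduced. A quick way to narrow down which GPU actually produces the bad gradients is to inspect them just before the optimizer step. This is only a sketch and assumes a standard PyTorch DDP training loop; the function name `check_grad_overflow` and the `model`/`step` arguments are placeholders for whatever your loop uses.

```python
import torch
import torch.distributed as dist

def check_grad_overflow(model, step):
    """Log which rank (GPU) holds inf/NaN gradients before optimizer.step()."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    bad = []
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            bad.append(name)
    if bad:
        print(f"[step {step}] rank {rank}: non-finite grads in {len(bad)} tensors, "
              f"e.g. {bad[:3]}", flush=True)

# In the training loop, call after backward() and before optimizer.step():
#   check_grad_overflow(model, step)
```

If the overflow consistently comes from the rank pinned to GPU 2 or GPU 3, that points at the hardware/interconnect rather than the model. For reference, here is the GPU topology on this machine: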
!nvidia-smi topo -m
       GPU0  GPU1  GPU2  GPU3  CPU Affinity  NUMA Affinity
GPU0    X    PXB   PXB   PXB   0-25,52-77    0
GPU1   PXB    X    PXB   PXB   0-25,52-77    0
GPU2   PXB   PXB    X    PIX   0-25,52-77    0
GPU3   PXB   PXB   PIX    X    0-25,52-77    0
Changing the GPU combinations to avoid using GPUs 2 and 3 together solves the issue.
The topology shows that GPU 2 and GPU 3 are the only pair connected through a single PCIe switch (PIX), while every other pair goes through multiple bridges (PXB), so the problem seems tied to that specific PCIe link, but I'm not sure how to investigate it further.
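One way to test that link in isolation, outside of any training run, is to push a known tensor back and forth between cuda:2 and cuda:3 and verify it survives the round trip bit-for-bit. The sketch below assumes plain PyTorch; `test_p2p_link` and its parameters are made up for illustration:

```python
import torch

def test_p2p_link(src=2, dst=3, size_mb=256, iters=20):
    """Copy a known pattern between two GPUs and verify it comes back intact."""
    n = size_mb * 1024 * 1024 // 4  # number of float32 elements
    a = torch.arange(n, dtype=torch.float32, device=f"cuda:{src}")
    for i in range(iters):
        b = a.to(f"cuda:{dst}")   # src -> dst over PCIe (P2P if enabled)
        c = b.to(f"cuda:{src}")   # dst -> src back again
        torch.cuda.synchronize(src)
        torch.cuda.synchronize(dst)
        if not torch.equal(a, c):
            print(f"iteration {i}: round-trip mismatch between cuda:{src} and cuda:{dst}")
            return False
    print(f"{iters} round trips of {size_mb} MiB OK between cuda:{src} and cuda:{dst}")
    return True

if __name__ == "__main__":
    test_p2p_link()
```

If that check fails or is much slower for the 2↔3 pair than for other pairs, the CUDA samples' p2pBandwidthLatencyTest gives a similar lower-level check, and running the training with NCCL_P2P_DISABLE=1 (which makes NCCL stage transfers through host memory instead of using peer-to-peer PCIe) is a common way to confirm whether the P2P path itself is at fault. Both are suggestions to try rather than a known fix.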