
Throughput not scaling with additional GPUs #391

Answered by AlpinDale
murtaza-nasir asked this question in Q&A

Hello!
I usually recommend spinning up one instance per 2x GPUs if your model is small. This is pretty much because:

  1. Your GPUs may not be NVLinked, and would suffer from excessive comms overhead through PCIe (see the note after this list for how to check).
  2. There may not be P2P access between the GPUs, which also adds excessive overhead.
  3. P2P is inefficient across 4+ PCIe GPUs; this is generally not an issue with NVLink.
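
One way to check point 1: running nvidia-smi topo -m prints the pairwise link matrix. NV# entries mean the GPUs are connected over NVLink, while PIX, PHB, NODE, and SYS mean the traffic goes over progressively longer PCIe/system paths. (This is standard nvidia-smi output, not something specific to this project.)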

Also, we have the custom all-reduce kernels disabled by default in main, so you're using torch's default all-reduce, which isn't very efficient. They've been stabilized and enabled by default in the dev branch.
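
If you want a rough feel for how much the all-reduce itself costs on your topology, here's a minimal sketch (mine, not from the project) that times torch.distributed.all_reduce over NCCL; the tensor size and iteration count are arbitrary illustration values:

# Launch with: torchrun --nproc-per-node=2 allreduce_bench.py
import os
import time

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

x = torch.randn(16 * 1024 * 1024, device="cuda")  # ~64 MB of fp32

for _ in range(5):  # warm up NCCL communicators before timing
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
dt = (time.perf_counter() - t0) / iters

if dist.get_rank() == 0:
    print(f"all_reduce of 64 MB: {dt * 1e3:.2f} ms/iter")
dist.destroy_process_group()

Comparing the number you get on an NVLinked pair vs. a PCIe-only pair makes the overhead described above pretty visible.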

You can check if P2P access is available across ranks with this:

>>> import torch
>>> torch.cuda.can_device_access_peer(0, 1)  # True if GPU 0 has P2P access to GPU 1
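
For more than two GPUs, a small loop (my sketch, using the same torch API as above) prints the full pairwise P2P matrix:

>>> import torch
>>> n = torch.cuda.device_count()
>>> for i in range(n):
...     # "Y" if GPU i can directly access GPU j's memory, "N" otherwise
...     row = ["Y" if i == j or torch.cuda.can_device_access_peer(i, j) else "N" for j in range(n)]
...     print(f"GPU{i}: " + " ".join(row))

If any off-diagonal entry is N, that pair is staging transfers through host memory, which matches points 2 and 3 above.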

Answer selected by murtaza-nasir