Throughput not scaling with additional GPUs #391
-
I am serving some LLM models using the Aphrodite engine on a system with 4x NVIDIA 3090 GPUs. I've noticed that I get the same inference throughput whether I run Aphrodite with 2 GPUs (`-tp 2`) or 4 GPUs (`-tp 4`). When running with `-tp 4`, gpustat shows all 4 GPUs being utilized at 100%. However, one of my two power supplies, which powers 3 of the GPUs and the rest of the system, is only pulling around 1150W.

As a test, I tried running two separate Aphrodite servers, each with the same model loaded, and used a separate ThreadPoolExecutor to feed requests to each server. In this dual-server setup, the system power draw increases to ~1600W, indicating the GPUs are being utilized more heavily. However, when I feed the same 1000 prompts to a single Aphrodite server, whether using 2 GPUs or 4 GPUs, the inference time is the same. I'm using a single ThreadPoolExecutor with the default worker limit to submit requests.

Is there any recommended way to improve the throughput and GPU utilization of a single Aphrodite server instance when running on multiple GPUs? I would expect throughput to increase as more GPUs are used, but that doesn't seem to be the case in my current setup. Any guidance on optimizing for multi-GPU inference would be much appreciated. Let me know if additional details about my configuration would be helpful. Thanks!
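For reference, here's a simplified sketch of the dual-server test. The ports, model name, and payload are placeholders for my real setup, and I'm assuming the usual OpenAI-compatible `/v1/completions` route:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

# Two independent Aphrodite servers, each serving the same model on its own pair of GPUs.
SERVERS = ["http://localhost:8000", "http://localhost:8001"]  # placeholder ports
PROMPTS = [f"Prompt {i}" for i in range(1000)]                # stand-in for my real prompt set

def complete(base_url: str, prompt: str) -> str:
    # Blocking request; the thread pool provides the concurrency.
    resp = requests.post(
        f"{base_url}/v1/completions",
        json={"model": "my-model", "prompt": prompt, "max_tokens": 256},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

# One executor per server, default worker limit, half of the prompts each.
with ThreadPoolExecutor() as pool_a, ThreadPoolExecutor() as pool_b:
    futures = [pool_a.submit(complete, SERVERS[0], p) for p in PROMPTS[::2]]
    futures += [pool_b.submit(complete, SERVERS[1], p) for p in PROMPTS[1::2]]
    results = [f.result() for f in futures]
```

The single-server runs are the same thing with one executor and one base URL. Note that ThreadPoolExecutor's default worker limit is `min(32, os.cpu_count() + 4)` on Python 3.8+, so that's the most requests I have in flight at once per executor.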
Replies: 1 comment 9 replies
-
Hello!
I usually recommend spinning up one instance per 2x GPUs if your model is small. This is mostly because:
Also, we have the custom all-reduce kernels disabled in main by default, so you're using torch's default all-reduce, which isn't very efficient. They've been stabilized and enabled by default in the dev branch.
You can check if P2P access is available across ranks with this:
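```python
>>> import torch
>>> torch.cuda.can_device_access_peer(0, 1)
```

That would check for P2P access between GPUs 0 and 1. If you want to cover every pair on your 4-GPU box, a quick loop along these lines should do it:

```python
import torch

# Print peer-to-peer availability for every ordered pair of visible GPUs.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            print(f"GPU {i} -> GPU {j}: {torch.cuda.can_device_access_peer(i, j)}")
```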