Throughput not scaling with additional GPUs #391
-
I am serving some LLM models using the Aphrodite engine on a system with 4x NVIDIA 3090 GPUs. I've noticed that I get the same inference throughput whether I run Aphrodite with 2 GPUs (`-tp 2`) or 4 GPUs (`-tp 4`). When running with `-tp 4`, gpustat shows all 4 GPUs being utilized at 100%. However, one of my two power supplies, which powers 3 of the GPUs and the rest of the system, is only pulling around 1150W.

As a test, I tried running two separate Aphrodite servers, each with the same model loaded, and used a separate ThreadPoolExecutor to feed requests to each server. In this dual-server setup, the system power draw increases to ~1600W, indicating the GPUs are being utilized more heavily. However, when I feed the same 1000 prompts to a single Aphrodite server, whether using 2 GPUs or 4 GPUs, the inference time is the same. I'm using a single ThreadPoolExecutor with the default worker limit to submit requests.

Is there any recommended way to improve the throughput and GPU utilization of a single Aphrodite server instance when running on multiple GPUs? I would expect throughput to increase as more GPUs are used, but that doesn't seem to be the case in my current setup. Any guidance on optimizing for multi-GPU inference would be much appreciated. Let me know if additional details about my configuration would be helpful. Thanks!
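For reference, here's a simplified sketch of the dual-server test. The ports, model name, and payload are placeholders for my real setup, and I'm assuming the usual OpenAI-compatible `/v1/completions` route:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

# Two independent Aphrodite servers, each serving the same model on its own pair of GPUs.
SERVERS = ["http://localhost:8000", "http://localhost:8001"]  # placeholder ports
PROMPTS = [f"Prompt {i}" for i in range(1000)]                # stand-in for my real prompt set

def complete(base_url: str, prompt: str) -> str:
    # Blocking request; the thread pool provides the concurrency.
    resp = requests.post(
        f"{base_url}/v1/completions",
        json={"model": "my-model", "prompt": prompt, "max_tokens": 256},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

# One executor per server, default worker limit, half of the prompts each.
with ThreadPoolExecutor() as pool_a, ThreadPoolExecutor() as pool_b:
    futures = [pool_a.submit(complete, SERVERS[0], p) for p in PROMPTS[::2]]
    futures += [pool_b.submit(complete, SERVERS[1], p) for p in PROMPTS[1::2]]
    results = [f.result() for f in futures]
```

The single-server runs are the same thing with one executor and one base URL. Note that ThreadPoolExecutor's default worker limit is `min(32, os.cpu_count() + 4)` on Python 3.8+, so that's the most requests I have in flight at once per executor.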
Replies: 1 comment 9 replies
-
Hello!
I usually recommend spinning up one instance per 2x GPUs if your model is small. This is mostly because:
Also, we have the custom all-reduce kernels disabled in main by default, so you're using torch's default all-reduce, which isn't very efficient. They've been stabilized and enabled by default in the dev branch.
You can check if P2P access is available across ranks with this:
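```python
>>> import torch
>>> torch.cuda.can_device_access_peer(0, 1)
```

That would check for P2P access between GPUs 0 and 1. If you want to cover every pair on your 4-GPU box, a quick loop along these lines should do it:

```python
import torch

# Print peer-to-peer availability for every ordered pair of visible GPUs.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            print(f"GPU {i} -> GPU {j}: {torch.cuda.can_device_access_peer(i, j)}")
```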