Why does the speed not increase after compressing the model? #852
Comments
It looks like you are running the FP16 model in your launch command. That said, you are running a 3B model with tp=8, so I do not think you will see much performance benefit from FP8 in this regime: the linear layers are very small in this setup.
Sorry for the typo, it should be an 8B model:
python -m vllm.entrypoints.openai.api_server --served-model-name Llama-3.1-8B-Instruct-FP8 --model /root/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 --port 8000 --host 0.0.0.0 --tensor-parallel-size 8 --gpu-memory-utilization 0.95 --dtype bfloat16 --quantization compressed-tensors
Any ideas on how to speed up the compressed model with vLLM?
One last question - is this running on an H100?
Yep, 8x H100 SXM5. Can I add you on Discord to share further details?
With 8x H100, your system is heavily overpowered for an 8B-parameter model, so the end-to-end speedup from quantization is small (and we have not really tuned the FP8 kernels for matrices that are so skinny). I would expect to see speedups on a single H100 at the 8B-parameter scale, though.
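For reference, here is a minimal sketch of the single-GPU comparison suggested above, using vLLM's offline API. The prompt, batch size, and FP8 checkpoint path are placeholders, not anything from this thread, and in practice each model is best benchmarked in a separate process so GPU memory is fully released between runs.

# Single-GPU BF16 vs FP8 throughput sketch (paths and workload are assumptions).
import time
from vllm import LLM, SamplingParams

prompts = ["Summarize the history of GPUs in one paragraph."] * 64  # placeholder workload
params = SamplingParams(temperature=0.0, max_tokens=256)

for path in ("meta-llama/Llama-3.1-8B-Instruct",          # BF16 baseline
             "Meta-Llama-3.1-8B-Instruct-FP8-Dynamic"):   # FP8 checkpoint (assumed local path)
    llm = LLM(model=path, tensor_parallel_size=1, gpu_memory_utilization=0.95)
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"{path}: {generated / elapsed:.0f} output tok/s")
    del llm  # free the engine before loading the next model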
Same problem. I have run an FP8-quantized MiniCPM3 (4B) on an L40 and see less than a 10% speedup.
Could you share any more details on your workload? For an L40 at the 8B model scale, I have measured roughly a 30-50% speedup for offline batch workloads.
Here's my test case. Original model: https://huggingface.co/openbmb/MiniCPM3-4B. I run the same request 10 times at each batch size (bs = 1, 2, 4, 8) and compare the average time cost between the original and quantized models. Besides, I found that setting max_model_len larger (2048 -> 8192) makes the time cost slightly lower, like 1.78 -> 1.72 at bs=8 for the quantized model, which is interesting.
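A rough sketch of that kind of batch-size sweep with vLLM's offline API is below; the model path, prompt, and output length are assumptions rather than the poster's exact settings.

# Batch-size sweep sketch: repeat the same request batch 10 times and average.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="MiniCPM3-4B-FP8", trust_remote_code=True, max_model_len=8192)
params = SamplingParams(temperature=0.0, max_tokens=512)
prompt = "Explain the trade-offs of FP8 quantization."

for bs in (1, 2, 4, 8):
    llm.generate([prompt] * bs, params)      # warm-up run
    start = time.perf_counter()
    for _ in range(10):                      # 10 timed repetitions of the same batch
        llm.generate([prompt] * bs, params)
    print(f"bs={bs}: avg {(time.perf_counter() - start) / 10:.2f} s per batch")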
https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w8a8_fp8
https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w8a8_int8
https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w4a16
I tried these example scripts to generate a new compressed checkpoint and loaded it with vLLM 0.6.3:
python -m vllm.entrypoints.openai.api_server --served-model-name /home/llm-compressor/examples/quantization_w8a8_fp8/Llama-3.1-8B-Instruct-FP8 --model meta-llama/Llama-3.1-8B-Instruct --port 8000 --host 0.0.0.0 --tensor-parallel-size 8 --gpu-memory-utilization 0.98
base model: 215 tok/s
compressed model: 205 tok/s
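For completeness, here is a condensed sketch along the lines of the linked quantization_w8a8_fp8 example for producing the FP8 checkpoint; exact import paths and APIs may differ across llm-compressor versions, and the output directory name is illustrative.

# FP8 dynamic-activation quantization sketch, following the linked w8a8_fp8 example.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize all Linear layers to FP8 (dynamic per-token activations), keeping lm_head unquantized.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

save_dir = "Llama-3.1-8B-Instruct-FP8-Dynamic"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)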