[USAGE] FP8 W8A8 (+KV) with LORA Adapters #164

Open
paulliwog opened this issue Sep 11, 2024 · 2 comments
Labels: enhancement (New feature or request)

@paulliwog

paulliwog commented Sep 11, 2024

Is your feature request related to a problem? Please describe.
We are assessing the quality and performance impact of using FP8 with Meta-Llama-3-70B and multiple LoRA adapters. We use a high rank of either 64 or 128 for our adapters, resulting in adapter sizes of up to 3.5 GB. We didn't find any documentation describing how to get the best quality results with this setup. We are currently running the following assessments:

  • Quantize the base model weights only and leave the adapters in FP16, using online dynamic quantization, to assess response quality (a sketch of this setup follows below)
  • Quantize a merged model (base plus one of our adapters) using the FP8 KV cache example, to assess quality and performance
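
For reference, the first assessment roughly follows llm-compressor's FP8_DYNAMIC example. This is only a sketch of what we're running; `MODEL_ID` and `SAVE_DIR` are placeholders for our actual paths:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Placeholders; swap in your own model path and output directory.
MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"
SAVE_DIR = "Meta-Llama-3-70B-Instruct-FP8-Dynamic"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 weights with dynamic per-token FP8 activations: no calibration data is
# needed, and lm_head stays unquantized as in the library's FP8 examples.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

The LoRA adapters are left untouched and loaded in FP16 at serving time on top of this quantized base.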

Describe the solution you'd like
We would like more guidance on the practices we should consider to get the best quality and performance for a scenario where we have a 70B base model with up to 10 large LoRA adapters. Options we are considering include quantization-aware fine-tuning of our adapters and running our model/adapters with mixed precision. Also, the FP8 KV example includes a calibration step; we are wondering whether we should build a calibration superset from our training datasets and/or production requests to improve the results.
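
To make the calibration-superset idea concrete, here is a rough sketch adapted from the FP8 KV cache example, just with the public calibration dataset swapped for our own data. `calibration.jsonl` and its `messages` field are placeholders for however we would export a mix of training prompts and sampled production requests:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 4096

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Hypothetical calibration superset: a JSONL export mixing training-set
# prompts with sampled production requests, each record holding a chat-style
# "messages" list. File name and field name are placeholders.
ds = load_dataset("json", data_files="calibration.jsonl", split="train")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def process_and_tokenize(example):
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(
        text,
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(process_and_tokenize, remove_columns=ds.column_names)

# Recipe mirroring the FP8 KV cache example: static FP8 weights and
# activations plus an FP8 KV cache scheme calibrated on the dataset above.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    output_dir="Meta-Llama-3-70B-Instruct-FP8-KV",
)
```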

@paulliwog paulliwog added the enhancement New feature or request label Sep 11, 2024
@markurtz markurtz self-assigned this Oct 18, 2024
@markurtz
Collaborator

Hi @paulliwog, thanks for using the library and sorry about the delayed response!

For now, we recommend quantizing the base model weights to FP8; there should be no need for QAT on the base. We then recommend keeping your adapters at FP16 and fully converging each of them individually on your respective datasets. Given the adapters' relatively small size, unless you're seeing significant performance degradation, we recommend keeping them at FP16 and not folding them in, so you can share a single quantized base model across all of them and avoid merging and re-quantizing for multiple adapters.
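As a rough illustration rather than a prescription, serving a single FP8 base with several FP16 adapters could look something like the following in vLLM. The model path, adapter names/paths, and parallelism settings are placeholders you'd adjust for your hardware:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One FP8-quantized base model shared by all adapters; the adapters stay in
# FP16. Model path and adapter names/paths below are placeholders.
llm = LLM(
    model="Meta-Llama-3-70B-Instruct-FP8-Dynamic",
    enable_lora=True,
    max_loras=10,        # adapters kept resident per node
    max_lora_rank=128,   # must cover your largest adapter rank; check your
                         # vLLM version's supported rank values
    tensor_parallel_size=4,
)

sampling = SamplingParams(temperature=0.0, max_tokens=256)

# Each request can target a different adapter via LoRARequest(name, id, path).
outputs = llm.generate(
    ["Summarize the quarterly report."],
    sampling,
    lora_request=LoRARequest("adapter_a", 1, "/adapters/adapter_a"),
)
print(outputs[0].outputs[0].text)
```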

There are a lot of potential options here, and most are dictated by what you have available in terms of datasets, training compute, inference compute, and how many models you plan to host in parallel.

Let me know if you'd like to dive into more detail and I'm happy to set some time aside to walk through everything.

@paulliwog
Author

Thanks for the information. I have test results from both a merged model that we compressed to FP8 and FP16 LoRA adapters over an FP8 base model, as you recommended above. We do see a quality difference in responses, with the merged model producing higher-quality generations. If you have time to do a walkthrough with us on some other options we could try, that would be great! Our current cost and hardware-availability constraints require us to run at least 10 LoRA adapters in parallel per inference node.

markmc pushed a commit to markmc/llm-compressor that referenced this issue Nov 13, 2024