Is your feature request related to a problem? Please describe.
We are assessing the quality and performance impacts of using FP8 with Meta-Llama-3-70B and multiple LoRA adapters. Our adapters use a high rank of either 64 or 128, resulting in adapter sizes of up to 3.5 GB. We didn't find any documentation describing how to get the best quality results with this setup. We are currently running the following assessments:
Quantize the base model weights only (with online dynamic FP8 quantization) and leave the adapters in FP16, to assess response quality
Quantize the base model merged with one of our adapters, following the FP8 KV example, to assess quality and performance (the merge step is sketched below)
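For the merged path, the merge step we use looks roughly like this (a minimal sketch using PEFT's merge_and_unload; the adapter path and output directory are placeholders), after which the merged checkpoint goes through the same flow as the FP8 KV example:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the FP16 base and one of our adapters (placeholder path), then fold the
# LoRA deltas into the base weights so the result can be quantized like a plain model.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B", torch_dtype="auto", device_map="auto"
)
merged = PeftModel.from_pretrained(base, "/adapters/adapter_a").merge_and_unload()
merged.save_pretrained("Meta-Llama-3-70B-adapter-a-merged")
```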
Describe the solution you'd like
We would like more guidance on practices to consider for getting the best quality and performance outcomes in a scenario with a 70B base model and up to 10 large LoRA adapters. Options we are considering include quantization-aware fine-tuning of our adapters and running the model/adapters in mixed precision. Also, the FP8 KV example includes a calibration step; we are wondering whether we should build a calibration superset from our training datasets and/or production requests to improve the results.
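To make the calibration question concrete, this is roughly what we have in mind (a sketch only: it assumes oneshot accepts a pre-built datasets.Dataset as in the repo's calibration examples, the JSONL file names are placeholders, and the records may still need the same text preprocessing those examples apply):

```python
from datasets import concatenate_datasets, load_dataset
from transformers import AutoModelForCausalLM
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3-70B"  # or a merged base+adapter checkpoint

# Placeholder files: a sample of our adapter training data plus logged production
# prompts, each with a matching "text" field.
train_slice = load_dataset("json", data_files="adapter_train_sample.jsonl", split="train")
prod_slice = load_dataset("json", data_files="prod_requests_sample.jsonl", split="train")
calib = concatenate_datasets([train_slice, prod_slice]).shuffle(seed=42).select(range(512))

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

# Static FP8 needs calibration data; the FP8 KV example's full recipe (including its
# KV-cache scheme) could be dropped in here in place of this minimal modifier.
recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=calib,
    recipe=recipe,
    max_seq_length=4096,
    num_calibration_samples=512,
    output_dir="Meta-Llama-3-70B-FP8-custom-calib",
)
```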
Hi @paulliwog, thanks for using the library and sorry about the delayed response!
For now, we recommend quantizing the base model weights to FP8; there should be no need for QAT on the base. We then recommend keeping your adapters at FP16 and fully converging them individually on your varied datasets. Given the adapters' smaller size, unless you're seeing significant performance degradation, we recommend keeping them at FP16 and not folding them in, so you can share a single quantized base model across all of them and avoid maintaining merged checkpoints for multiple adapters.
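In code, the base-only path is along these lines (a minimal sketch using the FP8_DYNAMIC preset, so no calibration data is needed; the model ID and output directory are just illustrative):

```python
from transformers import AutoModelForCausalLM
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3-70B"  # illustrative; use your base checkpoint

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

# FP8 weights with dynamic per-token activation scales: no calibration data and no QAT.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe, output_dir="Meta-Llama-3-70B-FP8-Dynamic")
# The FP16 LoRA adapters stay as-is and are applied on top of this shared base at serve time.
```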
There are a lot of potential options here, and most of them are dictated by what you have available in terms of datasets, training compute, inference compute, and how many models you plan to host in parallel.
Let me know if you'd like to dive into more detail and I'm happy to set some time aside to walk through everything.
Thanks for the information. I have test results from both a merged model compressed to FP8 and from FP16 LoRA adapters over an FP8 base model as you recommended above. We do see a quality difference, with the merged model producing higher-quality generations. If you have time to walk through some other options we could try, that would be great! Our current cost and hardware availability constraints require us to run at least 10 LoRA adapters in parallel per inference node.
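For context, the serving setup we're targeting looks roughly like this (an offline vLLM sketch; the quantized-base path and adapter names are placeholders, and max_lora_rank is raised to cover our rank-64/128 adapters):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Shared FP8-quantized base with per-request FP16 LoRA adapters.
llm = LLM(
    model="Meta-Llama-3-70B-FP8-Dynamic",  # placeholder path to the quantized base
    enable_lora=True,
    max_loras=10,          # up to 10 adapters resident in parallel
    max_lora_rank=128,     # our adapters are rank 64/128; the default limit is lower
    tensor_parallel_size=4,
)

params = SamplingParams(temperature=0.0, max_tokens=256)
out = llm.generate(
    ["<prompt for adapter A>"],
    params,
    lora_request=LoRARequest("adapter_a", 1, "/adapters/adapter_a"),  # placeholder adapter
)
print(out[0].outputs[0].text)
```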
markmc pushed a commit to markmc/llm-compressor that referenced this issue on Nov 13, 2024