[USAGE] FP8 W8A8 (+KV) with LORA Adapters #164

Open
paulliwog opened this issue Sep 11, 2024 · 2 comments
Labels: enhancement (New feature or request)

@paulliwog

paulliwog commented Sep 11, 2024

Is your feature request related to a problem? Please describe.
We are assessing the quality and performance impact of using FP8 with Meta-Llama-3-70B and multiple LoRA adapters. We use a high rank of either 64 or 128 for our adapters, resulting in adapter sizes of up to 3.5 GB. We didn't find any documentation describing how to get the best quality results with this setup. We are currently running the following assessments:

  • Quantize the base model weights only and leave the adapters in FP16, using online dynamic quantization, to assess response quality (a sketch of this setup follows below)
  • Quantize a merged model (base plus one of our adapters) using the FP8 KV cache example, to assess quality and performance
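
For reference, the first assessment roughly follows llm-compressor's FP8_DYNAMIC example. This is only a sketch of what we're running; `MODEL_ID` and `SAVE_DIR` are placeholders for our actual paths:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Placeholders; swap in your own model path and output directory.
MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"
SAVE_DIR = "Meta-Llama-3-70B-Instruct-FP8-Dynamic"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 weights with dynamic per-token FP8 activations: no calibration data is
# needed, and lm_head stays unquantized as in the library's FP8 examples.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

The LoRA adapters are left untouched and loaded in FP16 at serving time on top of this quantized base.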

Describe the solution you'd like
We would like more guidance on the practices we should consider to get the best quality and performance for a scenario where we have a 70B base model with up to 10 large LoRA adapters. Options we are considering include quantization-aware fine-tuning of our adapters and running our model/adapters with mixed precision. Also, the FP8 KV example includes a calibration step; we are wondering whether we should build a calibration superset from our training datasets and/or production requests to improve the results.
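
To make the calibration-superset idea concrete, here is a rough sketch adapted from the FP8 KV cache example, just with the public calibration dataset swapped for our own data. `calibration.jsonl` and its `messages` field are placeholders for however we would export a mix of training prompts and sampled production requests:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 4096

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Hypothetical calibration superset: a JSONL export mixing training-set
# prompts with sampled production requests, each record holding a chat-style
# "messages" list. File name and field name are placeholders.
ds = load_dataset("json", data_files="calibration.jsonl", split="train")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def process_and_tokenize(example):
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(
        text,
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(process_and_tokenize, remove_columns=ds.column_names)

# Recipe mirroring the FP8 KV cache example: static FP8 weights and
# activations plus an FP8 KV cache scheme calibrated on the dataset above.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    output_dir="Meta-Llama-3-70B-Instruct-FP8-KV",
)
```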

@paulliwog paulliwog added the enhancement New feature or request label Sep 11, 2024
@markurtz markurtz self-assigned this Oct 18, 2024
@markurtz
Collaborator

Hi @paulliwog, thanks for using the library and sorry about the delayed response!

For now, we recommend quantizing the base model weights to FP8; there should be no need for QAT on the base. We then recommend keeping your adapters at FP16 and fully converging each of them individually on your respective datasets. Given the adapters' relatively small size, unless you're seeing significant performance degradation, we recommend keeping them at FP16 and not folding them in, so you can share a single quantized base model across all of them and avoid merging and re-quantizing for multiple adapters.
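As a rough illustration rather than a prescription, serving a single FP8 base with several FP16 adapters could look something like the following in vLLM. The model path, adapter names/paths, and parallelism settings are placeholders you'd adjust for your hardware:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One FP8-quantized base model shared by all adapters; the adapters stay in
# FP16. Model path and adapter names/paths below are placeholders.
llm = LLM(
    model="Meta-Llama-3-70B-Instruct-FP8-Dynamic",
    enable_lora=True,
    max_loras=10,        # adapters kept resident per node
    max_lora_rank=128,   # must cover your largest adapter rank; check your
                         # vLLM version's supported rank values
    tensor_parallel_size=4,
)

sampling = SamplingParams(temperature=0.0, max_tokens=256)

# Each request can target a different adapter via LoRARequest(name, id, path).
outputs = llm.generate(
    ["Summarize the quarterly report."],
    sampling,
    lora_request=LoRARequest("adapter_a", 1, "/adapters/adapter_a"),
)
print(outputs[0].outputs[0].text)
```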

There are a lot of potential options here, and most are dictated by what you have available in terms of datasets, training compute, inference compute, and how many models you plan to host in parallel.

Let me know if you'd like to dive into more detail and I'm happy to set some time aside to walk through everything.

@paulliwog
Author

Thanks for the information. I have test results from both a merged model that we compressed to FP8 and FP16 LoRA adapters over an FP8 base model, as you recommended above. We do see a quality difference in responses, with the merged model producing higher-quality generations. If you have time to do a walkthrough with us on some other options we could try, that would be great! Our current cost and hardware-availability constraints require us to run at least 10 LoRA adapters in parallel per inference node.

markmc pushed a commit to markmc/llm-compressor that referenced this issue Nov 13, 2024