👋 Hello Neural Magic community developers,
I encountered an issue while calculating perplexity for a locally converted Llama3-8B sparse model produced with the llm-compressor library. I followed the sparse conversion example script, only changing the model to meta-llama/Meta-Llama-3-8B-Instruct; the sparse conversion takes about 1.2 hours to finish.
Here’s a detailed breakdown:
Describe the bug
While trying to compute the WikiText2 perplexity for a Llama3-8B model that has been sparsified (loaded from local disk), the resulting perplexity values always turn out to be NaN. I suspect that some configuration might not be set properly when using the custom SparseAutoModelForCausalLM class in combination with the compressed-tensors library.
Expected behavior
I expected the perplexity values to be reasonable and comparable to the official Hugging Face models. For example, when testing the standard Llama-3.2-3B model from Hugging Face (without sparsification), I got a perplexity of ~8.8 with the following parameters:
• max_length=16K
• stride=1, 2, 4, 8, 16K
I expected similar results for the sparse model, not NaN values.
Environment
I use a RunPod online environment with 2× A100-80GB-SXM GPUs.
To Reproduce
Steps to reproduce the behavior:
1. Convert the Llama3-8B model to a sparse version using llm-compressor.
2. Load the sparse model using **_SparseAutoModelForCausalLM_** (same process as [here](https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_24_sparse_w4a16)) and set up the environment to calculate perplexity.
3. Run the perplexity calculation on the WikiText2 dataset following Hugging Face’s [official perplexity guide](https://huggingface.co/docs/transformers/perplexity), but using the custom sparse model (a sketch of what I run is included after these steps).
4. Observe the NaN perplexity values in the output.
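For concreteness, here is a minimal sketch of the load + perplexity loop I use. The paths are placeholders, it assumes the `llmcompressor.transformers.SparseAutoModelForCausalLM` import path, and it follows the sliding-window recipe from the Hugging Face perplexity guide:

```python
# Minimal sketch of the load + perplexity loop (placeholder paths; assumes the
# llmcompressor.transformers.SparseAutoModelForCausalLM import path).
import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM

model_path = "path/to/sparse_model/stage_finetuning"  # placeholder: local sparse checkpoint
model = SparseAutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# WikiText-2 test split, concatenated and tokenized once, as in the HF guide.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_length = 16384  # the max_length=16K setting mentioned above
stride = 512        # I also tried the other stride values mentioned above
seq_len = encodings.input_ids.size(1)

nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end  # only score tokens not covered by the previous window
    input_ids = encodings.input_ids[:, begin:end].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # mask out the overlapping context tokens

    with torch.no_grad():
        nlls.append(model(input_ids, labels=target_ids).loss)

    prev_end = end
    if end == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).mean())
print(f"Perplexity: {ppl.item()}")
```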
Errors
Here’s the output I receive when running the perplexity calculation (see the attached image). The perplexity of the local Llama-8B model (loaded with the SparseAutoModelForCausalLM class) is always NaN. Testing with the Llama-3B model (loaded with the AutoModelForCausalLM class) successfully produces a perplexity value.
Sparse Llama 8B (loaded with the SparseAutoModelForCausalLM class): perplexity is NaN
Online Llama 3B (loaded with the AutoModelForCausalLM class): perplexity is computed successfully
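To narrow down where the NaN comes from, this is the single-forward-pass check I can run. It is only my own debugging guess, not something from the docs: loading in float32 is just to rule out half-precision overflow, and the path is a placeholder.

```python
# Quick NaN check on one forward pass; the float32 load is only a guess to rule
# out half-precision overflow, not an officially documented setting.
import torch
from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM

model_path = "path/to/sparse_model/stage_finetuning"  # placeholder path
model = SparseAutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float32)
tokenizer = AutoTokenizer.from_pretrained(model_path)

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, labels=inputs["input_ids"])

print("loss:", out.loss.item())
print("any NaN in logits:", torch.isnan(out.logits).any().item())
```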
Additional context
The same perplexity calculation process works perfectly when using the Hugging Face Llama-3.2-3B model without sparsification, which gives a perplexity of ~8.8. I believe the issue lies in either the custom sparse model class or the integration with compressed-tensors. Maybe I'm missing some additional configuration/setting for the sparse model? 🧐
Any guidance on this would be appreciated! 🥰
Additional Question
How do I correctly load the final quantized model (i.e. the model saved in the stage_quantization folder)?
I'm also interested in the perplexity of the final quantized model, but when I try to load it with SparseAutoModelForCausalLM it doesn't work 😢
It shows a message along the lines of: "... class not supported ..."
So how do I load the final quantized model correctly? Is there any documentation I can refer to? 🙏🏼
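In case it helps, the workaround I was considering is to load the quantized stage output with vLLM instead, since I understand vLLM can read compressed-tensors checkpoints. This is only a sketch, and the path is a placeholder:

```python
# Hedged workaround sketch: load the stage_quantization output with vLLM
# (placeholder path; assumes vLLM can read this compressed-tensors checkpoint).
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/output_folder/stage_quantization")
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```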
Hi @robertgshaw2-neuralmagic Robert, you were right to question this. I retested the original llama-7B Sparse conversion example from llm-compressor today, along with a simple model.generate test to check the model's text output. It turns out the model doesn’t seem to generate any correct outputs, and as expected, I couldn’t calculate the model’s perplexity under these circumstances.
I think the issue is now clearer. I believe the problem lies in how I load the local Sparse Model & Tokenizer. Does llm-compressor have any examples or documentation I can refer to? Any suggestions would be appreciated, thank you! 🥰
Also, I apologize for not providing the exact sparse model I used. After running it in the online RunPod environment, I didn’t download the model. However, this process should be easy to replicate. Here are the steps I followed for testing:
Step 1: Execute the official llama-7B sparse conversion example from llm-compressor: run python llama7b_sparse_w4a16.py
Step 2: After about an hour, the sparse conversion finishes, and you’ll find the model saved in three stages in the output folder output_llama7b_2:4_w4a16_channel, which I rename to output_llama7b_2_4_w4a16_channel for easier use.
Step 3: Load the stage_finetuning sparse model and tokenizer from output_llama7b_2_4_w4a16_channel/stage_finetuning, and follow the Hugging Face process to calculate perplexity.
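For reference, the simple model.generate sanity check mentioned above is roughly the following (a minimal sketch; it assumes the llmcompressor.transformers import path and uses the stage_finetuning folder from Step 3):

```python
# Minimal generate() sanity check for the stage_finetuning checkpoint
# (assumes the llmcompressor.transformers.SparseAutoModelForCausalLM import path).
from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM

model_path = "output_llama7b_2_4_w4a16_channel/stage_finetuning"
model = SparseAutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```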
## The Success Case with the Llama-3.2-3B online model
Test Model Output
Calculating Perplexity
Result
Summary
I want to correctly load the local sparse model and calculate its perplexity as an evaluation metric. However, it seems that I haven’t used the correct method to load the model (through the SparseAutoModelForCausalLM class) or the Tokenizer. If there are any documents or resources I can refer to, please let me know. Thanks! 🥰