Describe the bug
Running examples/quantizing_moe/deepseek_moe_w8a8_int8.py against a local DeepSeek-Coder-V2-Lite-Instruct checkpoint fails with torch.OutOfMemoryError during GPTQ initialization: while the modifier is preparing the model's decoder layers for compression, allocating a 16 MiB Hessian buffer in gptq_wrapper.py exhausts GPU 0 (44.55 GiB total). The logs and full traceback follow, along with a retry that exposes four GPUs.
```
2024-11-01T03:22:01.826294+0000 | one_shot | INFO - *** One Shot ***
2024-11-01T03:22:01.830653+0000 | from_modifiers | INFO - Creating recipe from modifiers
2024-11-01T03:22:01.831398+0000 | create_instance | WARNING - Could not process input as a file path or zoo stub, attempting to process it as a string.
2024-11-01T03:22:01.875859+0000 | _check_compile_recipe | INFO - Recipe compiled and 1 modifiers created
2024-11-01T03:22:01.903544+0000 | on_initialize_structure | WARNING - GPTQ quantization is set to True without an active quantization modifier.
2024-11-01T03:22:01.903643+0000 | _build_quant_modifier | INFO - Building quantization modifier with args: {'targets': 'Linear', 'scheme': 'W8A8', 'ignore': ['lm_head', 're:.*mlp.gate$']}
2024-11-01T03:22:02.877828+0000 | _check_calibration_data | INFO - Skipping QuantizationModifier calibration, it is not required for the provided quantization config.
2024-11-01T03:22:05.311851+0000 | initialize_compression | INFO - Preparing model.layers.0 for compression
2024-11-01T03:22:05.319877+0000 | initialize_compression | INFO - Preparing model.layers.1 for compression
2024-11-01T03:22:10.454367+0000 | initialize_compression | INFO - Preparing model.layers.2 for compression
2024-11-01T03:22:14.264467+0000 | initialize_compression | INFO - Preparing model.layers.3 for compression
2024-11-01T03:22:20.000751+0000 | initialize_compression | INFO - Preparing model.layers.4 for compression
2024-11-01T03:22:24.437408+0000 | initialize_compression | INFO - Preparing model.layers.5 for compression
Traceback (most recent call last):
  File "/testspace/repo/deepseek/llm-compressor/examples/quantizing_moe/deepseek_moe_w8a8_int8.py", line 79, in <module>
    oneshot(
  File "/opt/conda/lib/python3.10/site-packages/llmcompressor/transformers/finetune/text_generation.py", line 76, in oneshot
    main(model_args, data_args, training_args)
  File "/opt/conda/lib/python3.10/site-packages/llmcompressor/transformers/finetune/text_generation.py", line 359, in main
    stage_runner.one_shot()
  File "/opt/conda/lib/python3.10/site-packages/llmcompressor/transformers/finetune/runner.py", line 171, in one_shot
    self.trainer.one_shot(calibration_data=calib_data, stage=stage)
  File "/opt/conda/lib/python3.10/site-packages/llmcompressor/transformers/finetune/session_mixin.py", line 401, in one_shot
    apply(
  File "/opt/conda/lib/python3.10/site-packages/llmcompressor/core/session_functions.py", line 184, in apply
    return active_session().apply(
  File "/opt/conda/lib/python3.10/site-packages/llmcompressor/core/session.py", line 210, in apply
    self.initialize(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/llmcompressor/core/session.py", line 156, in initialize
    mod_data = self._lifecycle.initialize(
  File "/opt/conda/lib/python3.10/site-packages/llmcompressor/core/lifecycle.py", line 126, in initialize
    data = mod.initialize(state=self.state, **extras)
  File "/opt/conda/lib/python3.10/site-packages/llmcompressor/modifiers/stage.py", line 124, in initialize
    modifier.initialize(state, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/llmcompressor/modifiers/modifier.py", line 118, in initialize
    initialized = self.on_initialize(state=state, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/llmcompressor/modifiers/quantization/gptq/base.py", line 187, in on_initialize
    self.initialize_compression(modifiable_model, calibration_dataloader)
  File "/opt/conda/lib/python3.10/site-packages/llmcompressor/modifiers/quantization/gptq/base.py", line 246, in initialize_compression
    compressor.pre_compress()
  File "/opt/conda/lib/python3.10/site-packages/llmcompressor/modifiers/utils/layer_compressor.py", line 79, in pre_compress
    wrapper = self.module_compressor_class(full_name, layer)
  File "/opt/conda/lib/python3.10/site-packages/llmcompressor/modifiers/quantization/gptq/utils/gptq_wrapper.py", line 45, in __init__
    "H", torch.zeros((self.columns, self.columns), device=self.dev)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB. GPU 0 has a total capacity of 44.55 GiB of which 7.69 MiB is free. Process 689439 has 1.40 GiB memory in use. Process 3382592 has 260.00 MiB memory in use. Process 3845982 has 42.87 GiB memory in use. Of the allocated memory 42.56 GiB is allocated by PyTorch, and 18.16 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
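The failing allocation is the per-module Hessian buffer H that the GPTQ wrapper appears to create eagerly for every targeted Linear module as each decoder layer is prepared. As a sanity check, the arithmetic below reproduces the 16.00 MiB figure from the error; this is a sketch, and both the float32 element size and columns = 2048 (the in_features of the module being wrapped) are assumptions, not values read from the script:

```python
# Back-of-the-envelope size of one GPTQ Hessian buffer,
# H = torch.zeros((columns, columns)), per the traceback above.
# Assumptions: float32 storage (4 bytes/element) and columns = 2048.
columns = 2048
h_bytes = columns * columns * 4  # float32 bytes
print(f"{h_bytes / 2**20:.2f} MiB")  # -> 16.00 MiB, matching the failed allocation
```

Note that the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True suggestion in the message targets fragmentation; here GPU 0 is simply full (the quantizing process already holds 42.87 GiB), so each additional per-layer buffer tips it over.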
Retrying with four GPUs exposed:
```
root@s0pgpuap12:/testspace/repo/deepseek/llm-compressor/examples/quantizing_moe# CUDA_VISIBLE_DEVICES=0,1,2,5 python deepseek_moe_w8a8_int8.py
The repository for /testspace/DeepSeek-Coder-V2-Lite-Instruct contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co//testspace/DeepSeek-Coder-V2-Lite-Instruct.
You can avoid this prompt in future by passing the argument trust_remote_code=True.
Do you wish to run the custom code? [y/N] y
The repository for /testspace/DeepSeek-Coder-V2-Lite-Instruct contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co//testspace/DeepSeek-Coder-V2-Lite-Instruct.
You can avoid this prompt in future by passing the argument trust_remote_code=True.
```
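As the prompt states, the interactive confirmation can be skipped by passing trust_remote_code=True when loading the checkpoint. A minimal sketch, using the local path from the output above; device_map="auto" is an assumption about sharding the model across the visible GPUs, not something the example script is known to do:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "/testspace/DeepSeek-Coder-V2-Lite-Instruct"

# trust_remote_code=True opts in to the repository's custom modeling code
# and suppresses the interactive "[y/N]" prompt shown above.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",  # assumption: shard across visible GPUs instead of filling GPU 0
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
```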
Expected behavior
The example completes W8A8 (GPTQ) quantization of DeepSeek-Coder-V2-Lite-Instruct without running out of memory, ideally by spreading the model and the per-layer GPTQ buffers across the GPUs made visible via CUDA_VISIBLE_DEVICES instead of allocating everything on GPU 0.
Environment
Include all relevant environment information:
OS [e.g. Ubuntu 20.04]: not reported
Python version [e.g. 3.7]: 3.10 (conda, per the traceback paths)
LLM Compressor version or commit hash [e.g. 0.1.0, f7245c8]: not reported
ML framework version(s) [e.g. torch 2.3.1]: torch (exact version not reported)
Other Python package versions [e.g. vLLM, compressed-tensors, numpy, ONNX]: not reported
Other relevant environment information [e.g. hardware, CUDA version]: GPU 0 reports 44.55 GiB total capacity; at least six GPUs in the host (indices 0,1,2,5 selected via CUDA_VISIBLE_DEVICES)
To Reproduce
Exact steps to reproduce the behavior: run the stock MoE example against a local DeepSeek-Coder-V2-Lite-Instruct checkpoint, i.e. CUDA_VISIBLE_DEVICES=0,1,2,5 python deepseek_moe_w8a8_int8.py from examples/quantizing_moe/; a sketch of the call it makes follows.
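A minimal sketch of the oneshot call, reconstructed from the logs above: the GPTQModifier arguments are copied verbatim from the _build_quant_modifier log line, while the dataset and calibration settings are placeholders, since the script body is not reproduced here and may differ.

```python
from transformers import AutoModelForCausalLM

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

MODEL_PATH = "/testspace/DeepSeek-Coder-V2-Lite-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype="auto", trust_remote_code=True
)

# Copied from the "_build_quant_modifier" log line above.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W8A8",
    ignore=["lm_head", "re:.*mlp.gate$"],
)

# Placeholder calibration settings; the real example script may differ.
oneshot(
    model=model,
    recipe=recipe,
    dataset="open_platypus",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```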
Errors
The full print-out of the torch.OutOfMemoryError traceback is included under "Describe the bug" above.
Additional context
Per the OOM report, GPU 0 was shared with two other processes at the time (1.40 GiB and 260.00 MiB held by PIDs 689439 and 3382592), so slightly less than the card's 44.55 GiB was available to the quantization run.