
Why isn’t VRAM being released after training LoRA? #9876

Open
hjw-0909 opened this issue Nov 6, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@hjw-0909

hjw-0909 commented Nov 6, 2024

Describe the bug

When I use train_dreambooth_lora_sdxl.py, the VRAM is not released after training. How can I fix this?

Reproduction

Not used.

Logs

No response

System Info

  • 🤗 Diffusers version: 0.31.0.dev0
  • Platform: Linux-5.14.0-284.25.1.el9_2.x86_64-x86_64-with-glibc2.17
  • Running on Google Colab?: No
  • Python version: 3.8.20
  • PyTorch version (GPU?): 2.2.0 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Huggingface_hub version: 0.25.2
  • Transformers version: 4.45.2
  • Accelerate version: 1.0.1
  • PEFT version: 0.13.2
  • Bitsandbytes version: 0.44.1
  • Safetensors version: 0.4.5
  • xFormers version: not installed
  • Accelerator: NVIDIA H800, 81559 MiB
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

No response

@hjw-0909 hjw-0909 added the bug Something isn't working label Nov 6, 2024
@SahilCarterr
Contributor

I think you should try manually flushing the GPU memory.
To see the PID of the process, run: sudo fuser -v /dev/nvidia*
Then kill the PID that you no longer need: sudo kill -9 PID

@hjw-0909
Author

hjw-0909 commented Nov 6, 2024

@SahilCarterr I mean that after training, I want to perform other tasks without ending the entire Python script. In theory, VRAM should be released once train_lora.py completes the training, but it isn’t being freed.

@charchit7
Contributor

As @SahilCarterr mentioned, your process might be stalled.
Alternatively, try freeing the GPU memory in your code after the training loop completes, for example:
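A minimal sketch of that kind of cleanup (the variable names are placeholders for whatever large objects your script still holds after training, not the exact names used in train_dreambooth_lora_sdxl.py):

```python
import gc

import torch

# After training returns, make sure nothing in scope still points at the big
# objects (UNet, text encoders, optimizer, ...); the CUDA caching allocator can
# only hand memory back once no Python reference keeps the tensors alive.
unet = text_encoder = optimizer = lr_scheduler = None  # placeholder names

gc.collect()              # collect the Python objects that were holding CUDA tensors
torch.cuda.empty_cache()  # return the allocator's cached blocks to the driver
torch.cuda.ipc_collect()  # clean up CUDA IPC handles, if any were created
```

After this, nvidia-smi should show the process dropping back to roughly its idle footprint.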

@hjw-0909
Author

hjw-0909 commented Nov 6, 2024

@charchit7 I added torch.cuda.empty_cache() after training, but it didn't work.
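One thing worth noting: torch.cuda.empty_cache() can only return blocks that nothing references anymore, and when the script is driven by 🤗 Accelerate, the Accelerator itself keeps references to the prepared models and optimizer. A hedged sketch of a fuller cleanup, assuming the script's Accelerator instance is still in scope:

```python
import gc

import torch
from accelerate import Accelerator

accelerator = Accelerator()  # in the real script this object already exists

# ... training happens here ...

accelerator.free_memory()    # drop accelerate's references to the prepared models/optimizer
gc.collect()                 # then collect whatever else was holding CUDA tensors
torch.cuda.empty_cache()     # and finally release the allocator's cache
```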

@sayakpaul
Member

Can you share a snapshot of the memory usage so we can confirm it really isn't being released?
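For example, something like this right after training returns (a rough sketch), together with the corresponding nvidia-smi output, would show whether the memory is held by live tensors or just by PyTorch's caching allocator:

```python
import torch

# If "allocated" is near zero but "reserved" is large, torch.cuda.empty_cache()
# should give the memory back; if "allocated" stays high, something still
# references the training tensors.
print(f"allocated: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 2**30:.2f} GiB")
print(torch.cuda.memory_summary(abbreviated=True))
```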

@hjw-0909
Author

hjw-0909 commented Nov 8, 2024

@sayakpaul I ensure that the memory is properly released at the end of the .py script. However, I have noticed that after training with LoRA, the memory isn't fully released.

@sayakpaul
Member

I ensure that the memory is properly released at the end of the .py script.

I don't understand what this means. Could you explain further?
