
[BUG] Tensors are on different devices when model.step() #5422

Closed · yuezhao238 opened this issue Apr 16, 2024 · 17 comments
Labels: bug (Something isn't working), training

yuezhao238 commented Apr 16, 2024

Describe the bug
The behavior is the same as what is reported in #4565. When calling model.step() with ZeRO-3, tensors are on different devices. I modified stage3.py#L2117 to self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale.item()), and it seems to work. I am not sure whether this is a bug or whether I made a mistake in my training script.

My code
My training code is taken from https://github.com/liucongg/ChatGLM-Finetuning/blob/master/train.py, and I run it with:

CUDA_VISIBLE_DEVICES=0 deepspeed train.py --train_path data/spo_0.json --model_name_or_path ./opensource --per_device_train_batch_size 8 --max_len 4096 --max_src_len 2048 --learning_rate 1e-4 --weight_decay 0.1 --num_train_epochs 2 --gradient_accumulation_step 4 --warmup_ratio 0.1 --mode glm3 --train_type all --seed 1234 --ds_file ds_zero3_offload.json --gradient_checkpointing --show_loss_step 10 --output_dir ./output-glm3

Thanks very much for your precious time!
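
A minimal sketch of the failure mode described above (an illustration only, not the actual DeepSpeed code path; it assumes a CUDA device is available): the CPU-offloaded fp32 gradient partition cannot be multiplied in place by a scale that lives on the GPU, while converting the scale to a Python float with .item() avoids the device mismatch.

import torch

# Illustration only: mimics the failing line in stage3.py's unscale_and_clip_grads.
grad = torch.zeros(4)                                 # fp32 partition grad stays on CPU with ZeRO-3 offload
combined_scale = torch.tensor(1024.0, device="cuda")  # loss scale / grad norm ends up on cuda:0

try:
    grad.mul_(1.0 / combined_scale)                   # original line: mixes a CPU tensor with a CUDA tensor
except RuntimeError as e:
    print(e)                                          # Expected all tensors to be on the same device ...

grad.mul_(1.0 / combined_scale.item())                # workaround from this report: .item() yields a plain float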

yuezhao238 added the bug (Something isn't working) and training labels on Apr 16, 2024
cloudwaysX commented

I have the same issue when running the following script: https://github.com/allenai/open-instruct/blob/main/scripts/dpo_train_with_accelerate.sh. I noticed that version 0.14.0 does not have this issue.

wuxb45 commented Apr 17, 2024

I have the same issue with 0.14.1 when running a similar training script. Version 0.14.0 works.
I cross-checked with torch 2.2.1 and 2.2.2 and transformers 4.39.0 and 4.39.3. The issue occurs with 0.14.1 across all combinations.

Here is a snippet of the backtrace (sorry, I cannot provide the Python code; I hope it helps):

  File "/home/x/work/.venv/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2169, in step
    self._take_model_step(lr_kwargs)
  File "/home/x/work/.venv/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2075, in _take_model_step
    self.optimizer.step()
  File "/home/x/work/.venv/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/work/.venv/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2047, in step
    self.unscale_and_clip_grads(sub_group_id, scaled_global_grad_norm)
  File "/home/x/work/.venv/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/work/.venv/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2117, in unscale_and_clip_grads
    self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

wuxb45 commented Apr 17, 2024

@tjruwase @mrwyattii unscale_and_clip_grads was last updated two years ago.
It might be the recent change at L2030 that uses norm(): https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L2030

The commit:
54c0687
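
A hedged sketch of why a norm()-based change could surface this error, assuming the global gradient norm is now kept as a 0-dim CUDA tensor rather than a Python float (variable names are illustrative, not the actual DeepSpeed internals):

import torch

# Illustration only: torch.norm returns a 0-dim tensor on the same device as its input,
# so any scale derived from it stays on the GPU; .item() yields a device-agnostic float.
partition_norms = torch.tensor([2.0, 3.0], device="cuda")
global_norm_tensor = torch.norm(partition_norms)   # tensor(3.6056, device='cuda:0')
global_norm_float = global_norm_tensor.item()      # 3.6055... plain Python float

cpu_grad = torch.ones(4)                           # stands in for a CPU-offloaded fp32 grad partition
cpu_grad.mul_(1.0 / global_norm_float)             # fine: a float scalar works with a CPU tensor
# cpu_grad.mul_(1.0 / global_norm_tensor)          # would raise the "different devices" RuntimeError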

tjruwase (Contributor) commented

@wuxb45, @Heathcliff-Zhao, and @cloudwaysX thanks for reporting and triaging this issue.

Kwen-Chen (Contributor) commented

Yes, I have the same issue with DeepSpeed 0.14.1, so I did the following:

pip uninstall deepspeed 
pip install deepspeed==0.14.0

After switching to DeepSpeed 0.14.0, it worked!
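
A quick way to confirm which DeepSpeed version the environment actually imports after pinning (a minimal sketch; deepspeed.__version__ is the standard version attribute):

# confirm the downgrade took effect in the active environment
import deepspeed
print(deepspeed.__version__)  # expect "0.14.0"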

jomayeri (Contributor) commented

@Heathcliff-Zhao I am struggling to repro your code.
[screenshot of the repro attempt]

yuezhao238 (Author) commented

@jomayeri Can you show all the files in the thudm/chatglm-6b folder? Are there tokenizer-related files in it?

jomayeri (Contributor) commented

@Heathcliff-Zhao there are no tokenizer files in it.

yuezhao238 (Author) commented

@jomayeri The folder should contain the files shown in the screenshot below. You can download any missing files from https://huggingface.co/THUDM/chatglm3-6b/tree/main.

[screenshot: file listing of the chatglm3-6b model folder]

jomayeri (Contributor) commented

@Heathcliff-Zhao What is the command to repro with a fresh checkout of your repo? --model_name_or_path ./opensource does not work because that directory does not exist, and specifying --model_name_or_path THUDM/chatglm3-6b to download the model from Hugging Face also fails.

yuezhao238 (Author) commented


@jomayeri I use the command from my original report above. ./opensource in that command is a soft link to the model path; change it to wherever you saved the model weights. I suggest double-checking whether there is a tokenizer.model file in THUDM/chatglm3-6b.

wuxb45 commented Apr 25, 2024

Version 0.14.2 still has the same issue.

kno10 commented May 2, 2024

I can reproduce this in 0.14.2. The change seems to have been reverted in #5461 the day after 0.14.2 was released, so it is likely fixed in the next version, 0.14.3.

loadams (Contributor) commented May 6, 2024

Hi @kno10 - can you confirm that things work if you build from master?

kno10 commented May 6, 2024

I did not try. I downgraded to 0.14.0 to get things back running as quickly as possible.

loadams (Contributor) commented May 7, 2024

@kno10 - makes sense, thanks. Just wanted to confirm this fixed the issue you were hitting.

jomayeri (Contributor) commented

Closing with the same comment as #5538.
