
[BUG] Tensors are on different devices when model.step() #5422

Closed · yuezhao238 opened this issue Apr 16, 2024 · 17 comments
Labels: bug (Something isn't working), training

yuezhao238 commented Apr 16, 2024

Describe the bug
The behavior is the same as what is reported in #4565. When calling model.step() with ZeRO-3, tensors are on different devices. I modified stage3.py#L2117 to self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale.item()), and it seems to work. I am not sure whether this is a bug or whether I made a mistake in my training script.

My code
My training code is taken from https://github.com/liucongg/ChatGLM-Finetuning/blob/master/train.py, and I run it with:

CUDA_VISIBLE_DEVICES=0 deepspeed train.py --train_path data/spo_0.json --model_name_or_path ./opensource --per_device_train_batch_size 8 --max_len 4096 --max_src_len 2048 --learning_rate 1e-4 --weight_decay 0.1 --num_train_epochs 2 --gradient_accumulation_step 4 --warmup_ratio 0.1 --mode glm3 --train_type all --seed 1234 --ds_file ds_zero3_offload.json --gradient_checkpointing --show_loss_step 10 --output_dir ./output-glm3

Thanks very much for your precious time!
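
A minimal sketch of the failure mode described above (an illustration only, not the actual DeepSpeed code path; it assumes a CUDA device is available): the CPU-offloaded fp32 gradient partition cannot be multiplied in place by a scale that lives on the GPU, while converting the scale to a Python float with .item() avoids the device mismatch.

import torch

# Illustration only: mimics the failing line in stage3.py's unscale_and_clip_grads.
grad = torch.zeros(4)                                 # fp32 partition grad stays on CPU with ZeRO-3 offload
combined_scale = torch.tensor(1024.0, device="cuda")  # loss scale / grad norm ends up on cuda:0

try:
    grad.mul_(1.0 / combined_scale)                   # original line: mixes a CPU tensor with a CUDA tensor
except RuntimeError as e:
    print(e)                                          # Expected all tensors to be on the same device ...

grad.mul_(1.0 / combined_scale.item())                # workaround from this report: .item() yields a plain float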

yuezhao238 added the bug (Something isn't working) and training labels on Apr 16, 2024
cloudwaysX commented

I have the same issue when running the following script: https://github.com/allenai/open-instruct/blob/main/scripts/dpo_train_with_accelerate.sh. I noticed that version 0.14.0 does not have this issue.

wuxb45 commented Apr 17, 2024

I have the same issue with 0.14.1 when running a similar training script. Version 0.14.0 works.
I cross-checked with torch 2.2.1 and 2.2.2 and transformers 4.39.0 and 4.39.3. The issue occurs with 0.14.1 across all combinations.

Here is a snippet of the backtrace (sorry, I cannot provide the Python code; I hope it helps):

  File "/home/x/work/.venv/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2169, in step
    self._take_model_step(lr_kwargs)
  File "/home/x/work/.venv/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2075, in _take_model_step
    self.optimizer.step()
  File "/home/x/work/.venv/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/work/.venv/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2047, in step
    self.unscale_and_clip_grads(sub_group_id, scaled_global_grad_norm)
  File "/home/x/work/.venv/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/work/.venv/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2117, in unscale_and_clip_grads
    self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

wuxb45 commented Apr 17, 2024

@tjruwase @mrwyattii unscale_and_clip_grads was last updated two years ago.
It might be the recent change at L2030 that uses norm(): https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L2030

The commit:
54c0687
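
A hedged sketch of why a norm()-based change could surface this error, assuming the global gradient norm is now kept as a 0-dim CUDA tensor rather than a Python float (variable names are illustrative, not the actual DeepSpeed internals):

import torch

# Illustration only: torch.norm returns a 0-dim tensor on the same device as its input,
# so any scale derived from it stays on the GPU; .item() yields a device-agnostic float.
partition_norms = torch.tensor([2.0, 3.0], device="cuda")
global_norm_tensor = torch.norm(partition_norms)   # tensor(3.6056, device='cuda:0')
global_norm_float = global_norm_tensor.item()      # 3.6055... plain Python float

cpu_grad = torch.ones(4)                           # stands in for a CPU-offloaded fp32 grad partition
cpu_grad.mul_(1.0 / global_norm_float)             # fine: a float scalar works with a CPU tensor
# cpu_grad.mul_(1.0 / global_norm_tensor)          # would raise the "different devices" RuntimeError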

tjruwase (Contributor) commented

@wuxb45, @Heathcliff-Zhao, and @cloudwaysX thanks for reporting and triaging this issue.

Kwen-Chen (Contributor) commented

Yes, I have the same issue with DeepSpeed 0.14.1, so I did the following:

pip uninstall deepspeed 
pip install deepspeed==0.14.0

After switching to DeepSpeed 0.14.0, it worked!
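
A quick way to confirm which DeepSpeed version the environment actually imports after pinning (a minimal sketch; deepspeed.__version__ is the standard version attribute):

# confirm the downgrade took effect in the active environment
import deepspeed
print(deepspeed.__version__)  # expect "0.14.0"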

jomayeri (Contributor) commented

@Heathcliff-Zhao I am struggling to repro your code.
[screenshot of the repro attempt]

yuezhao238 (Author) commented

@jomayeri Can you show all the files in the thudm/chatglm-6b folder? Are there tokenizer-related files in it?

jomayeri (Contributor) commented

@Heathcliff-Zhao there are no tokenizer files in it.

yuezhao238 (Author) commented

@jomayeri The folder should contain the files shown in the screenshot below. You can download any missing files from https://huggingface.co/THUDM/chatglm3-6b/tree/main.

[screenshot: file listing of the chatglm3-6b model folder]

jomayeri (Contributor) commented

@Heathcliff-Zhao What is the command to repro with a fresh checkout of your repo? --model_name_or_path ./opensource does not work because that directory does not exist, and specifying --model_name_or_path THUDM/chatglm3-6b to download the model from Hugging Face also fails.

yuezhao238 (Author) commented


@jomayeri I use the command from my original report above. ./opensource in that command is a soft link to the model path; change it to wherever you saved the model weights. I suggest double-checking whether there is a tokenizer.model file in THUDM/chatglm3-6b.

wuxb45 commented Apr 25, 2024

Version 0.14.2 still has the same issue.

kno10 commented May 2, 2024

I can reproduce this in 0.14.2. The change seems to have been reverted in #5461 the day after 0.14.2 was released, so it is likely fixed in the next version, 0.14.3.

loadams (Contributor) commented May 6, 2024

Hi @kno10 - can you confirm that things work if you build from master?

kno10 commented May 6, 2024

I did not try. I downgraded to 0.14.0 to get things back running as quickly as possible.

loadams (Contributor) commented May 7, 2024

@kno10 - makes sense, thanks. Just wanted to confirm this fixed the issue you were hitting.

jomayeri (Contributor) commented

Closing with the same comment as #5538.
