[BUG] Tensors are on different devices when model.step() #5422
Comments
I have the same issue when running the following script: https://github.com/allenai/open-instruct/blob/main/scripts/dpo_train_with_accelerate.sh. I notice that version 0.14.0 does not have this issue.
I have the same issue with 0.14.1 when running a similar training script; version 0.14.0 works. Here is a snippet of the backtrace (sorry that I cannot provide the Python code; I hope it helps):
@tjruwase @mrwyattii The commit:
@wuxb45, @Heathcliff-Zhao, and @cloudwaysX, thanks for reporting and triaging this issue.
Yes, I have the same issue when using DeepSpeed 0.14.1, so I downgraded: after switching to DeepSpeed 0.14.0, it worked!
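For anyone hitting this, a quick way to confirm which DeepSpeed build is active before deciding whether to downgrade (a minimal sketch; the affected/working version numbers are the ones reported in this thread):

```python
import deepspeed

# Versions reported in this thread: 0.14.1 and 0.14.2 are affected, 0.14.0 works.
print(deepspeed.__version__)

# If an affected version is installed, downgrading restores training, e.g.:
#   pip install deepspeed==0.14.0
```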
@jomayeri Can you show all the files in the thudm/chatglm-6b folder? Are there tokenizer-related files in it?
@Heathcliff-Zhao there are no tokenizer files in it.
@jomayeri The folder should contain the files listed at https://huggingface.co/THUDM/chatglm3-6b/tree/main; you can download any missing files from there.
@Heathcliff-Zhao What is the command to repro with a fresh checkout of your repo?
@jomayeri I use this command.
Version 0.14.2 still has the same issue.
I can reproduce this in 0.14.2. The change seems to have been reverted in #5461 the day after 0.14.2 was released, so it is likely fixed in the next version, 0.14.3.
Hi @kno10 - can you confirm that things work if you build from master?
I did not try. I downgraded to 0.14.0 to get things running again as quickly as possible.
@kno10 - makes sense, thanks. Just wanted to confirm this fixed the issue you were hitting.
Closing with the same comment as #5538.
Describe the bug
The behavior is the same as what is reported in #4565. When calling model.step() with ZeRO stage 3, tensors are on different devices. I modified stage3.py#L2117 to self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale.item()), and it seems to fix the problem. I am not sure whether this is a bug or whether I made a mistake in my training script.
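For context, here is a minimal sketch of the device mismatch that the .item() change works around. The tensors below are placeholders, not DeepSpeed internals; the assumption (for illustration only) is that in ZeRO-3 with offload the flat fp32 gradient can live on the CPU while the combined loss-scale factor is a GPU tensor:

```python
import torch

# Placeholder tensors, not DeepSpeed internals: the gradient is offloaded to CPU,
# the scale factor lives on the GPU when one is available.
grad = torch.ones(4)                         # CPU tensor
combined_scale = torch.tensor(2.0)
if torch.cuda.is_available():
    combined_scale = combined_scale.cuda()   # GPU tensor

# In-place ops require both operands on the same device, so this pattern raises
# "Expected all tensors to be on the same device" when they differ:
#   grad.mul_(1. / combined_scale)

# Passing a Python float instead avoids the cross-device op entirely, which is
# what the combined_scale.item() modification above does:
grad.mul_(1. / combined_scale.item())
print(grad)   # tensor([0.5000, 0.5000, 0.5000, 0.5000])
```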
My code
My training code is taken from here: https://github.com/liucongg/ChatGLM-Finetuning/blob/master/train.py, and I run it with:
CUDA_VISIBLE_DEVICES=0 deepspeed train.py --train_path data/spo_0.json --model_name_or_path ./opensource --per_device_train_batch_size 8 --max_len 4096 --max_src_len 2048 --learning_rate 1e-4 --weight_decay 0.1 --num_train_epochs 2 --gradient_accumulation_step 4 --warmup_ratio 0.1 --mode glm3 --train_type all --seed 1234 --ds_file ds_zero3_offload.json --gradient_checkpointing --show_loss_step 10 --output_dir ./output-glm3
Thanks very much for your precious time!