RuntimeError: Wrong inputs arguments, Please refer to examples(help(jt.grad)). #564

Open
linengcs opened this issue Jul 5, 2024 · 3 comments

Comments

@linengcs commented Jul 5, 2024

Describe the bug

When executing self.optimizer.step(main_loss), the following error is raised:

Traceback (most recent call last):
  File "train_edge.py", line 506, in <module>
    trainer.train_edge()
  File "train_edge.py", line 290, in train_edge
    self.optimizer.step(main_loss)
  File "/home/ubuntu/hdd2/llf/miniconda3/envs/fdlnet_j/lib/python3.8/site-packages/jittor/optim.py", line 305, in step
    self.pre_step(loss, retain_graph=False)
  File "/home/ubuntu/hdd2/llf/miniconda3/envs/fdlnet_j/lib/python3.8/site-packages/jittor/optim.py", line 220, in pre_step
    self.backward(loss, retain_graph)
  File "/home/ubuntu/hdd2/llf/miniconda3/envs/fdlnet_j/lib/python3.8/site-packages/jittor/optim.py", line 173, in backward
    grads = jt.grad(loss, params_has_grad, retain_graph)
  File "/home/ubuntu/hdd2/llf/miniconda3/envs/fdlnet_j/lib/python3.8/site-packages/jittor/__init__.py", line 445, in grad
    return core.grad(loss, targets, retain_graph)
RuntimeError: Wrong inputs arguments, Please refer to examples(help(jt.grad)).

The optimizer used is SGD:

self.optimizer = jt.optim.SGD(params_list,
                              lr=args.lr,
                              momentum=args.momentum,
                              weight_decay=args.weight_decay)

The loss passed in: jt.Var([4.33252289], dtype=float64)
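
For reference, the dtype can be checked right before the step call (a minimal sketch using the names from the reproduce code below; jt.Var exposes a .dtype attribute):

# Print dtypes just before optimizer.step(); the loss prints as float64 here.
print(main_loss.dtype)              # -> float64
print(images.dtype, targets.dtype)  # dataloader outputs
self.optimizer.step(main_loss)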

Full Log

(fdlnet_j) llf@XY-TITAN-RTX:/home/ubuntu/hdd2/llf/fdlnet_jittor/scripts$ python train_edge.py --model fdlnet --backbone resnet50 --dataset night --aux
[i 0705 15:10:15.537224 52 compiler.py:956] Jittor(1.3.8.5) src: /home/ubuntu/hdd2/llf/miniconda3/envs/fdlnet_j/lib/python3.8/site-packages/jittor
[i 0705 15:10:15.545380 52 compiler.py:957] g++ at /usr/bin/g++(5.5.0)
[i 0705 15:10:15.545582 52 compiler.py:958] cache_path: /home/llf/.cache/jittor/jt1.3.8/g++5.5.0/py3.8.19/Linux-4.15.0-1x37/IntelRXeonRGolx4e/default
[i 0705 15:10:15.579173 52 install_cuda.py:93] cuda_driver_version: [12, 1]
[i 0705 15:10:15.579814 52 install_cuda.py:81] restart /home/ubuntu/hdd2/llf/miniconda3/envs/fdlnet_j/bin/python ['train_edge.py', '--model', 'fdlnet', '--backbone', 'resnet50', '--dataset', 'night', '--aux']
[i 0705 15:10:15.903714 16 compiler.py:956] Jittor(1.3.8.5) src: /home/ubuntu/hdd2/llf/miniconda3/envs/fdlnet_j/lib/python3.8/site-packages/jittor
[i 0705 15:10:15.910872 16 compiler.py:957] g++ at /usr/bin/g++(5.5.0)
[i 0705 15:10:15.911057 16 compiler.py:958] cache_path: /home/llf/.cache/jittor/jt1.3.8/g++5.5.0/py3.8.19/Linux-4.15.0-1x37/IntelRXeonRGolx4e/default
[i 0705 15:10:15.944564 16 install_cuda.py:93] cuda_driver_version: [12, 1]
[i 0705 15:10:15.954342 16 __init__.py:411] Found /home/llf/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/bin/nvcc(11.2.152) at /home/llf/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/bin/nvcc.
[i 0705 15:10:16.037728 16 __init__.py:411] Found gdb(8.1.1) at /usr/bin/gdb.
[i 0705 15:10:16.046927 16 __init__.py:411] Found addr2line(2.30) at /usr/bin/addr2line.
[i 0705 15:10:16.301866 16 compiler.py:1011] cuda key:cu11.2.152_sm_75
[i 0705 15:10:16.767486 16 __init__.py:227] Total mem: 125.56GB, using 16 procs for compiling.
[i 0705 15:10:16.866903 16 jit_compiler.cc:28] Load cc_path: /usr/bin/g++
[i 0705 15:10:17.003635 16 init.cc:62] Found cuda archs: [75,]
[i 0705 15:10:17.038976 16 __init__.py:411] Found mpicc(2.1.1) at /usr/bin/mpicc.
[i 0705 15:10:18.680663 16 cuda_flags.cc:49] CUDA enabled.
2024-07-05 15:10:18,788 test INFO: Using 1 GPUs
2024-07-05 15:10:18,788 test INFO: Namespace(att_weight=0.01, aux=True, aux_weight=0.4, backbone='resnet50', base_size=512, batch_size=2, best_recode={'epoch': -1, 'mean_iu': 0}, crop_size=384, dataset='night', date_str='2024_07_05_15_10_18', device='cuda', distributed=False, edge_weight=0.01, epochs=260, flip=False, joint_edgeseg_loss=False, jpu=False, l2_weight=0, last_recode={}, local_rank=0, log_dir='../runs/logs/', log_iter=20, lr=0.005, manual_seed=40171, model='fdlnet', momentum=0.9, no_cuda=False, num_gpus=1, resume=None, save_dir='../runs/ckpt', save_epoch=20, seg_weight=1.0, skip_val=False, start_epoch=0, use_ohem=False, val_epoch=1, warmup_factor=0.3333333333333333, warmup_iters=0, warmup_method='linear', weight_decay=0.0005, workers=12)
Found 2998 images in the folder ../../datasets/night/images/train
Found 1299 images in the folder ../../datasets/night/images/val
[w 0705 15:10:19.370889 16 nn.py:2280]  The `Parameter` interface isn't needed in Jittor, this interface
does nothings and it is just used for compatible.

A Jittor Var is a Parameter
when it is a member of Module, if you don't want a Jittor
Var menber is treated as a Parameter, just name it startswith
underscore `_`.

2024-07-05 15:10:19,373 test INFO: Start training, Total Epochs: 260 = Total Iterations 389740
type of threshold_index: <class 'jittor.jittor_core.Var'>, shape of threshold_index: [1,]
type of threshold_index: <class 'jittor.jittor_core.Var'>, shape of threshold_index: [1,]
type of threshold_index: <class 'jittor.jittor_core.Var'>, shape of threshold_index: [1,]

Compiling Operators(1/1) used: 2.96s eta:    0s

Compiling Operators(1/1) used: 2.95s eta:    0s
Traceback (most recent call last):
  File "train_edge.py", line 506, in <module>
    trainer.train_edge()
  File "train_edge.py", line 290, in train_edge
    self.optimizer.step(main_loss)
  File "/home/ubuntu/hdd2/llf/miniconda3/envs/fdlnet_j/lib/python3.8/site-packages/jittor/optim.py", line 305, in step
    self.pre_step(loss, retain_graph=False)
  File "/home/ubuntu/hdd2/llf/miniconda3/envs/fdlnet_j/lib/python3.8/site-packages/jittor/optim.py", line 220, in pre_step
    self.backward(loss, retain_graph)
  File "/home/ubuntu/hdd2/llf/miniconda3/envs/fdlnet_j/lib/python3.8/site-packages/jittor/optim.py", line 173, in backward
    grads = jt.grad(loss, params_has_grad, retain_graph)
  File "/home/ubuntu/hdd2/llf/miniconda3/envs/fdlnet_j/lib/python3.8/site-packages/jittor/__init__.py", line 445, in grad
    return core.grad(loss, targets, retain_graph)
RuntimeError: Wrong inputs arguments, Please refer to examples(help(jt.grad)).

Types of your inputs are:
 self	= module,
 args	= (Var, list, bool, ),

The function declarations are:
 vector<VarHolder*> _grad(VarHolder* loss, const vector<VarHolder*>& targets, bool retain_graph=true)

Failed reason:[f 0705 15:10:28.107652 16 cublas_batched_matmul_op.cc:34] Check failed: a->dtype().dsize() == b->dtype().dsize()  Something wrong... Could you please report this issue?
 type of two inputs should be the same

Minimal Reproduce

        for iteration, (images, targets, edge, _) in enumerate(self.train_dataloader):
            batch_pixel_size = images.size(0) * images.size(2) * images.size(3)

            # print(images.shape, targets.shape)
            iteration = iteration + 1

            main_loss = None
            loss_dict = self.model(images, gts=(targets, edge))

            if args.seg_weight > 0:
                log_seg_loss = loss_dict['seg_loss'].mean().clone().detach()
                train_seg_loss.update(log_seg_loss.item(), batch_pixel_size)
                main_loss = loss_dict['seg_loss']

            if args.aux_weight > 0:
                log_aux_loss = loss_dict['aux_loss'].mean().clone().detach()
                train_aux_loss.update(log_aux_loss.item(), batch_pixel_size)
                main_loss += loss_dict['aux_loss']

            if args.att_weight > 0:
                log_att_loss = loss_dict['att_loss'].mean().clone().detach()
                train_att_loss.update(log_att_loss.item(), batch_pixel_size)
                main_loss += loss_dict['att_loss']

            main_loss = main_loss.mean()
            log_main_loss = main_loss.clone().detach()

            train_main_loss.update(log_main_loss.item(), batch_pixel_size)

            self.optimizer.step(main_loss)

@LDYang694 (Collaborator) commented:

A float64/float32 mix-up occurs somewhere in the computation. A likely cause is converting a numpy array into a jt.Var, because numpy initializes arrays as float64 by default.
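
A minimal sketch of that kind of fix, assuming the float64 enters through a numpy array somewhere in the data or model setup (names here are illustrative):

import numpy as np
import jittor as jt

x_np = np.random.rand(4, 8)             # numpy defaults to float64
x = jt.array(x_np.astype(np.float32))   # cast on the numpy side before wrapping into a Var
assert x.dtype == "float32"
# An existing float64 Var can also be converted with Jittor's unary dtype casts,
# e.g. y = y.float32()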


@linengcs (Author) commented:

After adjusting for this, I stepped through the whole run and verified that every data type is float32, but I still get the same error.
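
For what it's worth, a sketch of one more place to look, assuming the standard Module.state_dict() API: since the failing check is inside cublas_batched_matmul, one operand may be a constant or weight created from numpy inside the model rather than an input.

# Dump every parameter/buffer that is not float32; a weight built from a numpy
# array (float64 by default) would show up here even if all inputs are float32.
for name, p in self.model.state_dict().items():
    if str(p.dtype) != "float32":
        print(name, p.dtype, p.shape)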
