
Training YOLOv8m on my own dataset: loss becomes 0 after a few epochs. Single 3090 Ti, learning rate adjusted, mosaic augmentation disabled. #166

zhangyusen1997 opened this issue Jul 12, 2023 · 16 comments

@zhangyusen1997

Search before asking

  • I have searched the question and found no related answer.

Please ask your question

[Screenshots: training config and logs, 2023-07-12]
@zhangyusen1997
Author

Switching to yolov8s gives the same problem.

@nemonameless
Collaborator

Is the accuracy normal in the first few epochs? Please also post a screenshot of the earlier training. When training on your own dataset, make sure to load the COCO weights as pretraining.
A single GPU is quite limited; consider switching to a smaller model or input scale.

@LDX17

LDX17 commented Jul 13, 2023

I hit the same problem with yolov8s: starting from epoch 11 the loss is 0.
[screenshot: training log]
Also, the eval results after the first 10 epochs are all 0. I did load the COCO weights for pretraining, and I'm using the A100 resources on AI Studio.
[screenshot: eval log]
[screenshot: eval log]

@nemonameless
Collaborator

https://github.com/PaddlePaddle/PaddleYOLO/blob/release/2.6/configs/yolov8/_base_/yolov8_cspdarknet.yml#L2
You can try commenting out syncbn there. For COCO or custom datasets, our internal training is always done on 8+ GPUs with plenty of resources.
Finally, I'd recommend trying ppyoloe+ or rtdetr; the key to fast convergence on a custom dataset is loading very strong pretrained weights.

@sdreamforchen

I'm not sure whether I'm right, but are these v8 training problems an assignment issue, i.e. too many points being selected for the regression loss (positive and negative samples not handled properly)? That would explain the huge initial cls_loss! (See the sketch after the log below.)

[07/18 00:44:35] ppdet.engine INFO: Epoch: [0] [ 0/3665] eta: 81 days, 15:43:15 lr: 0.000000 loss: 1312362.750000 loss_cls: 1312202.375000 loss_iou: 160.321213 loss_dfl: 0.000000 loss_l1: 53.665325 batch_cost: 3.8499 data_cost: 0.0013 ips: 8.3118 images/s
[07/18 00:46:30] ppdet.engine INFO: Epoch: [0] [ 100/3665] eta: 24 days, 1:13:14 lr: 0.000001 loss: 1180799.875000 loss_cls: 1180635.000000 loss_iou: 164.957016 loss_dfl: 0.000000 loss_l1: 53.429075 batch_cost: 1.1069 data_cost: 0.7277 ips: 28.9103 images/s
[07/18 00:48:33] ppdet.engine INFO: Epoch: [0] [ 200/3665] eta: 24 days, 18:02:08 lr: 0.000003 loss: 1321773.687500 loss_cls: 1321605.875000 loss_iou: 162.622360 loss_dfl: 0.000000 loss_l1: 52.578604 batch_cost: 1.2006 data_cost: 0.7857 ips: 26.6543 images/s
[07/18 00:50:36] ppdet.engine INFO: Epoch: [0] [ 300/3665] eta: 24 days, 22:23:33 lr: 0.000007 loss: 1342713.875000 loss_cls: 1342555.562500 loss_iou: 151.517464 loss_dfl: 0.000000 loss_l1: 50.486034 batch_cost: 1.1931 data_cost: 0.8108 ips: 26.8212 images/s
[07/18 00:52:35] ppdet.engine INFO: Epoch: [0] [ 400/3665] eta: 24 days, 19:47:10 lr: 0.000013 loss: 991059.343750 loss_cls: 990933.656250 loss_iou: 128.983490 loss_dfl: 0.000000 loss_l1: 42.778625 batch_cost: 1.1555 data_cost: 0.8027 ips: 27.6943 images/s
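
A minimal illustrative sketch of that concern (the numbers below are hypothetical and not taken from the log above): if the classification loss is summed over every anchor point instead of being normalized by the number of assigned positive samples, its scale grows with the size of the prediction grid.

```python
# Illustrative sketch only -- hypothetical numbers, not PaddleYOLO's actual loss code.
num_anchors = 8400        # anchor-free grid for a 640x640 input (80*80 + 40*40 + 20*20)
num_positives = 20        # positives assigned to one image (assumed)
per_anchor_bce = 0.69     # BCE of an uninformative 0.5 prediction against a 0/1 target

summed_loss = num_anchors * per_anchor_bce             # ~5800, scales with the grid size
normalized_loss = summed_loss / max(num_positives, 1)  # what assigners usually divide by
print(summed_loss, normalized_loss)
```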

@nemonameless
Collaborator

An initial loss_cls this large is clearly an initialization problem. Did you modify any initialization-related code? Train on the COCO dataset first to see what normal behavior looks like; if training on a custom dataset is abnormal, the first thing to check is how the dataset was prepared.
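
For reference, a minimal sketch of the prior-probability bias initialization that RetinaNet/YOLO-style classification heads commonly use (a generic illustration, not the exact PaddleYOLO code); with it, the head starts out predicting a small foreground probability and the first cls loss stays small, whereas a modified or missing init can inflate it.

```python
import math

# Generic illustration, not PaddleYOLO's exact code: the classification-branch
# bias is initialized so the initial foreground probability equals a small prior.
prior_prob = 0.01
bias_init = -math.log((1.0 - prior_prob) / prior_prob)   # ~= -4.595

# sigmoid(bias_init) ~= 0.01, so the BCE on a background anchor starts around
# -log(1 - 0.01) ~= 0.01 instead of ~0.69 for an uninitialized (0.5) output.
initial_fg_prob = 1.0 / (1.0 + math.exp(-bias_init))
print(bias_init, initial_fg_prob)
```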

@sdreamforchen

sdreamforchen commented Jul 18, 2023 via email

@sdreamforchen

> An initial loss_cls this large is clearly an initialization problem. Did you modify any initialization-related code? Train on the COCO dataset first to see what normal behavior looks like; if training on a custom dataset is abnormal, the first thing to check is how the dataset was prepared.

I did not change the initialization, and I pulled the latest code again before training. I only modified the forward_train / forward_eval / get_loss code.

@zhangyusen1997
Author

> Is the accuracy normal in the first few epochs? Please also post a screenshot of the earlier training. When training on your own dataset, make sure to load the COCO weights as pretraining. A single GPU is quite limited; consider switching to a smaller model or input scale.

[screenshot: training log, 2023-07-19]

@zhangyusen1997
Author

The dataset should be fine; it trains normally with ppyoloe, yolof, v3, and v5.

@sdreamforchen

Removing the DFL from the ppyoloe head makes training normal.
Copying PaddleYOLO's v8 backbone and neck into paddledetection and using the ppyoloe head also gives a normal loss, about the same as ppyoloe's. But after 40 epochs the accuracy is still 0.

@54wb

54wb commented Sep 7, 2023

> I hit the same problem with yolov8s: starting from epoch 11 the loss is 0. [screenshot] Also, the eval results after the first 10 epochs are all 0. I did load the COCO weights for pretraining, and I'm using the A100 resources on AI Studio. [screenshot] [screenshot]

Hi, I've run into the same problem: training my own dataset with pretrained weights loaded, single GPU with bs=16, and the loss became 0 in the 3rd epoch, almost exactly the same situation you posted. Did you find a solution in the end?

@sdreamforchen

sdreamforchen commented Sep 7, 2023 via email

@54wb

54wb commented Sep 7, 2023

Could we discuss this? PPYOLO requires paddle 2.4 or above, but the ppdetection I'm currently using is 2.3. Can I just migrate the code over directly?

@sdreamforchen

sdreamforchen commented Sep 7, 2023 via email

@nemonameless
Collaborator

Thanks for the suggestion; we will look into this issue later.
