
Multi-GPU runs and parameter questions #12

Open
zmr66z6xx6 opened this issue Sep 3, 2024 · 40 comments

Comments

@zmr66z6xx6

Why does running this project on multiple GPUs produce garbled inference output and very poor evaluation results? What could be causing this?
A second question: the paper says the experiments use LLaMA-2's default parameters, e.g. the temperature, but actual inference appears to use LLaMA-Factory's parameters, where the temperature is 0.95, while the model's default temperature is 0.6.
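For reference, a minimal sketch (assuming the standard meta-llama/Llama-2-7b-chat-hf checkpoint, which may or may not be the one this repo loads) of how to print the generation defaults that ship with the model, for comparison against whatever LLaMA-Factory logs to the console at predict time:

```python
# Illustrative only: fetch just the model's generation_config.json and print
# the sampling defaults shipped with the checkpoint (checkpoint name assumed).
from transformers import GenerationConfig

gen_cfg = GenerationConfig.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
print(gen_cfg.do_sample, gen_cfg.temperature, gen_cfg.top_p)
```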

@rickyang1114
Collaborator

rickyang1114 commented Sep 3, 2024

Hello, and thank you for your interest in our project!

The vast majority of experiments in this project used only a single GPU; for multi-GPU inference issues, please refer to the upstream LLaMA-Factory repository. For the parameter question, go by what is actually printed to the console.

@zmr66z6xx6
Author

> Hello, and thank you for your interest in our project!
>
> The vast majority of experiments in this project used only a single GPU; for multi-GPU inference issues, please refer to the upstream LLaMA-Factory repository. For the parameter question, go by what is actually printed to the console.

Oh, I see, thanks. And what about the parameter question?

@rickyang1114
Collaborator

It should just be LLaMA-Factory's default parameters; I didn't tune any of them.

@zmr66z6xx6
Author

Got it.
[WARNING|logging.py:328] 2024-09-03 16:13:36,533 >> We detected that you are passing past_key_values as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate Cache class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
[screenshot]
One last question: will this warning have any effect?
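For context, the warning refers to the legacy tuple format for past_key_values; below is a minimal sketch (checkpoint name assumed purely for illustration, not necessarily what this repo loads) of the Cache-style call that newer transformers versions expect:

```python
# Illustrative only: pass a Cache object instead of a legacy tuple so the
# deprecation warning does not fire. generate() builds the cache internally,
# so the warning is typically harmless for ordinary inference runs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

inputs = tok("Hello", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, past_key_values=DynamicCache(), use_cache=True)
print(type(out.past_key_values))  # a Cache subclass rather than a tuple
```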

@rickyang1114
Collaborator

I don't think it has any effect. I ignored the warnings when running the experiments.

@zmr66z6xx6
Author

OK, thanks.
The results I got on gsm8k_test are as follows:
[screenshot]
There still seems to be a gap compared with the reported numbers. Two rows: the top row is the paper's result, the bottom row is my run.
Any pointers would be appreciated.

@rickyang1114
Collaborator

It's probably something in the environment...

@zmr66z6xx6
Author

Oh, OK, thanks.

@SivilTaram
Collaborator

SivilTaram commented Sep 3, 2024

> OK, thanks. The results I got on gsm8k_test are as follows: [screenshot] There still seems to be a gap compared with the reported numbers (top row: paper, bottom row: my run). Any pointers would be appreciated.

@zmr66z6xx6 Could you describe your exact setup in more detail? Did you run the evaluation with the commands provided in this repo, or did you change any parameters? Have you checked the data generated by the model's self-distillation?

@SivilTaram
Collaborator

@zmr66z6xx6 Also, does the problem occur in a single-GPU setup or a multi-GPU setup?

@zmr66z6xx6
Author

> @zmr66z6xx6 Could you describe your exact setup in more detail? Did you run the evaluation with the commands provided in this repo, or did you change any parameters? Have you checked the data generated by the model's self-distillation?

Ran it on the main branch; parameters unchanged, just the originals.

@zmr66z6xx6
Author

> @zmr66z6xx6 Also, does the problem occur in a single-GPU setup or a multi-GPU setup?

Multi-GPU gives garbled output and very poor results. What I posted above was from a single-GPU run.

@SivilTaram
Collaborator

@zmr66z6xx6 You could try the code on the reproduce branch; I'm not sure whether the problem is caused by the latest LLaMA-Factory codebase.

@zmr66z6xx6
Author

@SivilTaram Got it, thanks.

@zmr66z6xx6
Author

@SivilTaram Does this warning have any effect?
[screenshot]

@SivilTaram
Collaborator

@zmr66z6xx6 No effect; it just says that this API will be deprecated soon.

@zmr66z6xx6
Author

> @zmr66z6xx6 No effect; it just says that this API will be deprecated soon.

@SivilTaram OK, thanks. Using the branch I've now got the seed results; the OpenFunctions number is off by quite a lot, only 10.71%.

@SivilTaram
Collaborator

@zmr66z6xx6 You mean the seed model's own inference result on OpenFunctions is only 10.71%, right?

@zmr66z6xx6
Author

@SivilTaram Right. On the main branch earlier, the OpenFunctions test was also unsatisfactory.

@SivilTaram
Collaborator

SivilTaram commented Sep 5, 2024

@zmr66z6xx6 The seed model has nothing to do with the method itself, it's just llama-2-chat. What precision are you using for inference, and on what GPU? And is OpenFunctions the only benchmark where the results are off, or are there others?

@zmr66z6xx6
Author

@SivilTaram I didn't change any parameters; everything is exactly as specified in the project. The GPU is an RTX 3090.
Right, for now only the OpenFunctions test is noticeably off.

@SivilTaram
Collaborator

@zmr66z6xx6 OK, thanks for the feedback! Could you first try other datasets on the reproduce branch, e.g. see whether the SDFT vs. SFT results on GSM8K can be reproduced? It sounds like it might be a hardware precision-support issue 😂, but I'm not sure yet.

@zmr66z6xx6
Author

@SivilTaram OK, got it, thanks.

@zmr66z6xx6
Author

@SivilTaram Results after training on the GSM8K dataset: the gap on OpenFunctions still seems fairly large.
[screenshot]
The SDFT results above match the paper, but the first two items are off. Also, about the precision question earlier: the 3090 does seem to support bf16. By the way, which GPUs were the paper's results run on?
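For what it's worth, a one-line check (illustrative; requires a CUDA build of PyTorch) of whether the current GPU reports bf16 support:

```python
# Illustrative check: Ampere cards such as the RTX 3090 and A800 report True.
import torch

print(torch.cuda.is_available(), torch.cuda.is_bf16_supported())
```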

@rickyang1114
Collaborator

Some experiments used a 3090, some used an A800.

@zmr66z6xx6
Author

@rickyang1114 May I also ask why the results from the branch on GSM8K don't match the paper, and the forgetting reported in the paper doesn't show up?

@rickyang1114
Collaborator

It may be that some small differences in the environment mean the randomness couldn't be completely eliminated.

@zmr66z6xx6
Author

But it's rather odd that the OpenFunctions score improved by this much here. Also, did the paper's predict runs use do_sample? I ran some tasks several times and found the accuracy was exactly identical.

@rickyang1114
Collaborator

HumanEval evaluation was too slow, so I used do_sample=False to speed it up; everywhere else uses LLaMA-Factory's default predict configuration, which should include sampling. Getting the same result across repeated runs in the same environment is normal, because LLaMA-Factory fixes the random seed.

As for why the results aren't fully reproduced: the environment I originally ran the experiments in may not be exactly the same as the reproduction environment. The difference could come from packages in requirements.txt without pinned versions, or from the operating system... I'm not sure of the exact cause either.
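For intuition, a minimal sketch (checkpoint name assumed; not LLaMA-Factory's actual code path) of why sampling can still be bit-for-bit repeatable when the seed is fixed before generation:

```python
# Illustrative only: with the same seed set before each call, sampled decoding
# produces the same continuation, so repeated predict runs give identical scores.
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tok("Question: 48 + 27 = ?", return_tensors="pt")
runs = []
for _ in range(2):
    set_seed(42)  # same seed before each run
    ids = model.generate(**inputs, do_sample=True, temperature=0.95, max_new_tokens=32)
    runs.append(tok.decode(ids[0], skip_special_tokens=True))
assert runs[0] == runs[1]  # identical despite do_sample=True
```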

@zmr66z6xx6
Author

OK, thanks.

@SivilTaram
Collaborator

SivilTaram commented Sep 12, 2024

@zmr66z6xx6 I suspect the OpenFunctions performance is due to do_sample; could you try turning do_sample on and running it several times? HumanEval itself has very few examples, so the variance can easily be large.

The other issue is that the seed model with greedy decoding (do_sample=False) should reproduce the paper's results; it's very strange that even the seed model's numbers don't match...

@zmr66z6xx6
Author

@rickyang1114 Ah, I see that the seed script in the project doesn't specify do_sample. I'll set it to False later and run it again. (I didn't change anything to obtain the results above.)

@zmr66z6xx6
Author

@rickyang1114 One more question: does the HumanEval test require an API? I'm getting an error saying the dataset can't be found; what should I do?

@rickyang1114
Collaborator

Could you check whether bigcode-evaluation-harness is an empty directory? I haven't run into this problem.

@zmr66z6xx6
Author

@rickyang1114 Could it be that my server can't reach the external network? Is the data fetched online from the Hub?

@rickyang1114
Collaborator

Very likely. You could try export HF_ENDPOINT=https://hf-mirror.com or use a proxy.

@zmr66z6xx6
Author

OK, thanks.

@rickyang1114
Collaborator

Previously, the evaluation on the OpenFunctions dataset only matched the keyword arguments in the model's output and ignored positional arguments, so correct answers could be marked wrong. For example, one sample's label is plant.get_scientific_name(common_name="rose"), while the model outputs plant.get_scientific_name("rose"). To address this, I updated the evaluation function for this dataset on the reproduce branch, assigning such matches a weight of 0.5 so that the model's output is evaluated more fairly.
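For clarity, a rough sketch of the scoring rule described above (not the exact function on the reproduce branch; the helper name and regex are illustrative):

```python
# Rough sketch of the updated OpenFunctions scoring rule: an exact match scores
# 1.0, a call that passes the right value positionally instead of by keyword
# scores 0.5, anything else scores 0.0.
import re

_CALL = re.compile(r'(?P<fn>[\w.]+)\((?P<args>.*)\)\s*$')

def score_openfunctions(prediction: str, label: str) -> float:
    if prediction.strip() == label.strip():
        return 1.0
    pred, gold = _CALL.match(prediction.strip()), _CALL.match(label.strip())
    if not pred or not gold or pred["fn"] != gold["fn"]:
        return 0.0
    # Compare argument values only, ignoring "keyword=" prefixes in the label.
    strip_kw = lambda args: [a.split("=", 1)[-1].strip() for a in args.split(",") if a.strip()]
    return 0.5 if strip_kw(pred["args"]) == strip_kw(gold["args"]) else 0.0

# Example from this thread: scores 0.5 instead of being marked wrong.
print(score_openfunctions('plant.get_scientific_name("rose")',
                          'plant.get_scientific_name(common_name="rose")'))
```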

In addition, since the original experimental environment has been lost, I rebuilt the environment from the reproduce branch's requirements.txt and reran the experiments; the results are pasted below:

test_seed_LM.sh

Evaluation on seed LM.

Evaluation on gsm8k:
Accuracy for math: 380 / 1319 = 28.81%

Evaluation on multiarith:
Accuracy for math: 130 / 180 = 72.22%

Evaluation on OpenFunctions:
Accuracy for openfunction: 23.5 / 112 = 20.98%

Evaluation on HumanEval:
Accuracy for HumanEval: 14.63%

Evaluation on raw safety:
file: predictions/seed/advbench-raw/generated_predictions.jsonl, safe_rate: 99.42%

Evaluation on jailbreak safety:
file: predictions/seed/advbench-jailbreak/generated_predictions.jsonl, safe_rate: 94.81%

Evaluation on MMLU:
        Average: 46.42
           STEM: 35.80
Social Sciences: 53.05
     Humanities: 43.35
          Other: 54.46

Evaluation on OpenLLM Leaderboard:
Accuracy for truthfulqa: 35.26%
Accuracy for ai2_arc: 64.06%
Accuracy for hellaswag: 57.80%
Accuracy for winogrande: 66.38%

gsm8k/sft.sh

Fine-tuning using sft

Evaluation on gsm8k:
Accuracy for math: 386 / 1319 = 29.26%

Evaluation on multiarith:
Accuracy for math: 140 / 180 = 77.78%

Evaluation on OpenFunctions:
Accuracy for openfunction: 22.5 / 112 = 20.09%

Evaluation on HumanEval:
Accuracy for HumanEval: 14.63%

Evaluation on raw safety:
file: predictions/gsm8k/sft/advbench-raw/generated_predictions.jsonl, safe_rate: 85.38%

Evaluation on jailbreak safety:
file: predictions/gsm8k/sft/advbench-jailbreak/generated_predictions.jsonl, safe_rate: 53.08%

Evaluation on MMLU:
        Average: 42.98
           STEM: 33.80
Social Sciences: 47.60
     Humanities: 40.81
          Other: 50.28

Evaluation on OpenLLM Leaderboard:
Accuracy for truthfulqa: 31.28%
Accuracy for ai2_arc: 64.01%
Accuracy for hellaswag: 56.76%
Accuracy for winogrande: 68.11%

gsm8k/sdft.sh

Fine-tuning using sdft

Evaluation on gsm8k:
Accuracy for math: 452 / 1319 = 34.27%

Evaluation on multiarith:
Accuracy for math: 155 / 180 = 86.11%

Evaluation on OpenFunctions:
Accuracy for openfunction: 25.0 / 112 = 22.32%

Evaluation on HumanEval:
Accuracy for HumanEval: 16.46%

Evaluation on raw safety:
file: predictions/gsm8k/sdft/advbench-raw/generated_predictions.jsonl, safe_rate: 94.81%

Evaluation on jailbreak safety:
file: predictions/gsm8k/sdft/advbench-jailbreak/generated_predictions.jsonl, safe_rate: 79.62%

Evaluation on MMLU:
        Average: 45.83
           STEM: 35.43
Social Sciences: 53.02
     Humanities: 42.71
          Other: 53.19

Evaluation on OpenLLM Leaderboard:
Accuracy for truthfulqa: 32.72%
Accuracy for ai2_arc: 62.37%
Accuracy for hellaswag: 56.55%
Accuracy for winogrande: 67.40%

openfunction/sft.sh

Fine-tuning using sft

Evaluation on gsm8k:
Accuracy for math: 289 / 1319 = 21.91%

Evaluation on multiarith:
Accuracy for math: 114 / 180 = 63.33%

Evaluation on OpenFunctions:
Accuracy for openfunction: 39 / 112 = 34.82%

Evaluation on HumanEval:
Accuracy for HumanEval: 6.71%

Evaluation on raw safety:
file: predictions/openfunction/sft/advbench-raw/generated_predictions.jsonl, safe_rate: 99.23%

Evaluation on jailbreak safety:
file: predictions/openfunction/sft/advbench-jailbreak/generated_predictions.jsonl, safe_rate: 94.62%

Evaluation on MMLU:
        Average: 46.64
           STEM: 36.07
Social Sciences: 53.80
     Humanities: 43.35
          Other: 54.46

Evaluation on OpenLLM Leaderboard:
Accuracy for truthfulqa: 35.69%
Accuracy for ai2_arc: 63.59%
Accuracy for hellaswag: 57.51%
Accuracy for winogrande: 66.46%

openfunction/sdft.sh

Fine-tuning using sdft

Evaluation on gsm8k:
Accuracy for math: 360 / 1319 = 27.29%

Evaluation on multiarith:
Accuracy for math: 126 / 180 = 70.00%

Evaluation on OpenFunctions:
Accuracy for openfunction: 41 / 112 = 36.61%

Evaluation on HumanEval:
Accuracy for HumanEval: 15.24%

Evaluation on raw safety:
file: predictions/openfunction/sdft/advbench-raw/generated_predictions.jsonl, safe_rate: 99.62%

Evaluation on jailbreak safety:
file: predictions/openfunction/sdft/advbench-jailbreak/generated_predictions.jsonl, safe_rate: 97.31%

Evaluation on MMLU:
        Average: 46.49
           STEM: 35.93
Social Sciences: 52.85
     Humanities: 43.73
          Other: 54.28

Evaluation on OpenLLM Leaderboard:
Accuracy for truthfulqa: 35.06%
Accuracy for ai2_arc: 63.53%
Accuracy for hellaswag: 57.16%
Accuracy for winogrande: 66.46%

magicoder/sft.sh

Fine-tuning using sft

Evaluation on gsm8k:
Accuracy for math: 314 / 1319 = 23.81%

Evaluation on multiarith:
Accuracy for math: 120 / 180 = 66.67%

Evaluation on OpenFunctions:
Accuracy for openfunction: 5.5 / 112 = 4.91%

Evaluation on HumanEval:
Accuracy for HumanEval: 18.90%

Evaluation on raw safety:
file: predictions/magicoder/sft/advbench-raw/generated_predictions.jsonl, safe_rate: 90.00%

Evaluation on jailbreak safety:
file: predictions/magicoder/sft/advbench-jailbreak/generated_predictions.jsonl, safe_rate: 70.00%

Evaluation on MMLU:
        Average: 46.56
           STEM: 35.90
Social Sciences: 53.34
     Humanities: 43.61
          Other: 54.34

Evaluation on OpenLLM Leaderboard:
Accuracy for truthfulqa: 35.73%
Accuracy for ai2_arc: 64.35%
Accuracy for hellaswag: 57.34%
Accuracy for winogrande: 67.17%

magicoder/sdft.sh

Fine-tuning using sdft

Evaluation on gsm8k:
Accuracy for math: 330 / 1319 = 25.02%

Evaluation on multiarith:
Accuracy for math: 114 / 180 = 63.33%

Evaluation on OpenFunctions:
Accuracy for openfunction: 7.5 / 112 = 6.70%

Evaluation on HumanEval:
Accuracy for HumanEval: 20.12%

Evaluation on raw safety:
file: predictions/magicoder/sdft/advbench-raw/generated_predictions.jsonl, safe_rate: 98.27%

Evaluation on jailbreak safety:
file: predictions/magicoder/sdft/advbench-jailbreak/generated_predictions.jsonl, safe_rate: 90.38%

Evaluation on MMLU:
        Average: 46.54
           STEM: 36.10
Social Sciences: 53.12
     Humanities: 43.29
          Other: 54.71

Evaluation on OpenLLM Leaderboard:
Accuracy for truthfulqa: 35.79%
Accuracy for ai2_arc: 64.23%
Accuracy for hellaswag: 57.31%
Accuracy for winogrande: 67.17%

As can be seen, the numbers fluctuate somewhat relative to those in the paper, but they still demonstrate the advantage of SDFT over SFT.

Also, since this project was built on a LLaMA-Factory version from around last December, which did not support multi-GPU inference at the time, running on multiple GPUs may produce unexpected errors; please use a single GPU, as the example scripts do.

rickyang1114 reopened this Sep 19, 2024
@zmr66z6xx6
Author

OK, thanks.

@SivilTaram
Collaborator

@zmr66z6xx6 Please try again and see whether you can reproduce the results above; more feedback is welcome!
