feat(fp8): [Work In Progress] enable FP8 training #1183

Triggered via pull request November 6, 2024 10:31
@zigzagcai
synchronize #369
fp8_enable
Status Failure
Total duration 2h 6m 49s

e2e_test.yaml

on: pull_request
training_4GPU: 1m 31s
training_8GPU_ISP: 1m 24s
training_8GPU_ISP_CKPT: 20m 15s
training_8GPU_4DP2PP_ZB: 50s
Matrix: training_16GPU_4DP2TP2PP_FSP
Matrix: training_16GPU_4DP2TP2PP_MSP
Matrix: training_16GPU_4DP2TP2PP_MTP
Matrix: training_8GPU_4DP2PP
Matrix: training_8GPU_4DP2TP
Matrix: training_8GPU_4DP2TPSP
Matrix: training_internlm2
Matrix: training_llama2

Annotations

25 errors and 15 warnings
training_8GPU_4DP2TP (910B)
Process completed with exit code 1.
training_8GPU_4DP2TPSP (910B)
unable to access 'https://github.com/InternLM/InternEvo/': GnuTLS recv error (-110): The TLS connection was non-properly terminated.
training_8GPU_4DP2TPSP (910B)
unable to access 'https://github.com/InternLM/InternEvo/': Failed to connect to github.com port 443: Connection timed out
training_8GPU_4DP2TPSP (910B)
RPC failed; curl 28 Failed to connect to github.com port 443: Connection timed out
training_8GPU_4DP2TPSP (910B)
expected 'acknowledgments'
training_8GPU_4DP2TPSP (910B)
The process '/usr/local/bin/git' failed with exit code 128
training_4GPU
Process completed with exit code 1.
training_16GPU_4DP2TP2PP_FSP (910B)
RPC failed; curl 28 Failed to connect to github.com port 443: Connection timed out
training_16GPU_4DP2TP2PP_FSP (910B)
expected 'acknowledgments'
training_16GPU_4DP2TP2PP_FSP (910B)
Process completed with exit code 1.
training_16GPU_4DP2TP2PP_MSP (910B)
Process completed with exit code 1.
training_16GPU_4DP2TP2PP_MTP (910B)
unable to access 'https://github.com/InternLM/InternEvo/': GnuTLS recv error (-110): The TLS connection was non-properly terminated.
training_16GPU_4DP2TP2PP_MTP (910B)
RPC failed; curl 56 GnuTLS recv error (-110): The TLS connection was non-properly terminated.
training_16GPU_4DP2TP2PP_MTP (910B)
expected 'acknowledgments'
training_16GPU_4DP2TP2PP_MTP (910B)
unable to access 'https://github.com/InternLM/InternEvo/': Failed to connect to github.com port 443: Connection timed out
training_16GPU_4DP2TP2PP_MTP (910B)
The process '/usr/local/bin/git' failed with exit code 128
training_8GPU_4DP2PP (910B)
Process completed with exit code 1.
training_8GPU_4DP2PP (910B)
unable to access 'https://github.com/InternLM/InternEvo/': Failed to connect to github.com port 443: Connection timed out
training_internlm2 (910B)
Process completed with exit code 1.
training_llama2 (910B)
Process completed with exit code 1.
training_llama2 (910B)
unable to access 'https://github.com/InternLM/InternEvo/': GnuTLS recv error (-110): The TLS connection was non-properly terminated.
training_8GPU_ISP
Process completed with exit code 143.
training_8GPU_ISP_CKPT
The job running on runner evo_t_cluster_two has exceeded the maximum execution time of 20 minutes.
training_8GPU_ISP_CKPT
The operation was canceled.
training_8GPU_4DP2PP_ZB
Process completed with exit code 143.
training_8GPU_4DP2TP (910B)
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/checkout@v3. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
training_8GPU_4DP2TPSP (910B)
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/checkout@v3. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
training_4GPU
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/checkout@v3. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
training_4GPU
The Actions runner will no longer support your OS version on November 1, 2024. Please upgrade to a supported version. For information, refer https://github.blog/changelog/2024-08-19-notice-of-upcoming-deprecations-and-breaking-changes-in-github-actions-runners/
training_16GPU_4DP2TP2PP_FSP (910B)
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/checkout@v3. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
training_16GPU_4DP2TP2PP_MSP (910B)
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/checkout@v3. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
training_16GPU_4DP2TP2PP_MTP (910B)
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/checkout@v3. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
training_8GPU_4DP2PP (910B)
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/checkout@v3. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
training_internlm2 (910B)
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/checkout@v3. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
training_llama2 (910B)
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/checkout@v3. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
training_8GPU_ISP
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/checkout@v3. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
training_8GPU_ISP
The Actions runner will no longer support your OS version on November 1, 2024. Please upgrade to a supported version. For information, refer https://github.blog/changelog/2024-08-19-notice-of-upcoming-deprecations-and-breaking-changes-in-github-actions-runners/
training_8GPU_ISP_CKPT
The Actions runner will no longer support your OS version on November 1, 2024. Please upgrade to a supported version. For information, refer https://github.blog/changelog/2024-08-19-notice-of-upcoming-deprecations-and-breaking-changes-in-github-actions-runners/
training_8GPU_4DP2PP_ZB
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/checkout@v3. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
training_8GPU_4DP2PP_ZB
The Actions runner will no longer support your OS version on November 1, 2024. Please upgrade to a supported version. For information, refer https://github.blog/changelog/2024-08-19-notice-of-upcoming-deprecations-and-breaking-changes-in-github-actions-runners/