add visible config env for ascend npu #5369

eigen2017 · 2024-04-05T20:51:08Z

When selecting device IDs on Ascend NPU, CUDA_VISIBLE_DEVICES has no effect.
The environment variables that control device visibility on Ascend NPU are NPU_VISIBLE_DEVICES and ASCEND_RT_VISIBLE_DEVICES.
This pull request makes the following DeepSpeed options work on Ascend NPU:

deepspeed --num_gpus=4
deepspeed --include="work1:2,3@work2:0,1"

CUDA_VISIBLE_DEVICES applies to CUDA GPUs, while NPU_VISIBLE_DEVICES and ASCEND_RT_VISIBLE_DEVICES apply to Ascend NPUs, so this commit does not affect DeepSpeed's support for CUDA GPUs.
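
To make the idea concrete, here is a minimal sketch of choosing the visibility variables by device type. It is not the code in this PR or in launch.py: the `VISIBLE_ENVS` table and the `build_visible_env` helper are hypothetical names used only for illustration, and only the environment variable names come from the description above.

```python
import os

# Hypothetical mapping from device type to the environment variables that
# restrict device visibility. The variable names come from this PR
# description; the table itself is an assumption made for this sketch.
VISIBLE_ENVS = {
    "cuda": ["CUDA_VISIBLE_DEVICES"],
    "npu": ["NPU_VISIBLE_DEVICES", "ASCEND_RT_VISIBLE_DEVICES"],
}


def build_visible_env(device_name, device_ids, base_env=None):
    """Return a copy of base_env with the visibility variables set.

    device_ids is a list of local device indices, e.g. [0, 2].
    When base_env is None, the current process environment is copied.
    """
    env = dict(os.environ if base_env is None else base_env)
    value = ",".join(str(i) for i in device_ids)
    for name in VISIBLE_ENVS[device_name]:
        env[name] = value
    return env


if __name__ == "__main__":
    # Only the NPU variables are set, so a CUDA launch is unaffected.
    print(build_visible_env("npu", [2, 3], base_env={}))
```

Because the variables are selected per device type, a CUDA launch in this sketch never sees the NPU variables, and the reverse also holds.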

This commit has been tested on Atlas-800t-A2 servers, each with 8 Ascend NPUs (910b).
The 910b is the state of the art in the Ascend NPU series.

eigen2017 · 2024-04-05T20:57:43Z

Ascend NPU is HW's accelerator hardware.

eigen2017 · 2024-04-05T21:05:08Z

With this change, the following issues will also be solved on Ascend NPU ^_^
huggingface/accelerate#2368
#3070

CurryRice233 · 2024-04-07T02:01:45Z

I think a better way is get_accelerator().set_visible_devices().

eigen2017 · 2024-04-07T02:55:14Z

> I think a better way is get_accelerator().set_visible_devices().

CUDA_VISIBLE_DEVICES and NPU_VISIBLE_DEVICES/ASCEND_RT_VISIBLE_DEVICES are concepts at the same level.

This commit is also more in line with the architecture of launch.py.

eigen2017 · 2024-04-07T03:21:50Z

To be clearer about the issue this PR solves:
DeepSpeed lets you choose which GPU IDs a training task runs on, for example:
deepspeed --include="localhost:0,2"
runs the task on GPU 0 and GPU 2.
This setting does not work on HW NPUs.
This commit fixes that.
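
For readers unfamiliar with the `--include` syntax, the following is a rough sketch of how a string such as `work1:2,3@work2:0,1` breaks down into hosts and device IDs. It is not DeepSpeed's actual parser: `parse_include` is a hypothetical helper, and treating a host without a device list as "no filter" (an empty list) is an assumption of this sketch.

```python
def parse_include(include):
    """Parse 'host1:0,1@host2:2' into {'host1': [0, 1], 'host2': [2]}.

    Hosts are separated by '@'; a host is separated from its
    comma-separated device IDs by ':'. A host given without device IDs
    maps to an empty list in this sketch.
    """
    hosts = {}
    for entry in include.split("@"):
        entry = entry.strip()
        if not entry:
            continue
        host, _, slots = entry.partition(":")
        hosts[host] = [int(s) for s in slots.split(",") if s.strip()]
    return hosts


if __name__ == "__main__":
    print(parse_include("localhost:0,2"))
    print(parse_include("work1:2,3@work2:0,1"))
```

The per-host device list is what ends up, joined with commas, in the visibility environment variable for that host.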

CurryRice233 · 2024-04-09T11:53:25Z

I will assign @shiyuan680 to solve this problem within this week. We will create a visible_devices_env() method on the accelerator class, which generalizes to other accelerators.
By the way, NPU_VISIBLE_DEVICES is an old environment variable; now you can just use ASCEND_RT_VISIBLE_DEVICES.
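
A minimal sketch of what such a per-accelerator interface could look like is shown below. The method names `visible_devices_envs()` and `set_visible_devices_envs()` follow the follow-up commit message quoted later in this thread; the class names, the signatures, and the method bodies are assumptions for illustration, not DeepSpeed's actual implementation.

```python
from abc import ABC, abstractmethod


class SketchAccelerator(ABC):
    """Hypothetical base class; not DeepSpeed's real accelerator class."""

    @abstractmethod
    def visible_devices_envs(self):
        """Return the names of the variables that restrict device visibility."""

    def set_visible_devices_envs(self, current_env, local_accelerator_ids):
        """Write the comma-separated device IDs into every visibility variable."""
        value = ",".join(map(str, local_accelerator_ids))
        for name in self.visible_devices_envs():
            current_env[name] = value


class SketchCudaAccelerator(SketchAccelerator):
    def visible_devices_envs(self):
        return ["CUDA_VISIBLE_DEVICES"]


class SketchNpuAccelerator(SketchAccelerator):
    def visible_devices_envs(self):
        # Only the non-deprecated variable, per the comment above.
        return ["ASCEND_RT_VISIBLE_DEVICES"]


if __name__ == "__main__":
    env = {}
    SketchNpuAccelerator().set_visible_devices_envs(env, [0, 1])
    print(env)
```

With this shape, the launcher only calls the interface, and supporting another accelerator means adding one subclass rather than another hardcoded branch.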

eigen2017 · 2024-04-09T12:34:18Z

> I will assign @shiyuan680 to solve this problem within this week. We will create a visible_devices_env() method on the accelerator class, which is more generic to other accelerators. By the way, NPU_VISIBLE_DEVICES is an old environment, now, just use ASCEND_RT_VISIBLE_DEVICES.

That's good, but I found this issue first, and it was my idea to give a simple solution here. I hope DeepSpeed merges it into the master branch quickly.
I don't want to wait for a big code change, which may include other features and may be hard to get accepted.
And I don't like to see another commit just copy my idea and only add an unnecessary function on top of my code.

By the way, I'm a customer of HW Ascend NPUs and am running a POC on them. Setting only ASCEND_RT_VISIBLE_DEVICES does not work on my device.

CurryRice233 · 2024-04-10T03:45:49Z

> > I will assign @shiyuan680 to solve this problem within this week. We will create a visible_devices_env() method on the accelerator class, which is more generic to other accelerators. By the way, NPU_VISIBLE_DEVICES is an old environment, now, just use ASCEND_RT_VISIBLE_DEVICES.
>
> that's good, but i first found this issue, and it's my idea to gave a simple solution here, i hope deepspeed merg it first to master branch quikly. i don't want to wait for a big change of code, and maybe the change includes other features and may hard to be accepted. and i don't like to see other commit just copy my idea and only add an unnecessary function over my code.
>
> by the way , i'm customer of HW ascend npus and running poc of it , it's not work if only set ASCEND_RT_VISIBLE_DEVICES on my device.

We noticed this problem a long time ago and have already planned a solution. DeepSpeed has 6 accelerators, so hardcoding 6 environment variables here is not a good idea; I think @tjruwase has the same opinion. On the other hand, we do not care who pushes the code. If you want, we can add your name as an author (like this), or push the code directly to your repository and solve it with your PR.

As for NPU_VISIBLE_DEVICES, it is a deprecated environment variable and we do not recommend using it. I tested on our server, and it works perfectly with only ASCEND_RT_VISIBLE_DEVICES. You can try upgrading the Ascend HDK and CANN.

eigen2017 · 2024-04-10T06:57:44Z

> > > I will assign @shiyuan680 to solve this problem within this week. We will create a visible_devices_env() method on the accelerator class, which is more generic to other accelerators. By the way, NPU_VISIBLE_DEVICES is an old environment, now, just use ASCEND_RT_VISIBLE_DEVICES.
> >
> > that's good, but i first found this issue, and it's my idea to gave a simple solution here, i hope deepspeed merg it first to master branch quikly. i don't want to wait for a big change of code, and maybe the change includes other features and may hard to be accepted. and i don't like to see other commit just copy my idea and only add an unnecessary function over my code.
> > by the way , i'm customer of HW ascend npus and running poc of it , it's not work if only set ASCEND_RT_VISIBLE_DEVICES on my device.
>
> We have noticed this problem a long time ago, we have already planned solution. Deepspeed has 6 accelerators, not a good idea hardcode 6 envs here, I think @tjruwase has the same opinion. On the other hand, we do not care who push the code, if you want we can add your name as author (like this), or directly push the code to your repository, and solve with your PR.
>
> about NPU_VISIBLE_DEVICES, is a deprecated env, does not recommend using this variable, I tested on our server, works perfect with only ASCEND_RT_VISIBLE_DEVICES, you can try upgrading Ascend HDK and CANN.

Co-author is fine, thanks. My account's primary email: [email protected]
It's not a big deal for you, but it's important for me, because my company needs as much evidence as possible that we contributed to Ascend NPU applications, even as just a customer.

I didn't find any PR or issue discussing this obvious problem. I raised it directly with my HW support contacts and got no solution. If you found the issue a long time ago, why is there no PR, and why does none of my HW support contacts know how to solve it?

So I still insist that I found this issue and figured out the solution first. Adding me as co-author can end this debate.

And 6 accelerators have 6 environment variables, so it is right to put them into one function, including the CUDA ones. I suggest keeping NPU_VISIBLE_DEVICES for compatibility with older NPU versions.

My npu-smi info is:
23.0.rc2.2 version 23.0.rc3 910B3
The CANN version is 7.0.

loadams · 2024-04-15T20:59:33Z

Hi @eigen2017 - should we close this PR in favor of #5396?

eigen2017 · 2024-04-16T09:49:25Z

> Hi @eigen2017 - should we close this PR in favor of #5396?

Sure, thank you for asking.

@delock

Thank you for this [pr](#5369) and for @delock's contribution of ideas. As mentioned in this [pr](#5369), each device has its own environment variables. We create visible_devices_envs() and set_visible_devices_envs() methods on the accelerator class so that each accelerator can implement its env settings within the interface, which is more generic across accelerators. This commit has been tested on NPU, on machines with 8 Ascend NPUs each.

---------

Co-authored-by: yangcheng <[email protected]>
Co-authored-by: eigen2017 <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>

ascend npu visible config env is not CUDA_VISIBLE_DEVICES, it is NPU_VISIBLE_DEVICES and ASCEND_RT_VISIBLE_DEVICES

9e2f16e

eigen2017 requested review from mrwyattii and awan-10 as code owners April 5, 2024 20:51

shiyuan680 mentioned this pull request Apr 11, 2024

add device config env for the accelerator #5396

Merged

Merge branch 'master' into ascend-npu

8071b8b

eigen2017 closed this Apr 16, 2024
