fix(portal-server/mis-server): skip SSH checks for deactivated clusters at startup #1347

Merged
merged 13 commits into from
Jul 17, 2024

Contributor

@piccaSun piccaSun commented Jul 9, 2024

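Background

In real deployments, the machine hosting a cluster may be shut down, or its network may become unreachable. When that happens, an administrator may deactivate the cluster from the management system page.
Therefore, at startup, only activated clusters should be checked against the startup requirements.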

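Changes

This PR makes the following changes:

  1. Portal system
  • At startup, the passwordless SSH check runs only on the login nodes of activated clusters.
  • At startup, the automatic proxy gateway setup runs only for activated clusters. (Activating a cluster no longer triggers the automatic proxy gateway setup; if automatic setup is configured, it is assumed to have already run when the cluster was first initialized.)
  • If no management system is configured, both of the above apply to all clusters.
  2. Management system
  • At startup, the passwordless SSH check runs only on the login nodes of activated clusters.

  • The public key insertion API insertKeyToNewUser gains a clusters parameter. For now it still runs on all clusters, because a failed key insertion does not block user creation.

  • The activateCluster API now checks passwordless SSH to the login nodes of the cluster being activated. If the check fails, the API returns an error and the cluster is not activated.

  • The price plugin no longer throws when an adapter request fails during a multi-cluster callOnAll operation.
    An earlier issue already changed the job price table settings page to show price information only for the currently available clusters instead of reporting an error. Multi-cluster handling in the price plugin now behaves the same way: it reports the failure through the logger rather than throwing.
    This fixes two problems:
    1. During normal operation, when one adapter request fails in a multi-cluster deployment, the background fetchJobs run makes mis-server unreachable.
    2. At startup, when one adapter request fails, mis-server, or a portal-server that depends on mis-server, cannot start.

    Once the adapter is reachable again, the cluster price settings that could not be fetched are retrieved the next time the job price table is opened or the background fetchJobs runs.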
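
The startup rule described above (check only activated clusters, and fall back to all clusters when no management system is configured) can be sketched as a small selection function. This is a minimal illustration, not the code in this PR: `selectClustersToCheck`, `ActivationStatus`, and the shape of the activation record are hypothetical names, and treating a cluster with no recorded status as inactive is an assumption of the sketch.

```typescript
// Hypothetical types for illustration; not the actual SCOW definitions.
type ClusterId = string;
type ActivationStatus = "ACTIVATED" | "DEACTIVATED";

/**
 * Returns the clusters whose login nodes should get the passwordless SSH
 * check at startup.
 *
 * `activation` is undefined when no management system is configured; in
 * that case every configured cluster is checked.
 */
function selectClustersToCheck(
  configuredClusterIds: ClusterId[],
  activation: Record<ClusterId, ActivationStatus> | undefined,
): ClusterId[] {
  if (activation === undefined) {
    // No management system: there is no activation state to consult.
    return configuredClusterIds.slice();
  }
  // Assumption: a cluster with no recorded status is treated as inactive.
  return configuredClusterIds.filter((id) => activation[id] === "ACTIVATED");
}

// Example: hpc02 has been deactivated by an administrator, so only hpc01
// is selected for the startup check.
const toCheck = selectClustersToCheck(
  ["hpc01", "hpc02"],
  { hpc01: "ACTIVATED", hpc02: "DEACTIVATED" },
);
console.log(toCheck);
```

Keeping the selection separate from the SSH check itself means the same check can be reused at activation time for the single cluster being activated.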

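After the change

Tested locally with docker-cluster.

1. With all clusters activated and running, the management system and portal system pages work normally.
(2 screenshots)

2. With login, c1, and slurm all stopped, the management system and portal system pages report errors or show no cluster data.
(5 screenshots)

3. Deactivating a cluster directly:
The management system no longer shows data for the deactivated cluster.
(1 screenshot)
The portal system no longer shows data for the deactivated cluster; a cluster that is currently being accessed reports an error.
(2 screenshots)

4. After stopping login, c1, and slurm, deactivating the cluster, and restarting scow, the cluster management page does not show the activate button because the cluster is in an abnormal state.
(2 screenshots)

5. After restarting login, c1, and slurm, activating the cluster:
(6 screenshots)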


changeset-bot bot commented Jul 9, 2024

🦋 Changeset detected

Latest commit: 9e60741

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 7 packages
Name Type
@scow/portal-server Patch
@scow/mis-server Patch
@scow/auth Patch
@scow/gateway Patch
@scow/mis-web Patch
@scow/portal-web Patch
@scow/cli Patch


@piccaSun piccaSun marked this pull request as ready for review July 10, 2024 12:44
@pkuhpc-review-bot pkuhpc-review-bot bot added the Code-ReviewRequested Code Review Requested label Jul 10, 2024
@pkuhpc-review-bot pkuhpc-review-bot bot requested a review from ddadaal July 10, 2024 12:44
Contributor Author

piccaSun commented Jul 10, 2024

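This PR needs the test engineers to verify it in a real deployment, covering scenarios such as multi-cluster shutdown, cluster deactivation, and scow restart.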

@pkuhpc-review-bot pkuhpc-review-bot bot added Code-Approved Code Review approved ReadyForMerge Ready for merge and removed Code-ReviewRequested Code Review Requested labels Jul 10, 2024
@pkuhpc-review-bot pkuhpc-review-bot bot added E2E-ReviewRequested E2E Test requested and removed ReadyForMerge Ready for merge labels Jul 10, 2024
@lyl-available
Contributor

After the adapter is shut down, fetch job reports an error and the management platform fails to start.

@pkuhpc-review-bot pkuhpc-review-bot bot added E2E-Approved E2E Test approved ReadyForMerge Ready for merge and removed E2E-ReviewRequested E2E Test requested labels Jul 15, 2024
Contributor Author

piccaSun commented Jul 16, 2024

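@ddadaal During testing, the test engineer found that the automatic background fetchJobs run causes the mis-server service to go down and restart repeatedly, so I added the change below.
It has passed testing, and I have added the following to the PR description. Please review this part of the change again.

The price plugin no longer throws when an adapter request fails during a multi-cluster callOnAll operation.
An earlier issue already changed the job price table settings page to show price information only for the currently available clusters instead of reporting an error. Multi-cluster handling in the price plugin now behaves the same way: it reports the failure through the logger rather than throwing.
This fixes two problems:
1. During normal operation, when one adapter request fails in a multi-cluster deployment, the background fetchJobs run makes mis-server unreachable.
2. At startup, when one adapter request fails, mis-server, or a portal-server that depends on mis-server, cannot start.

Once the adapter is reachable again, the cluster price settings that could not be fetched are retrieved the next time the job price table is opened or the background fetchJobs runs.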
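
As a rough sketch of the behavior described in this comment, the helper below calls every cluster, logs a warning for each failed request, and returns the results that did succeed instead of throwing. It is not the actual `callOnAll` used by the price plugin: `callOnAllTolerant`, `Logger`, and `CallResult` are hypothetical names introduced for illustration.

```typescript
// Hypothetical shapes for illustration; not the actual SCOW types.
type ClusterId = string;

interface Logger {
  warn(message: string): void;
}

interface CallResult<T> {
  // Results from clusters whose adapter answered.
  succeeded: Record<ClusterId, T>;
  // Clusters whose adapter request failed; they can be retried later.
  failed: ClusterId[];
}

/**
 * Runs `call` against every cluster concurrently. A failing cluster is
 * logged and recorded in `failed`; it never rejects the overall promise.
 */
async function callOnAllTolerant<T>(
  clusterIds: ClusterId[],
  logger: Logger,
  call: (clusterId: ClusterId) => Promise<T>,
): Promise<CallResult<T>> {
  const succeeded: Record<ClusterId, T> = {};
  const failed: ClusterId[] = [];

  await Promise.all(clusterIds.map(async (clusterId) => {
    try {
      succeeded[clusterId] = await call(clusterId);
    } catch (e) {
      // Swallow the error so one unreachable adapter cannot crash the caller.
      failed.push(clusterId);
      logger.warn(`Request to cluster ${clusterId} failed: ${String(e)}`);
    }
  }));

  return { succeeded, failed };
}
```

A caller such as a background job can then continue with `succeeded` and pick up the clusters listed in `failed` on its next run, once their adapters are reachable again.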

6b159ac

@piccaSun piccaSun requested a review from ddadaal July 16, 2024 09:08
@pkuhpc-review-bot pkuhpc-review-bot bot added Code-ReviewRequested Code Review Requested and removed Code-Approved Code Review approved ReadyForMerge Ready for merge labels Jul 16, 2024
@pkuhpc-review-bot pkuhpc-review-bot bot added Code-Approved Code Review approved ReadyForMerge Ready for merge and removed Code-ReviewRequested Code Review Requested labels Jul 16, 2024
@ddadaal ddadaal merged commit 6eebd35 into master Jul 17, 2024
9 checks passed
@ddadaal ddadaal deleted the fix-skip-deactivated-cluster-ssh branch July 17, 2024 11:05
Labels
Code-Approved Code Review approved E2E-Approved E2E Test approved ReadyForMerge Ready for merge