- OS version on all nodes: Ubuntu 18.04
- All nodes share the same root password and allow root SSH access
- Short domain names resolve on all nodes
/etc/hosts is configured as follows (using one master and two worker nodes as an example):
127.0.0.1 localhost
your_master_local_ip master
your_master_local_ip master.sigsus.cn
your_worker_local_ip worker01
your_worker_local_ip worker01.sigsus.cn
your_second_worker_local_ip worker02
your_second_worker_local_ip worker02.sigsus.cn
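Each node needs one short-name line and one fully qualified line, so the entries can be generated instead of typed by hand. A minimal sketch, assuming the sigsus.cn domain from the example; `hosts_entries` is a hypothetical helper and the worker IP addresses are placeholders to replace with your own:

```shell
#!/bin/sh
# hosts_entries IP NAME [DOMAIN]: print the short-name and FQDN lines for one node.
# Hypothetical helper; the default domain is the one used in the example.
hosts_entries() {
    ip="$1"; name="$2"; domain="${3:-sigsus.cn}"
    printf '%s %s\n' "$ip" "$name"
    printf '%s %s.%s\n' "$ip" "$name" "$domain"
}

# Placeholder addresses; review the output before appending it to /etc/hosts.
hosts_entries 192.168.3.2 master
hosts_entries 192.168.3.3 worker01
hosts_entries 192.168.3.4 worker02
```

The output can be checked and then appended on every node, for example with `sudo tee -a /etc/hosts`.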
- Configure Harbor (on the master node)

Name | Repo URL | Branch |
---|---|---|
DLTS Main Project | https://github.com/apulis/apulis_platform.git | V1.0.0 |
AIArts-Frontend | https://github.com/apulis/AIArts-Frontend.git | V1.0.0 |
AIArts-Backend | https://github.com/apulis/AIArts-Backend.git | V1.0.0 |
user-dashboard-frontend | https://github.com/apulis/user-dashboard-frontend.git | V1.0.0 |
user-dashboard-backend | https://github.com/apulis/user-dashboard-backend.git | V1.0.0 |
image-label-frontend | https://github.com/apulis/NewObjectLabel | master |
image-label | https://github.com/apulis/data-platform-backend.git | dev |
ascend-for-volcano | https://github.com/apulis/ascend-for-volcano | huaweidls-0.4.0 |
ascend-device-plugin | https://github.com/apulis/ascend-device-plugin | v1.0.0 |
kfserving | https://github.com/apulis/kfserving.git | 0.2.2 |
- Create a virtual environment
virtualenv -p python2.7 pythonenv2.7
. pythonenv2.7/bin/activate
- Install the Python packages
cd DLWorkspace/src/ClusterBootstrap/
pip install -r scripts/requirements.txt
- Install golang (optional; used to build the Atlas-specific components; needed only when the cluster has NPU compute devices)
Build restfulapi2
cd DLWorkspace/src/ClusterBootstrap/
./deploy.py docker build restfulapi2
Build init-container
./deploy.py docker push init-container
Build job-exporter
./deploy.py docker push job-exporter
Build gpu-reporter
./deploy.py docker push gpu-reporter
Build watchdog
./deploy.py docker push watchdog
Build repairmanager2
./deploy.py docker push repairmanager2
cd AIArts-Frontend/
docker build -t dlworkspace_aiarts-frontend:1.0.0 .
cd AIArts-Backend/deployment/
bash build.sh
cd user-dashboard-frontend/
docker build -t dlworkspace_custom-user-dashboard-frontend:latest .
cd user-dashboard-backend/
docker build -t dlworkspace_custom-user-dashboard-backend:latest .
cd NewObjectLabel/
docker build -t dlworkspace_image-label:latest .
cd DLWorkspace/src/ClusterBootstrap/
./deploy.py docker push data-platform-backend
- Create the directories
mkdir -p ${GOPATH}/{src/github.com/google,src/k8s.io,src/volcano.sh}
- Upload the ascend-for-volcano folder obtained from the software package to "${GOPATH}/src/volcano.sh/" and rename the folder to volcano
- Create the build folder
cd ${GOPATH}/src/volcano.sh/volcano/
mkdir -p build
- Create and edit build.sh
Run:
cd ${GOPATH}/src/volcano.sh/volcano/build
Then run:
vim build.sh
and enter the following:
#!/bin/sh
cd ${GOPATH}/src/volcano.sh/volcano/
make clean
export PATH=$GOPATH/bin:$PATH
export GO111MODULE=off
export GOMOD=""
export GIT_SSL_NO_VERIFY=1
make image_bins
make images
make generate-yaml
mkdir _output/DockFile/
docker save -o _output/DockFile/vc-webhook-manager-base.tar.gz volcanosh/vc-webhook-manager-base
docker save -o _output/DockFile/vc-webhook-manager.tar.gz volcanosh/vc-webhook-manager
docker save -o _output/DockFile/vc-controller-manager.tar.gz volcanosh/vc-controller-manager
docker save -o _output/DockFile/vc-vc-scheduler.tar.gz volcanosh/vc-scheduler
- Build the images
chmod +x build.sh
./build.sh
- List the images
docker images | grep volcanosh
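Reading the grep output by eye makes it easy to miss an image. A minimal sketch that reports which of the four repositories saved by build.sh are absent; `missing_images` is a hypothetical helper that reads repository names on stdin, so it can be fed from `docker images --format '{{.Repository}}'`:

```shell
#!/bin/sh
# Expected repositories, taken from the docker save lines in build.sh.
EXPECTED="volcanosh/vc-webhook-manager-base volcanosh/vc-webhook-manager volcanosh/vc-controller-manager volcanosh/vc-scheduler"

# missing_images: read one repository name per line on stdin and print
# every expected repository that is not in the list. Hypothetical helper.
missing_images() {
    present=$(cat)
    for repo in $EXPECTED; do
        # -x matches whole lines, -F treats the name as a fixed string
        printf '%s\n' "$present" | grep -qxF -- "$repo" || printf '%s\n' "$repo"
    done
}

# Usage on the build machine (requires Docker):
#   docker images --format '{{.Repository}}' | missing_images
```

Empty output means all four images were built.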
The following steps are performed on the Atlas server.
- Log in to the Atlas server and install golang
- Configure the golang build environment
Run:
vim ~/.bashrc
Add the following lines and save:
export GO111MODULE=on
export GOPROXY=https://gocenter.io
export GONOSUMDB=*
- Upload the ascend-device-plugin folder to any directory (for example, "/home")
- Create prepare_build.sh in the build directory of ascend-device-plugin
cd /home/ascend-device-plugin/build
vim prepare_build.sh
Enter the following content, adjusting the values to your environment:
#!/bin/bash
ASCNED_TYPE=910 # chip type: 310 or 910
ASCNED_INSTALL_PATH=/usr/local/Ascend # driver install path; adjust to your environment
USE_ASCEND_DOCKER=false # whether to use Ascend Docker; set to false
CUR_DIR=$(dirname $(readlink -f $0))
TOP_DIR=$(realpath ${CUR_DIR}/..)
LD_LIBRARY_PATH_PARA1=${ASCNED_INSTALL_PATH}/driver/lib64/driver
LD_LIBRARY_PATH_PARA2=${ASCNED_INSTALL_PATH}/driver/lib64
apt-get install -y pkg-config
apt-get install -y dos2unix
TYPE=Ascend910
PKG_PATH=${TOP_DIR}/src/plugin/config/config_910
PKG_PATH_STRING=\$\{TOP_DIR\}/src/plugin/config/config_910
LIBDRIVER="driver/lib64/driver"
if [ ${ASCNED_TYPE} == "310" ]; then
  TYPE=Ascend310
  LD_LIBRARY_PATH_PARA1=${ASCNED_INSTALL_PATH}/driver/lib64
  PKG_PATH=${TOP_DIR}/src/plugin/config/config_310
  PKG_PATH_STRING=\\$\\{TOP_DIR\\}/src/plugin/config/config_310
  LIBDRIVER="/driver/lib64"
fi
sed -i "s/Ascend[0-9]\\{3\\}/${TYPE}/g" ${TOP_DIR}/ascendplugin.yaml
sed -i "s#ath: /usr/local/Ascend/driver#ath: ${ASCNED_INSTALL_PATH}/driver#g" ${TOP_DIR}/ascendplugin.yaml
sed -i "/^ENV LD_LIBRARY_PATH /c ENV LD_LIBRARY_PATH ${LD_LIBRARY_PATH_PARA1}:${LD_LIBRARY_PATH_PARA2}/common" ${TOP_DIR}/Dockerfile
sed -i "/^ENV USE_ASCEND_DOCKER /c ENV USE_ASCEND_DOCKER ${USE_ASCEND_DOCKER}" ${TOP_DIR}/Dockerfile
sed -i "/^libdriver=/c libdriver=$\\{prefix\\}/${LIBDRIVER}" ${PKG_PATH}/ascend_device_plugin.pc
sed -i "/^prefix=/c prefix=${ASCNED_INSTALL_PATH}" ${PKG_PATH}/ascend_device_plugin.pc
sed -i "/^CONFIGDIR=/c CONFIGDIR=${PKG_PATH_STRING}" ${CUR_DIR}/build_in_docker.sh
- Build the image
chmod +x prepare_build.sh
./prepare_build.sh
chmod +x build_910.sh
dos2unix build_910.sh
./build_910.sh dockerimages
- Check the image
docker images | grep deviceplugin
IMAGE_PUSH_HUB_URL=harbor.sigsus.cn/sz_gongdianju/apulistech
or
IMAGE_PUSH_HUB_URL=apulistech
./scripts/kfserving.sh push istio
./scripts/kfserving.sh push knative
./scripts/kfserving.sh push kfserving
cd DLWorkspace/src/ClusterBootstrap/
vim config.yaml
cluster_name: atlas
network:
  domain: sigsus.cn
  container-network-iprange: "10.0.0.0/8"
UserGroups:
  DLWSAdmins:
    Allowed:
    - [email protected]
    gid: "20001"
    uid: "20000"
  DLWSRegister:
    Allowed:
    - '@gmail.com'
    - '@live.com'
    - '@outlook.com'
    - '@hotmail.com'
    - '@apulis.com'
    gid: "20001"
    uid: 20001-29999
WebUIadminGroups:
- DLWSAdmins
WebUIauthorizedGroups:
- DLWSAdmins
WebUIregisterGroups:
- DLWSRegister
datasource: MySQL
mysql_password: apulis#2019#wednesday
webuiport: 3081
useclusterfile : true
machines:
  master:
    role: infrastructure
    private-ip: 192.168.3.2
    archtype: arm64
    type: npu
    vendor: huawei
  worker01:
    archtype: amd64
    role: worker
    type: gpu
    vendor: nvidia
    os: ubuntu
  worker02:
    archtype: amd64
    role: worker
    type: gpu
    vendor: nvidia
    os: ubuntu
# settings for docker
private_docker_registry: harbor.sigsus.cn:8443/dlts/
dockerregistry: apulistech/
dockers:
  hub: apulistech/
  tag: "1.9"
dataFolderAccessPoint: ''
Authentications:
  Microsoft:
    TenantId:
    ClientId:
    ClientSecret:
  Wechat:
    AppId:
    AppSecret:
mountpoints:
  nfsshare1:
    type: nfs
    server: master
    filesharename: /mnt/local
    curphysicalmountpoint: /mntdlws
    mountpoints: ""
repair-manager:
  cluster_name: "atlas"
  ecc_rule:
    cordon_dry_run: True
  alert:
    smtp_url: smtp.qq.com
    login:
    password:
    sender:
    receiver: ["[email protected]"]
enable_custom_registry_secrets: True
platform_name: Apulis Platform
kube-vip: XXX.XXX.XXX.XXX
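Configuration notes
- Fields that must be changed to match your environment:
1) machines: set each machine's IP and short domain name
2) dockerregistry: set the organization name used on Docker Hub
3) kube-vip: with a single master, enter the internal IP of the master node
- Fields that depend on other steps:
1) private_docker_registry: the harbor.sigsus.cn:8443 part of this field must match the domain used in the Harbor configuration step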
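The dependency between private_docker_registry and the Harbor domain can be checked before deploying. A minimal sketch, assuming the key sits at the top level of config.yaml as in the example; `registry_host` and `check_registry` are hypothetical helpers, not part of deploy.py:

```shell
#!/bin/sh
# registry_host FILE: print the host[:port] part of private_docker_registry.
# Hypothetical helper; assumes the key starts at column 0 of config.yaml.
registry_host() {
    sed -n 's/^private_docker_registry:[[:space:]]*//p' "$1" | head -n 1 | cut -d/ -f1
}

# check_registry FILE EXPECTED: succeed only when the configured registry
# host matches the domain (and port) chosen in the Harbor step.
check_registry() {
    actual=$(registry_host "$1")
    if [ "$actual" = "$2" ]; then
        echo "ok: private_docker_registry points at $actual"
    else
        echo "mismatch: config has '$actual', Harbor uses '$2'" >&2
        return 1
    fi
}

# Usage, from DLWorkspace/src/ClusterBootstrap/:
#   check_registry config.yaml harbor.sigsus.cn:8443
```

The check only compares strings; it does not confirm that Harbor is reachable at that address.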
- Change directory: cd DLWorkspace/src/ClusterBootstrap/
- Set up the deployment node environment
./scripts/prepare_ubuntu_dev.sh
- Set up the cluster node environment
./deploy.py --verbose sshkey install
./deploy.py --verbose runscriptonall ./scripts/prepare_ubuntu.sh
./deploy.py --verbose runscriptonall ./scripts/prepare_ubuntu.sh continue
./deploy.py --verbose execonall sudo usermod -aG docker dlwsadmin
- Install the K8S cluster
./deploy.py --verbose execonall sudo swapoff -a
./deploy.py runscriptonroles infra worker ./scripts/install_kubeadm.sh
./deploy.py --verbose kubeadm init
./deploy.py --verbose copytoall ./deploy/sshkey/admin.conf /root/.kube/config
./deploy.py --verbose kubeadm join
./deploy.py --verbose -y kubernetes labelservice
./deploy.py --verbose -y labelworker
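These commands depend on each other (for example, kubeadm join needs a successful kubeadm init), so when running them unattended it helps to stop at the first failure. A minimal sketch; `run_steps` is a hypothetical wrapper, and it assumes each deploy.py command exits non-zero on failure, which the guide does not state:

```shell
#!/bin/sh
# run_steps: read one command per line on stdin and run them in order,
# stopping at the first command that exits non-zero. Hypothetical wrapper.
run_steps() {
    while IFS= read -r step; do
        [ -n "$step" ] || continue
        echo "==> $step"
        # </dev/null keeps the command (e.g. anything using ssh) from
        # swallowing the remaining lines of the step list
        sh -c "$step" </dev/null || { echo "step failed: $step" >&2; return 1; }
    done
}

# Usage, from DLWorkspace/src/ClusterBootstrap/:
#   run_steps <<'EOF'
#   ./deploy.py --verbose execonall sudo swapoff -a
#   ./deploy.py runscriptonroles infra worker ./scripts/install_kubeadm.sh
#   ./deploy.py --verbose kubeadm init
#   EOF
```

Because stdin is redirected, steps that prompt for input will not work under this wrapper; run those by hand.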
- Render the cluster configuration
./deploy.py renderservice
./deploy.py renderimage
./deploy.py webui
./deploy.py nginx webui3
./deploy.py nginx fqdn
./deploy.py nginx config
- Mount the shared storage
./deploy.py runscriptonroles infra worker ./scripts/install_nfs.sh
./deploy.py --force mount
./deploy.py execonall "df -h"
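The `df -h` output has to be read node by node. A minimal sketch of a per-node check instead, assuming the share ends up mounted at the curphysicalmountpoint value from config.yaml (/mntdlws in the example); `is_mounted` is a hypothetical helper that reads the kernel mount table:

```shell
#!/bin/sh
# is_mounted PATH [TABLE]: succeed when PATH appears as a mount point in TABLE.
# TABLE defaults to /proc/mounts; the second argument exists for testing.
is_mounted() {
    awk -v target="$1" '$2 == target { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Usage on a node (the path is the example value of curphysicalmountpoint):
#   is_mounted /mntdlws && echo "NFS share mounted" || echo "NFS share missing"
```

If the deployment mounts the share somewhere else, pass that path instead.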
- Start the cluster services
./deploy.py kubernetes start nvidia-device-plugin
./deploy.py kubernetes start a910-device-plugin
./deploy.py kubernetes start mysql
./deploy.py kubernetes start jobmanager2 restfulapi2 nginx custommetrics repairmanager2 openresty
./deploy.py --sudo --background runscriptonall scripts/npu/npu_info_gen.py
./deploy.py kubernetes start monitor
./deploy.py kubernetes start istio
./deploy.py kubernetes start knative kfserving
./deploy.py kubernetes start webui3 custom-user-dashboard image-label aiarts-frontend aiarts-backend data-platform
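Once the services are started, a quick way to spot ones that did not come up is to filter the pod list. A minimal sketch; `not_ready_pods` is a hypothetical helper that looks only at the STATUS column of `kubectl get pods --all-namespaces --no-headers`, so a pod that is Running but not yet ready will not be reported:

```shell
#!/bin/sh
# not_ready_pods: read `kubectl get pods --all-namespaces --no-headers`
# output on stdin and print every pod that is neither Running nor Completed.
# Columns are NAMESPACE NAME READY STATUS RESTARTS AGE.
not_ready_pods() {
    awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 ": " $4 }'
}

# Usage on the master node (requires kubectl and the admin kubeconfig):
#   kubectl get pods --all-namespaces --no-headers | not_ready_pods
```

Empty output means no pod is in a failed or pending state at that moment; images may still be pulling right after startup, so it can be worth re-running after a few minutes.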