Failed deployments using TKGm on VCD #1351
Comments
Hi @rvvliet78, could you please tell us which CSE server version you are using, and which VCD version?
We use CSE 3.1.2 and VCD 10.3.2.
I am tagging @sakthisunda for input.
We use the following:
Regarding the SNAT part: since the cluster creation sometimes fails at the worker node 2 deployment stage, worker node 1 and the mstr (control plane) node have already been deployed by then, which would indicate that all the SNAT and firewall rules are configured correctly. Last week we also created a Service Request and uploaded all the log files: SR 22322661504
I have also seen this issue with vCD 10.3.2 and CSE 3.1.2 / 3.1.3. The problem is that the customization script gets stuck, sometimes at an early stage, sometimes later. As mentioned by @rvvliet78, if the SSH key has been added you can log in and run the script manually. Sometimes it gets stuck before this stage.
I have seen this with the following OVAs:
I also have an SR open: 22325618704
Strangely, it seems better if I create a new SSH key. The SSH key I was pasting into the vCD UI setup wizard was one that belonged to my AD user, i.e. jleavers@mydomain.local@myserver. I created a new one, and cluster creation succeeded with it.
@jleavers I had a problem with deployment when I pasted an SSH key with an included newline; maybe that was the case with your old key?
I've had clusters fail at a later stage. A new template also worked for about a week. And this week I was able to create clusters with the system default policy, but they fail with a custom policy. I've also had this issue without an SSH key.
There was no newline, so that wasn't it. While I had a successful cluster creation with the new key, I have since tried to resize the cluster (i.e. add more workers) and the process has failed at various steps.
Hi @jleavers, can you try creating a few clusters with the default sizing policy using the SSH key that has been failing, to see if you get successful creations with the default sizing policy? If possible, could you also try deploying the cluster using the CSE CLI with the SSH key and custom sizing policy, and rollbackOnFailure set to false, so the failed VMs are kept for inspection?
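For reference, deploying via the CLI looks roughly like this. This is a sketch only: the spec fields follow the CSE 3.1 native-cluster format but vary between apiVersion v2.0 and v2.1, and all names and values below (org, oVDC, network, sizing class, key) are placeholders.

```
# Sketch: deploy via the CSE CLI instead of the UI wizard.
vcd login vcd.example.com myorg tenant-user
vcd cse cluster apply cluster-spec.yaml

# cluster-spec.yaml, trimmed to the parts relevant here
# (assumed field names; check the sample spec for your CSE version):
#
# apiVersion: cse.vmware.com/v2.0
# kind: native
# metadata:
#   name: test-cluster
#   orgName: myorg
#   virtualDataCenterName: myovdc
# spec:
#   settings:
#     ovdcNetwork: mynetwork
#     rollbackOnFailure: false      # keep failed VMs around for inspection
#     sshKey: "ssh-rsa AAAA..."     # the key that has been failing
#   topology:
#     controlPlane:
#       count: 1
#       sizingClass: 4CPU_6GB
#     workers:
#       count: 2
#       sizingClass: 4CPU_6GB
```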
Same here. It seems buggy, with different behaviours. If we try to deploy with two worker nodes (the default), the first worker node finishes successfully and the second worker ends with a timeout at "Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress". We also had problems deploying with SSH keys; we saw timeouts adding the SSH key. The same happens when we deploy only one worker:
22-05-19 19:02:25 | cluster_service_2_x_tkgm:2849 - wait_for_adding_worker_vm_to_vapp | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | DEBUG :: waiting for add worker vm to vapp, status: running
@cm3cmf Can you work with VMware Global Support to file a case for this failure and share back the case number?
A week ago I also opened SR 22327432305 for this kind of problem, and have already uploaded logs from successful and from failed clusters created with rollbackOnFailure: false. My suspicion is that VM customization is causing a reboot in the middle of the execution of the node.sh script or of cloud-init, so node preparation is never completed, but I haven't had a chance to inspect any further. Looking at /var/log/cloud-init-output.log on both VMs, the failed node has these lines containing the word "running":
and there was a reboot between 17:31:17 and 17:31:23.
and there was a reboot between 17:22:55 and 17:23:30. The last status update is that the CSE team will start investigating it this week. @goelaashima, if you need some logs maybe I can help, since I still have those failed and successful nodes.
@goelaashima since there are already similar SRs open, it makes no sense to raise a new one ...
@mnspodrska I am following up on SR 22327432305. Thanks.
@lzichong Looking at the docs here: https://vmware.github.io/container-service-extension/cse3_0/CLUSTER_MANAGEMENT.html, but getting an error. Any suggestions?
Additional testing:
No custom sizing policy (manually entered CPU and RAM instead), and no SSH key: OK.
It does not seem to be related to the sizing policy, and while there may be an issue with one SSH key, this does not seem to be the entire problem, because resizing the cluster to add more nodes can still fail; here it failed with an SSH key error even though there is no key. It is very random.
Hi @jleavers, could you check/post the CSE client log and list any errors/exceptions being thrown in there? This error may be happening due to some sort of version mismatch between 3.1.3/3.1.2 on the client/server side, but it is hard for us to tell. Following up on your last comment, it seems like we need to dig deeper and go into the VM to see what's happening during the script executions. Were you able to bypass the 2.1.0 error and deploy clusters using the CLI?
Hi @lzichong - I have not been able to bypass the 2.1.0 error as yet, unfortunately. There is not a huge amount of detail in the log.
vCD version: 10.3.2.19473806
Commands:
Logs:
Hi @jleavers, can you make a copy of the current logs for both server and client and attach them to the SR? Were you using CSE 3.1.3 prior to 3.1.2? If so, can you try using CSE version 3.1.3 and updating the apiVersion in the yaml file to v2.1?
Hi @lzichong - I did initially try CSE 3.1.3, however I removed it and reinstalled 3.1.2, as the interop matrix indicates that only vCD 10.3.3 is supported and I am on vCD 10.3.2.1, which can't be upgraded because 10.3.3 is not yet supported by the NSX-T migration tool. I will upload some logs.
@jleavers Apologies regarding the interop matrix for VCD and CSE; it wasn't showing all the possible supported versions. We have fixed it on our end. CSE 3.1.3 supports VCD 10.3.2.
Hi @goelaashima - thanks for the confirmation. I updated to CSE 3.1.3, but was seeing some strange errors, so I uninstalled, amended the venv to Python 3.8, and reinstalled. I am now back to the original problem of buggy behaviour on cluster creation or resize :-) Current status:
Attempt 1: cluster build successful. Resize failed:
Attempt 2: cluster build failed adding worker node:
As the SSH key had been added, I could SSH in and run the customization script manually (see the sketch after this list of attempts).
Attempt 3: cluster build failed adding worker node:
With this last attempt, I looked at the VM's console, and it said:
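For reference, the manual workaround mentioned above amounted to something like the following. This is a sketch: the login user depends on the OVA (root is an assumption here), and /root/node.sh is the worker-node script named earlier in this thread, with the control plane node having its own equivalent under /root.

```
# SSH in with the key that was supplied at creation time
# (user is template-dependent; root is an assumption):
ssh root@<node-ip>

# Re-run the customization script that the reboot interrupted:
sudo bash /root/node.sh
```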
Deletions are now failing with CSE 3.1.3; they get stuck on:
@jleavers I also see the stuck clusters that can't be deleted, running CSE 3.1.3. Tried re-running the deletion. From the cse-server-debug.log it says:
I was wondering if a solution or workaround has been found for the issues described in the opening post? I also have a bunch of clusters I can't delete, with the same errors as described above.
After some screen-shares on the VMware SR, we found that the VMs were rebooting, and the last command that ran prior to the reboot was a check for the existence of certain files.
The files it was checking for did seem to exist, but potentially they did not yet exist when the script ran. As a temporary workaround we removed these checks by editing the cloud-init files. I am running CSE in a Python 3.8 venv, so for me the paths are as below, although they may vary on other setups.
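A sketch of how to locate them, assuming a standard venv layout (the venv path and the cloudinit subdirectory are assumptions to verify on your own install; the two yaml file names are the ones referenced later in this thread):

```
# Locate the installed CSE package (works regardless of venv path):
~/cse-venv/bin/python -c "import container_service_extension as cse, os; print(os.path.dirname(cse.__file__))"

# The cloud-init templates should then live under that directory, e.g. (assumed layout):
# <site-packages>/container_service_extension/cloudinit/cloud_init_control_plane.yaml
# <site-packages>/container_service_extension/cloudinit/cloud_init_node.yaml
```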
If you scroll to the end of those files you will see the checks in question. I am seeing fewer cluster deployment failures after this change. There are still some that I see getting stuck; however, if I log into the node and run the script manually, the node completes its setup.
For the deletions, some extra permissions had to be granted to the vCD tenant user via the API, using the CSE service account credentials. This was a POST request to https://vcd-url/cloudapi/1.0.0/entityTypes/urn:vcloud:type:cse:nativeCluster:2.1.0/accessControls/ with the following payload:
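(Reconstructed sketch of the call: the grant fields follow the standard VCD defined-entity access-control shape, which is my assumption rather than the exact payload used here, and the bearer-token variable and user UUID are placeholders.)

```
# POST as the CSE service account ($CSE_TOKEN is a placeholder):
curl -k -X POST \
  -H "Authorization: Bearer $CSE_TOKEN" \
  -H "Content-Type: application/json" \
  "https://vcd-url/cloudapi/1.0.0/entityTypes/urn:vcloud:type:cse:nativeCluster:2.1.0/accessControls/" \
  -d '{
    "grantType": "MembershipAccessControlGrant",
    "accessLevelId": "urn:vcloud:accessLevel:FullControl",
    "memberId": "urn:vcloud:user:<tenant-user-uuid>"
  }'
```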
You can get the user's UUID via the vCD UI by clicking on the username in Administration -> Access Control -> Users; it will be in the URL.
NB: as shown, this API call was for nativeCluster version 2.1.0. This may differ in other environments, so you would need to check the entityTypes with a GET call to https://vcd-url/cloudapi/1.0.0/entityTypes/ as well as the version of each deployed cluster with a GET call to https://vcd-url/cloudapi/1.0.0/entities/interfaces/vmware/k8s/1.0.0.
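For example (same placeholder token as above):

```
# List the registered entity types to find the nativeCluster version in use:
curl -k -H "Authorization: Bearer $CSE_TOKEN" \
  "https://vcd-url/cloudapi/1.0.0/entityTypes/"

# List the deployed cluster entities; each entity's entityType carries its version:
curl -k -H "Authorization: Bearer $CSE_TOKEN" \
  "https://vcd-url/cloudapi/1.0.0/entities/interfaces/vmware/k8s/1.0.0"
```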
I got similar advice to remove those sudo reboot commands from cloud_init_control_plane.yaml and cloud_init_node.yaml, but it seems to me that the reboot is not caused by the cloud-init runcmd but by the VMware imc customization, since /var/log/vmware-imc/toolsDeployPkg.log contains these lines:
[2022-05-30T19:40:08.292Z] [ info] sSkipReboot: 'false', forceSkipReboot 'false'.
[2022-05-30T19:40:08.292Z] [ debug] Ran DeployPkg_DeployPackageFromFile successfully
I am under the impression that cloud-init is somehow triggered before the imc customization is done, so it gets interrupted by imc rebooting after network configuration. We got the idea to try to override the cloud-init-local systemd service to make it wait for imc customization to end, something like the sketch below. Will report what we find.
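A sketch of such an override, baked into the template: the systemd drop-in mechanism is standard, but the completion marker grepped for is an assumption based on the toolsDeployPkg.log lines above, not a verified fix.

```
# Delay cloud-init-local until the imc customization log reports completion.
mkdir -p /etc/systemd/system/cloud-init-local.service.d
cat > /etc/systemd/system/cloud-init-local.service.d/override.conf <<'EOF'
[Service]
# Wait (up to the unit's start timeout) for the assumed imc completion marker:
ExecStartPre=/bin/sh -c 'until grep -q "Ran DeployPkg_DeployPackageFromFile successfully" /var/log/vmware-imc/toolsDeployPkg.log 2>/dev/null; do sleep 2; done'
EOF
systemctl daemon-reload
```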
I see a sleep statement being added here: [VCDA-3786] Fix for ubuntu 20 cluster creation |
Describe the bug
When deploying a TKGm cluster from VCD we get inconsistent results: in some cases deployment is successful, but in others it fails at creating the control plane node, or later when deploying the worker nodes (sometimes the first, sometimes the second).
In all cases we see it fail during the customization phase in the cse-server-debug.log:
It fails at different steps, where it times out with many lines like:
INFO :: Post guest customization phase guestinfo.postcustomization.
Sometimes it is:
guestinfo.postcustomization.networkconfiguration.status
sometimes:
guestinfo.postcustomization.store.sshkey.status
and sometimes a later step.
When it does get past the SSH key step, I can log in to the VM, run the scripts from /root manually, and the node will be added to the cluster, but this should not be needed.
What is the best approach to troubleshoot the origin of this issue?
This also seems to happen when adding worker nodes to an existing cluster.
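One way to see which customization phase a node is stuck in is to read the same guestinfo properties the CSE server polls, from inside the guest via VMware Tools. A sketch; the property names are the ones from the log excerpts in this report:

```
# Inside the stuck VM (requires open-vm-tools):
vmtoolsd --cmd "info-get guestinfo.postcustomization.networkconfiguration.status"
vmtoolsd --cmd "info-get guestinfo.postcustomization.store.sshkey.status"
vmtoolsd --cmd "info-get guestinfo.postcustomization.proxy.setting.status"

# Check whether a reboot interrupted cloud-init or the imc customization:
grep -i reboot /var/log/cloud-init-output.log
last reboot
```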
Reproduction steps
When creating a cluster the results are mixed; I can't really tell when it will fail.
Expected behavior
Run the deploy command and the cluster is created.
Additional context
Some extra log info from a deployment that failed during control plane node creation:
22-04-14 06:59:25 | mqtt_publisher:116 - send_response | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: publish return (rc, msg_id): (0, 82)
22-04-14 06:59:26 | cluster_service_2_x_tkgm:2684 - _get_vm_specifications | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: Found sizing policy with name 4CPU_6GB on the VDC rvvliet - Production
22-04-14 06:59:27 | cluster_service_2_x_tkgm:2591 - wait_for_adding_control_plane_vm_to_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for control plane add vm to vapp, status: queued
22-04-14 06:59:32 | cluster_service_2_x_tkgm:2591 - wait_for_adding_control_plane_vm_to_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for control plane add vm to vapp, status: running
22-04-14 06:59:37 | cluster_service_2_x_tkgm:2591 - wait_for_adding_control_plane_vm_to_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for control plane add vm to vapp, status: running
22-04-14 06:59:42 | cluster_service_2_x_tkgm:2591 - wait_for_adding_control_plane_vm_to_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for control plane add vm to vapp, status: running
22-04-14 06:59:47 | cluster_service_2_x_tkgm:2591 - wait_for_adding_control_plane_vm_to_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for control plane add vm to vapp, status: running
22-04-14 06:59:52 | cluster_service_2_x_tkgm:2591 - wait_for_adding_control_plane_vm_to_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for control plane add vm to vapp, status: running
22-04-14 06:59:57 | cluster_service_2_x_tkgm:2591 - wait_for_adding_control_plane_vm_to_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for control plane add vm to vapp, status: running
22-04-14 07:00:02 | cluster_service_2_x_tkgm:2591 - wait_for_adding_control_plane_vm_to_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for control plane add vm to vapp, status: running
22-04-14 07:00:08 | cluster_service_2_x_tkgm:2591 - wait_for_adding_control_plane_vm_to_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for control plane add vm to vapp, status: running
22-04-14 07:00:13 | cluster_service_2_x_tkgm:2591 - wait_for_adding_control_plane_vm_to_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for control plane add vm to vapp, status: running
22-04-14 07:00:18 | cluster_service_2_x_tkgm:2591 - wait_for_adding_control_plane_vm_to_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for control plane add vm to vapp, status: running
22-04-14 07:00:23 | cluster_service_2_x_tkgm:2591 - wait_for_adding_control_plane_vm_to_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for control plane add vm to vapp, status: success
22-04-14 07:00:34 | cluster_service_2_x_tkgm:2619 - wait_for_updating_cloud_init_spec | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: cloud init spec, status: queued
22-04-14 07:00:39 | cluster_service_2_x_tkgm:2619 - wait_for_updating_cloud_init_spec | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: cloud init spec, status: success
22-04-14 07:00:40 | cluster_service_2_x_tkgm:2623 - wait_for_updating_cloud_init_spec_encoding | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: cloud init spec encoding, status: queued
22-04-14 07:00:45 | cluster_service_2_x_tkgm:2623 - wait_for_updating_cloud_init_spec_encoding | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: cloud init spec encoding, status: success
22-04-14 07:00:45 | cluster_service_2_x_tkgm:2599 - wait_for_vm_power_on | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for vm power on, status: queued
22-04-14 07:00:50 | cluster_service_2_x_tkgm:2599 - wait_for_vm_power_on | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for vm power on, status: running
22-04-14 07:00:56 | cluster_service_2_x_tkgm:2599 - wait_for_vm_power_on | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for vm power on, status: running
22-04-14 07:01:01 | cluster_service_2_x_tkgm:2599 - wait_for_vm_power_on | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for vm power on, status: success
22-04-14 07:01:01 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.networkconfiguration.status is None
22-04-14 07:01:06 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.networkconfiguration.status is successful
22-04-14 07:01:07 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.store.sshkey.status is None
22-04-14 07:01:12 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.store.sshkey.status is None
22-04-14 07:01:17 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.store.sshkey.status is successful
22-04-14 07:01:17 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.proxy.setting.status is None
22-04-14 07:01:22 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.proxy.setting.status is None
22-04-14 07:01:28 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.proxy.setting.status is None
22-04-14 07:01:33 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.proxy.setting.status is None
22-04-14 07:01:38 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.proxy.setting.status is None
22-04-14 07:01:43 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.proxy.setting.status is None
22-04-14 07:01:48 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.proxy.setting.status is None
TRUNCATED
22-04-14 07:11:07 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.proxy.setting.status is None
22-04-14 07:11:13 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.proxy.setting.status is None
22-04-14 07:11:18 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.proxy.setting.status is None
22-04-14 07:11:18 | pyvcloud_utils:773 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | ERROR :: VM Post guest customization failed due to timeout(600 sec)
22-04-14 07:11:18 | cluster_service_2_x_tkgm:2354 - _add_control_plane_nodes | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | ERROR ::
22-04-14 07:11:18 | cluster_service_2_x_tkgm:862 - _create_cluster_async | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | ERROR :: failure on creating nodes ['mstr-hzbg']
22-04-14 07:11:18 | cluster_service_2_x_tkgm:958 - _create_cluster_async | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | ERROR :: Error creating cluster 'otomi-poc'
22-04-14 07:11:18 | cluster_service_2_x_tkgm:1747 - _delete_refresh_token | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: Successfully deleted the refresh token with name cluster-urn:vcloud:entity:cse:nativeCluster:16c14277-e6ba-4b3f-8f10-f48b372d5a15
22-04-14 07:11:18 | cluster_service_2_x_tkgm:1714 - _update_task | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: Sending behavior response:{'type': 'BEHAVIOR_RESPONSE', 'headers': {'taskId': '38ded585-4c92-4fd0-8f24-1a363eb29771', 'entityId': 'urn:vcloud:entity:cse:nativeCluster:16c14277-e6ba-4b3f-8f10-f48b372d5a15', 'contentType': 'application/vnd.vmware.vcloud.task+json'}, 'payload': '{"status": "running", "operation": "Error creating cluster \'otomi-poc\'. Deleting cluster (rollback=True)"}'}
22-04-14 07:11:18 | mqtt_publisher:116 - send_response | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: publish return (rc, msg_id): (0, 83)
22-04-14 07:11:18 | cluster_service_2_x_tkgm:972 - _create_cluster_async | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Error creating cluster 'otomi-poc'. Deleting cluster (rollback=True)
22-04-14 07:11:18 | cluster_service_2_x_tkgm:2025 - _delete_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: Deleting vapp otomi-poc in (org: rvvliet, vdc: rvvliet - Production)
22-04-14 07:11:39 | cluster_service_2_x_tkgm:2040 - _delete_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: Deleted vapp otomi-poc (vdc: https://az3.nl.sentia.cloud/api/vdc/dda63040-ce39-404f-b593-2782d9976d9b)
22-04-14 07:11:45 | cluster_service_2_x_tkgm:990 - _create_cluster_async | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Deleted dnat rule for cluster otomi-poc (urn:vcloud:entity:cse:nativeCluster:16c14277-e6ba-4b3f-8f10-f48b372d5a15)
22-04-14 07:11:46 | entity_service:603 - resolve_entity | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | ERROR :: #/status: required key [nodes] not found
22-04-14 07:11:46 | cluster_service_2_x_tkgm:1714 - _update_task | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: Sending behavior response:{'type': 'BEHAVIOR_RESPONSE', 'headers': {'taskId': '38ded585-4c92-4fd0-8f24-1a363eb29771', 'entityId': 'urn:vcloud:entity:cse:nativeCluster:16c14277-e6ba-4b3f-8f10-f48b372d5a15', 'contentType': 'application/vnd.vmware.vcloud.task+json'}, 'payload': '{"status": "error", "error": {"majorErrorCode": "500", "minorErrorCode": "Error creating cluster \'otomi-poc\'. Deleting cluster (rollback=True)", "message": "Error adding control plane node: failure on creating nodes [\'mstr-hzbg\']\\nError:"}}'}
22-04-14 07:11:46 | mqtt_publisher:116 - send_response | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: publish return (rc, msg_id): (0, 84)