Failed deployments using TKGm on VCD #1351

Open
rvvliet78 opened this issue Apr 13, 2022 · 30 comments

@rvvliet78

rvvliet78 commented Apr 13, 2022

Describe the bug

When deploying a TKGm cluster from VCD we get inconsistent results: in some cases the deployment is successful, but in others it fails while creating the control plane node, or later while deploying the worker nodes (sometimes the first, sometimes the second).
In all cases the cse-server-debug log shows the failure during the guest customization phase. It fails at different steps, timing out after many repetitions of:
INFO :: Post guest customization phase guestinfo.postcustomization.
sometimes at:
guestinfo.postcustomization.networkconfiguration.status
sometimes at:
guestinfo.postcustomization.store.sshkey.status
and sometimes at a later step.

When it does get past the SSH key step, I can log in to the VM, run the scripts from /root manually, and the node will then be added to the cluster, but this should not be necessary.

What is the best approach to troubleshoot the origin of this issue?

This also seems to happen when adding worker nodes to an existing cluster.
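
For reference, one way to see where a stuck node is in the process is to query the same post-customization flags from inside the VM (a minimal sketch, assuming console or SSH access and that open-vm-tools is present in the TKGm image; the key names are the ones from the log lines below):

# query the guestinfo flags that the CSE server polls
vmtoolsd --cmd "info-get guestinfo.postcustomization.networkconfiguration.status"
vmtoolsd --cmd "info-get guestinfo.postcustomization.store.sshkey.status"
# the customization scripts run under cloud-init, so their output lands here
tail -n 50 /var/log/cloud-init-output.log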

Reproduction steps

When creating a cluster the results are mixed; I can't really tell when it will fail.

Expected behavior

Issue the deploy command and the cluster should be created.

Additional context

Some extra log output from a deployment that failed during control plane node creation:

22-04-14 06:59:25 | mqtt_publisher:116 - send_response | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: publish return (rc, msg_id): (0, 82)
22-04-14 06:59:26 | cluster_service_2_x_tkgm:2684 - _get_vm_specifications | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: Found sizing policy with name 4CPU_6GB on the VDC rvvliet - Production
22-04-14 06:59:27 | cluster_service_2_x_tkgm:2591 - wait_for_adding_control_plane_vm_to_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for control plane add vm to vapp, status: queued
22-04-14 06:59:32 | cluster_service_2_x_tkgm:2591 - wait_for_adding_control_plane_vm_to_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for control plane add vm to vapp, status: running
22-04-14 06:59:37 | cluster_service_2_x_tkgm:2591 - wait_for_adding_control_plane_vm_to_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for control plane add vm to vapp, status: running
22-04-14 06:59:42 | cluster_service_2_x_tkgm:2591 - wait_for_adding_control_plane_vm_to_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for control plane add vm to vapp, status: running
22-04-14 06:59:47 | cluster_service_2_x_tkgm:2591 - wait_for_adding_control_plane_vm_to_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for control plane add vm to vapp, status: running
22-04-14 06:59:52 | cluster_service_2_x_tkgm:2591 - wait_for_adding_control_plane_vm_to_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for control plane add vm to vapp, status: running
22-04-14 06:59:57 | cluster_service_2_x_tkgm:2591 - wait_for_adding_control_plane_vm_to_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for control plane add vm to vapp, status: running
22-04-14 07:00:02 | cluster_service_2_x_tkgm:2591 - wait_for_adding_control_plane_vm_to_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for control plane add vm to vapp, status: running
22-04-14 07:00:08 | cluster_service_2_x_tkgm:2591 - wait_for_adding_control_plane_vm_to_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for control plane add vm to vapp, status: running
22-04-14 07:00:13 | cluster_service_2_x_tkgm:2591 - wait_for_adding_control_plane_vm_to_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for control plane add vm to vapp, status: running
22-04-14 07:00:18 | cluster_service_2_x_tkgm:2591 - wait_for_adding_control_plane_vm_to_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for control plane add vm to vapp, status: running
22-04-14 07:00:23 | cluster_service_2_x_tkgm:2591 - wait_for_adding_control_plane_vm_to_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for control plane add vm to vapp, status: success
22-04-14 07:00:34 | cluster_service_2_x_tkgm:2619 - wait_for_updating_cloud_init_spec | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: cloud init spec, status: queued
22-04-14 07:00:39 | cluster_service_2_x_tkgm:2619 - wait_for_updating_cloud_init_spec | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: cloud init spec, status: success
22-04-14 07:00:40 | cluster_service_2_x_tkgm:2623 - wait_for_updating_cloud_init_spec_encoding | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: cloud init spec encoding, status: queued
22-04-14 07:00:45 | cluster_service_2_x_tkgm:2623 - wait_for_updating_cloud_init_spec_encoding | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: cloud init spec encoding, status: success
22-04-14 07:00:45 | cluster_service_2_x_tkgm:2599 - wait_for_vm_power_on | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for vm power on, status: queued
22-04-14 07:00:50 | cluster_service_2_x_tkgm:2599 - wait_for_vm_power_on | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for vm power on, status: running
22-04-14 07:00:56 | cluster_service_2_x_tkgm:2599 - wait_for_vm_power_on | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for vm power on, status: running
22-04-14 07:01:01 | cluster_service_2_x_tkgm:2599 - wait_for_vm_power_on | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: waiting for vm power on, status: success
22-04-14 07:01:01 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.networkconfiguration.status is None
22-04-14 07:01:06 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.networkconfiguration.status is successful
22-04-14 07:01:07 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.store.sshkey.status is None
22-04-14 07:01:12 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.store.sshkey.status is None
22-04-14 07:01:17 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.store.sshkey.status is successful
22-04-14 07:01:17 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.proxy.setting.status is None
22-04-14 07:01:22 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.proxy.setting.status is None
22-04-14 07:01:28 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.proxy.setting.status is None
22-04-14 07:01:33 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.proxy.setting.status is None
22-04-14 07:01:38 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.proxy.setting.status is None
22-04-14 07:01:43 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.proxy.setting.status is None
22-04-14 07:01:48 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.proxy.setting.status is None
TRUNCATED
22-04-14 07:11:07 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.proxy.setting.status is None
22-04-14 07:11:13 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.proxy.setting.status is None
22-04-14 07:11:18 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Post guest customization phase guestinfo.postcustomization.proxy.setting.status is None
22-04-14 07:11:18 | pyvcloud_utils:773 - wait_for_completion_of_post_customization_procedure | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | ERROR :: VM Post guest customization failed due to timeout(600 sec)
22-04-14 07:11:18 | cluster_service_2_x_tkgm:2354 - _add_control_plane_nodes | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | ERROR ::
22-04-14 07:11:18 | cluster_service_2_x_tkgm:862 - _create_cluster_async | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | ERROR :: failure on creating nodes ['mstr-hzbg']
22-04-14 07:11:18 | cluster_service_2_x_tkgm:958 - _create_cluster_async | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | ERROR :: Error creating cluster 'otomi-poc'
22-04-14 07:11:18 | cluster_service_2_x_tkgm:1747 - _delete_refresh_token | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: Successfully deleted the refresh token with name cluster-urn:vcloud:entity:cse:nativeCluster:16c14277-e6ba-4b3f-8f10-f48b372d5a15
22-04-14 07:11:18 | cluster_service_2_x_tkgm:1714 - _update_task | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: Sending behavior response:{'type': 'BEHAVIOR_RESPONSE', 'headers': {'taskId': '38ded585-4c92-4fd0-8f24-1a363eb29771', 'entityId': 'urn:vcloud:entity:cse:nativeCluster:16c14277-e6ba-4b3f-8f10-f48b372d5a15', 'contentType': 'application/vnd.vmware.vcloud.task+json'}, 'payload': '{"status": "running", "operation": "Error creating cluster \'otomi-poc\'. Deleting cluster (rollback=True)"}'}
22-04-14 07:11:18 | mqtt_publisher:116 - send_response | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: publish return (rc, msg_id): (0, 83)
22-04-14 07:11:18 | cluster_service_2_x_tkgm:972 - _create_cluster_async | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Error creating cluster 'otomi-poc'. Deleting cluster (rollback=True)
22-04-14 07:11:18 | cluster_service_2_x_tkgm:2025 - _delete_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: Deleting vapp otomi-poc in (org: rvvliet, vdc: rvvliet - Production)
22-04-14 07:11:39 | cluster_service_2_x_tkgm:2040 - _delete_vapp | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: Deleted vapp otomi-poc (vdc: https://az3.nl.sentia.cloud/api/vdc/dda63040-ce39-404f-b593-2782d9976d9b)
22-04-14 07:11:45 | cluster_service_2_x_tkgm:990 - _create_cluster_async | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | INFO :: Deleted dnat rule for cluster otomi-poc (urn:vcloud:entity:cse:nativeCluster:16c14277-e6ba-4b3f-8f10-f48b372d5a15)
22-04-14 07:11:46 | entity_service:603 - resolve_entity | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | ERROR :: #/status: required key [nodes] not found
22-04-14 07:11:46 | cluster_service_2_x_tkgm:1714 - _update_task | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: Sending behavior response:{'type': 'BEHAVIOR_RESPONSE', 'headers': {'taskId': '38ded585-4c92-4fd0-8f24-1a363eb29771', 'entityId': 'urn:vcloud:entity:cse:nativeCluster:16c14277-e6ba-4b3f-8f10-f48b372d5a15', 'contentType': 'application/vnd.vmware.vcloud.task+json'}, 'payload': '{"status": "error", "error": {"majorErrorCode": "500", "minorErrorCode": "Error creating cluster \'otomi-poc\'. Deleting cluster (rollback=True)", "message": "Error adding control plane node: failure on creating nodes [\'mstr-hzbg\']\\nError:"}}'}
22-04-14 07:11:46 | mqtt_publisher:116 - send_response | Request Id: acd20ee7-1256-4d8c-9e06-e78a1b861e75 | DEBUG :: publish return (rc, msg_id): (0, 84)

@rvvliet78 rvvliet78 added the bug label Apr 13, 2022
@ltimothy7
Contributor

Hi @rvvliet78, could you please tell us which CSE server version and which VCD version you are using?

@rvvliet78
Author

We use CSE 3.1.2 and VCD 10.3.2.

@ltimothy7
Contributor

I am tagging @sakthisunda for input.

@sakthisunda
Contributor

  • Would you please tell us which TKGm OVA is used (product version 1.4 or 1.3, and the OVA file name)? This will help us verify or reproduce the issue at our end, if needed.
  • What kind of OVDC network is it? For comparison, we have done a lot of successful tests with a routed org VDC, with VMs on an OVDC network (192.168.7.0/24) and an SNAT rule that maps the internal CIDR to an external IP.
  • If you are using a routed org VDC network, make sure that your SNAT rule works (a 0.0.0.0/24 rule does NOT work due to some issues in VCD).
  • Last but not least, a flaky internet connection from the VM to the external repository during cluster creation can cause a timeout (a quick connectivity check is sketched below).
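
A minimal connectivity check along those lines, run from a node that did come up (the registry host here is an assumption: TKGm core packages are typically pulled from projects.registry.vmware.com, so substitute whatever repository your clusters actually use):

# confirm DNS resolution and outbound HTTPS through the SNAT rule
nslookup projects.registry.vmware.com
curl -sI https://projects.registry.vmware.com | head -n 1
# show which route/gateway the traffic actually uses
ip route get 1.1.1.1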

@rvvliet78
Author

We use the following:

  • OVA: ubuntu-2004-kube-v1.21.8-vmware.1-tkg.2-ed3c93616a02968be452fe1934a1d37c (and have the same issue with version 1.21.2)
  • Network (in this example): 192.168.230.1/24, part of a VDC group, with SNAT from the subnet and full access to the internet. We've set up an IP pool with more than enough IPs available.
  • SNAT works and FW is fully open (using the any rule)
  • We have not seen any flaky internet connections (otherwise we would have loads of customer complaints as well)

Regarding the SNAT part: since cluster creation sometimes fails at the worker node 2 deployment stage, worker node 1 and the control plane node are already deployed by then, which indicates that SNAT and the firewall are configured correctly.

Last week we also created a Service Request and uploaded all the log files: SR 22322661504

@jleavers

jleavers commented May 5, 2022

I have also seen this issue with vCD 10.3.2 and CSE 3.1.2 / 3.1.3. The problem is that the customization script gets stuck, sometimes at an early stage, sometimes later. As mentioned by @rvvliet78, if the SSH key has been added you can log in and run the script manually; sometimes it gets stuck before this stage.

/opt/cse/.cse-logs/cse-server-info.log:22-05-03 10:55:28 | Request Id: 7103dd5f-fb8f-43de-a02b-7704fc5d8da4 | INFO :: Post guest customization phase guestinfo.postcustomization.networkconfiguration.status is None
/opt/cse/.cse-logs/cse-server-info.log:22-05-03 10:55:33 | Request Id: 7103dd5f-fb8f-43de-a02b-7704fc5d8da4 | INFO :: Post guest customization phase guestinfo.postcustomization.networkconfiguration.status is None
/opt/cse/.cse-logs/cse-server-info.log:22-05-03 10:55:38 | Request Id: 7103dd5f-fb8f-43de-a02b-7704fc5d8da4 | INFO :: Post guest customization phase guestinfo.postcustomization.networkconfiguration.status is None
/opt/cse/.cse-logs/cse-server-info.log:22-05-03 10:55:44 | Request Id: 7103dd5f-fb8f-43de-a02b-7704fc5d8da4 | INFO :: Post guest customization phase guestinfo.postcustomization.networkconfiguration.status is None
/opt/cse/.cse-logs/cse-server-info.log:22-05-03 10:55:49 | Request Id: 7103dd5f-fb8f-43de-a02b-7704fc5d8da4 | INFO :: Post guest customization phase guestinfo.postcustomization.networkconfiguration.status is None
/opt/cse/.cse-logs/cse-server-info.log:22-05-03 10:55:54 | Request Id: 7103dd5f-fb8f-43de-a02b-7704fc5d8da4 | INFO :: Post guest customization phase guestinfo.postcustomization.networkconfiguration.status is None
/opt/cse/.cse-logs/cse-server-info.log:22-05-03 10:55:59 | Request Id: 7103dd5f-fb8f-43de-a02b-7704fc5d8da4 | INFO :: Post guest customization phase guestinfo.postcustomization.networkconfiguration.status is None

I have seen this with the following OVAs:

  • ubuntu-2004-kube-v1.21.8+vmware.1-tkg.2-ed3c93616a02968be452fe1934a1d37c.ova
  • ubuntu-2004-kube-v1.22.5+vmware.1-tkg.2-f838b27ca494fee7083c0340e11ce243.ova

I also have an SR open: 22325618704

@jleavers

jleavers commented May 6, 2022

Strangely, it seems to work better if I create a new SSH key. The SSH key I was pasting into the vCD UI setup wizard was the one belonging to my AD user, i.e. jleavers@mydomain.local@myserver. I created a new one with ssh-keygen -t rsa -C Tanzu and have successfully built three clusters. When I changed back to the original SSH key, it failed three times in a row.

@mnspodrska

@jleavers I had a problem with deployment when I pasted an SSH key that included a newline; maybe that was the case with your old key?

@rvvliet78
Author

rvvliet78 commented May 6, 2022

I've had clusters fail at a later stage as well. A new template also worked for about a week. And this week I was able to create clusters with a system default sizing policy, but they failed with a custom policy. I've also had this issue without an SSH key.
This past week I had a troubleshooting session, and another session is planned for next week.

@jleavers

There was no newline, so that wasn't it. While I had a successful cluster creation with the new key, I have since tried to resize the cluster (i.e. add more workers) and the process has failed at various steps, e.g.:

  • Attempt 1: Post guest customization phase guestinfo.postcustomization.proxy.setting.status
  • Attempt 2: Post guest customization phase guestinfo.postcustomization.kubeadm.node.join.status

@lzichong

lzichong commented May 11, 2022

Hi @jleavers, can you try creating a few clusters with the default sizing policy using the SSH key that has been failing, to see whether you get successful creations with the default sizing policy? If possible, could you also try deploying the cluster through the CSE CLI with the SSH key and the custom sizing policy, and set rollbackOnFailure: false in spec.settings? You can specify the custom sizing policy in topology.controlPlane/workers/nfs.sizingClass. Rollback on failure is not an option in the UI, and disabling it prevents the VM from being deleted on failure, which makes debugging easier and lets you run the script manually. A sketch of such a spec follows below.
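
For illustration only, a minimal sketch of such a spec (the apiVersion value, kind, and all names and the template below are placeholders or assumptions; generate a real sample spec with the CSE CLI and adjust it to your environment):

cat > cluster-spec.yaml <<'EOF'
apiVersion: cse.vmware.com/v2.1   # assumed RDE version; match your CSE release
kind: TKGm
metadata:
  name: test-cluster              # placeholder
  orgName: my-org                 # placeholder
  virtualDataCenterName: my-ovdc  # placeholder
spec:
  settings:
    ovdcNetwork: my-ovdc-network  # placeholder
    rollbackOnFailure: false      # keep failed VMs around for debugging
    sshKey: "ssh-rsa AAAA... tanzu"
  topology:
    controlPlane:
      count: 1
      sizingClass: 4CPU_6GB       # custom sizing policy under test
    workers:
      count: 1
      sizingClass: 4CPU_6GB
  distribution:
    templateName: ubuntu-2004-kube-v1.21.8+vmware.1-tkg.2   # placeholder
    templateRevision: 1
EOF
vcd cse cluster apply cluster-spec.yaml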

@adestis-mf

Same here. It seems buggy, with different behaviors. If we try to deploy with two worker nodes (the default), the first worker node finishes successfully and the second worker ends with a timeout at "Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress". We also had problems deploying with SSH keys; we saw timeouts while the SSH key was being added.
But sometimes it works and sometimes it doesn't, which is why I rule out infrastructure problems.

Same when we deploy only one worker:

22-05-19 19:02:25 | cluster_service_2_x_tkgm:2849 - wait_for_adding_worker_vm_to_vapp | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | DEBUG :: waiting for add worker vm to vapp, status: running
22-05-19 19:02:30 | cluster_service_2_x_tkgm:2849 - wait_for_adding_worker_vm_to_vapp | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | DEBUG :: waiting for add worker vm to vapp, status: running
22-05-19 19:02:35 | cluster_service_2_x_tkgm:2849 - wait_for_adding_worker_vm_to_vapp | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | DEBUG :: waiting for add worker vm to vapp, status: running
22-05-19 19:02:40 | cluster_service_2_x_tkgm:2849 - wait_for_adding_worker_vm_to_vapp | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | DEBUG :: waiting for add worker vm to vapp, status: success
22-05-19 19:02:42 | cluster_service_2_x_tkgm:2827 - _get_kube_config_from_control_plane_vm | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | DEBUG :: Got kubeconfig from control plane extra configuration successfully
22-05-19 19:02:42 | cluster_service_2_x_tkgm:2873 - wait_for_updating_cloud_init_spec | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | DEBUG :: cloud init spec, status: queued
22-05-19 19:02:48 | cluster_service_2_x_tkgm:2873 - wait_for_updating_cloud_init_spec | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | DEBUG :: cloud init spec, status: running
22-05-19 19:02:53 | cluster_service_2_x_tkgm:2873 - wait_for_updating_cloud_init_spec | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | DEBUG :: cloud init spec, status: success
22-05-19 19:02:54 | cluster_service_2_x_tkgm:2877 - wait_for_updating_cloud_init_spec_encoding | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | DEBUG :: cloud init spec encoding, status: queued
22-05-19 19:03:00 | cluster_service_2_x_tkgm:2877 - wait_for_updating_cloud_init_spec_encoding | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | DEBUG :: cloud init spec encoding, status: running
22-05-19 19:03:05 | cluster_service_2_x_tkgm:2877 - wait_for_updating_cloud_init_spec_encoding | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | DEBUG :: cloud init spec encoding, status: success
22-05-19 19:03:06 | cluster_service_2_x_tkgm:2881 - wait_for_updating_kubeconfig | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | DEBUG :: adding kubeconfig, status: queued
22-05-19 19:03:11 | cluster_service_2_x_tkgm:2881 - wait_for_updating_kubeconfig | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | DEBUG :: adding kubeconfig, status: success
22-05-19 19:03:12 | cluster_service_2_x_tkgm:2853 - wait_for_vm_power_on | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | DEBUG :: waiting for vm power on, status: queued
22-05-19 19:03:17 | cluster_service_2_x_tkgm:2853 - wait_for_vm_power_on | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | DEBUG :: waiting for vm power on, status: running
22-05-19 19:03:22 | cluster_service_2_x_tkgm:2853 - wait_for_vm_power_on | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | DEBUG :: waiting for vm power on, status: running
22-05-19 19:03:27 | cluster_service_2_x_tkgm:2853 - wait_for_vm_power_on | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | DEBUG :: waiting for vm power on, status: running
22-05-19 19:03:32 | cluster_service_2_x_tkgm:2853 - wait_for_vm_power_on | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | DEBUG :: waiting for vm power on, status: running
22-05-19 19:03:37 | cluster_service_2_x_tkgm:2853 - wait_for_vm_power_on | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | DEBUG :: waiting for vm power on, status: success
22-05-19 19:03:37 | cluster_service_2_x_tkgm:2701 - _add_worker_nodes | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | DEBUG :: worker node-8baf to join cluster using:kubeadm join 10.251.251.2:6443 --token jzbm2g.6oktm11rv34omuiq --discovery-token-ca-cert-hash sha256:fad4a9775a2cbec0d5bf9590b3f2cfd7b030101839798453d501b4d023c75a81
22-05-19 19:03:38 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.networkconfiguration.status is None
22-05-19 19:03:44 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.networkconfiguration.status is None
22-05-19 19:03:49 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.networkconfiguration.status is None
22-05-19 19:03:54 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.networkconfiguration.status is None
22-05-19 19:03:59 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.networkconfiguration.status is None
22-05-19 19:04:05 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.networkconfiguration.status is successful
22-05-19 19:04:06 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.store.sshkey.status is successful
22-05-19 19:04:07 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.proxy.setting.status is None
22-05-19 19:04:12 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.proxy.setting.status is None
22-05-19 19:04:17 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.proxy.setting.status is successful
22-05-19 19:04:18 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.kubeadm.node.join.status is in_progress
22-05-19 19:04:23 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.kubeadm.node.join.status is in_progress
22-05-19 19:04:29 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.kubeadm.node.join.status is successful
22-05-19 19:04:29 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:04:35 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:04:40 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:04:45 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:04:50 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:04:56 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:05:01 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:05:06 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:05:11 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:05:17 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:05:22 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:05:27 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:05:32 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:05:38 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:05:43 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:05:48 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:05:53 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:05:59 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:06:04 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:06:09 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:06:15 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:06:20 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:06:25 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:06:30 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:06:36 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:06:41 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:06:46 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:06:51 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:06:57 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:07:02 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:07:07 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:07:12 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:07:19 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:07:25 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:07:31 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:07:38 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:07:44 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress
22-05-19 19:07:50 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: cf60ade4-3e27-41ac-8d67-d83729a62c03 | INFO :: Post guest customization phase guestinfo.postcustomization.core_packages.attempted_install is in_progress

@goelaashima
Collaborator

@cm3cmf

Can you work with VMware Global Support to file a case for this failure and share back the case number?

@mnspodrska

mnspodrska commented May 19, 2022

A week ago I also opened SR 22327432305 for this kind of problem, and I have already uploaded logs from successful and failed clusters created with rollbackOnFailure: false. My suspicion is that VM customization causes a reboot in the middle of the node.sh script or the cloud-init run, so node preparation never completes, but I haven't had a chance to investigate further.

Looking at /var/log/cloud-init-output.log on both VMs, the failed node has these lines containing the word "running":

[2022-05-09 17:31:05] Cloud-init v. 21.2-3-g899bfaa9-0ubuntu2~20.04.1 running 'init-local' at Mon, 09 May 2022 17:31:05 +0000. Up 2.72 seconds.
[2022-05-09 17:31:16] Cloud-init v. 21.2-3-g899bfaa9-0ubuntu2~20.04.1 running 'init' at Mon, 09 May 2022 17:31:16 +0000. Up 13.52 seconds.
[2022-05-09 17:31:17] Cloud-init v. 21.2-3-g899bfaa9-0ubuntu2~20.04.1 running 'modules:config' at Mon, 09 May 2022 17:31:17 +0000. Up 14.47 seconds.
[2022-05-09 17:31:23] Cloud-init v. 21.2-3-g899bfaa9-0ubuntu2~20.04.1 running 'init-local' at Mon, 09 May 2022 17:31:23 +0000. Up 2.02 seconds.
[2022-05-09 17:31:25] Cloud-init v. 21.2-3-g899bfaa9-0ubuntu2~20.04.1 running 'init' at Mon, 09 May 2022 17:31:25 +0000. Up 4.44 seconds.
[2022-05-09 17:31:26] Cloud-init v. 21.2-3-g899bfaa9-0ubuntu2~20.04.1 running 'modules:config' at Mon, 09 May 2022 17:31:26 +0000. Up 5.14 seconds.
[2022-05-09 17:31:26] Cloud-init v. 21.2-3-g899bfaa9-0ubuntu2~20.04.1 running 'modules:final' at Mon, 09 May 2022 17:31:26 +0000. Up 5.51 seconds.

and there was a reboot between 17:31:17 and 17:31:23.
The successful VM has these messages:

[2022-05-09 17:22:55] Cloud-init v. 21.2-3-g899bfaa9-0ubuntu2~20.04.1 running 'init-local' at Mon, 09 May 2022 17:22:55 +0000. Up 2.54 seconds.
[2022-05-09 17:23:30] Cloud-init v. 21.2-3-g899bfaa9-0ubuntu2~20.04.1 running 'init-local' at Mon, 09 May 2022 17:23:30 +0000. Up 2.02 seconds.
[2022-05-09 17:23:32] Cloud-init v. 21.2-3-g899bfaa9-0ubuntu2~20.04.1 running 'init' at Mon, 09 May 2022 17:23:32 +0000. Up 4.34 seconds.
[2022-05-09 17:23:33] Cloud-init v. 21.2-3-g899bfaa9-0ubuntu2~20.04.1 running 'modules:config' at Mon, 09 May 2022 17:23:33 +0000. Up 5.44 seconds.
[2022-05-09 17:24:41] Cloud-init v. 21.2-3-g899bfaa9-0ubuntu2~20.04.1 running 'modules:final' at Mon, 09 May 2022 17:23:34 +0000. Up 6.12 seconds.

and there was a reboot between 17:22:55 and 17:23:30.

The last status update is that the CSE team will start investigating it this week.

@goelaashima If you need more logs, maybe I can help, since I still have those failed and successful nodes.

@adestis-mf

@goelaashima since similar SRs are already open, it makes no sense to raise a new one.
@mnspodrska thanks for your reply!

@goelaashima
Collaborator

@mnspodrska I am following up on SR 22327432305. Thanks.
@cm3cmf The issue you are seeing may or may not be the same as other SRs. It would be best to file a case.

@jleavers

@lzichong Looking at the docs here: https://vmware.github.io/container-service-extension/cse3_0/CLUSTER_MANAGEMENT.html

But I am getting an error; any suggestions?

vcd cse version
CSE, Container Service Extension for VMware vCloud Director, version 3.1.2

vcd cse cluster apply -s -t -o file.yaml

vcd cse cluster apply /opt/cse/file.yaml
Usage: vcd cse cluster apply [OPTIONS] CLUSTER_CONFIG_FILE_PATH
Try 'vcd cse cluster apply -h' for help.

Error: '2.1.0'

vcd cse cluster apply file.yaml
Usage: vcd cse cluster apply [OPTIONS] CLUSTER_CONFIG_FILE_PATH
Try 'vcd cse cluster apply -h' for help.

Error: '2.1.0'

@jleavers

@lzichong

Additional testing:

  1. Default sizing policy + original SSH key: fail:
==> /opt/cse/.cse-logs/cse-server-debug.log <==
22-05-20 12:24:54 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: ee41ac2f-9a10-4947-b8ae-866e40e3bd0d | INFO :: Post guest customization phase guestinfo.postcustomization.store.sshkey.status is in_progress

==> /opt/cse/.cse-logs/cse-server-info.log <==
22-05-20 12:24:54 | Request Id: ee41ac2f-9a10-4947-b8ae-866e40e3bd0d | ERROR :: VM Post guest customization failed due to timeout(600 sec)
  2. No custom sizing policy (manually entered CPU and RAM instead) + original SSH key: fail

22-05-20 19:35:38 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: c99da0eb-daf8-4e14-a061-ad65bbb8d54e | INFO :: Post guest customization phase guestinfo.postcustomization.store.sshkey.status is None

  3. Default sizing policy + no SSH key: OK

  4. Custom sizing policy + no SSH key: OK
    -> Try to resize this cluster: fail

==> /opt/cse/.cse-logs/cse-server-debug.log <==
22-05-20 20:34:19 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: c99da0eb-daf8-4e14-a061-ad65bbb8d54e | INFO :: Post guest customization phase guestinfo.postcustomization.store.sshkey.status is None

==> /opt/cse/.cse-logs/cse-server-info.log <==
22-05-20 20:34:19 | Request Id: c99da0eb-daf8-4e14-a061-ad65bbb8d54e | ERROR :: VM Post guest customization failed due to timeout(600 sec)

  5. No custom sizing policy (manually entered CPU and RAM instead) + no SSH key: OK
    -> Try to resize this cluster: OK

It does not seem to be related to the sizing policy, and while there may be an issue with one SSH key, that does not seem to be the whole problem, because resizing the cluster to add more nodes can still fail: here it failed at the SSH key step even though no key was configured.

It is very random.

@lzichong


Hi @jleavers, could you check/post the CSE client log and list any errors/exceptions thrown there? This error may be caused by some kind of 3.1.3/3.1.2 version mismatch between the client and the server, but it is hard for us to tell. Following up on your last comment, it seems like we need to dig deeper and go into the VM to see what happens during the script execution. Were you able to bypass the 2.1.0 error and deploy clusters using the CLI?

@jleavers

Hi @lzichong - I have not been able to bypass the 2.1.0 error as yet unfortunately. There is not a huge amount of detail in the log.

vCD version: 10.3.2.19473806

Commands:

vcd version
vcd-cli, VMware vCloud Director Command Line Interface, 24.1.0

vcd cse version
CSE, Container Service Extension for VMware vCloud Director, version 3.1.2

vcd cse cluster apply file.yaml
Usage: vcd cse cluster apply [OPTIONS] CLUSTER_CONFIG_FILE_PATH
Try 'vcd cse cluster apply -h' for help.

Error: '2.1.0'

Logs:

==> /opt/cse/.cse-logs/cse-client-debug.log <==
22-05-21 07:23:04 | utils:154 - _override_client | DEBUG :: Server api versions : {'35.0', '36.0'}, Client api versions : {'35.0', '34.0', '36.0', '33.0'}, Server in Legacy mode : False, Selected api version : 36.0.
22-05-21 07:23:04 | cse:41 - version | DEBUG :: Executing command: vcd cse version
22-05-21 07:23:04 | cse:47 - version | DEBUG :: CSE, Container Service Extension for VMware vCloud Director, version 3.1.2
22-05-21 07:23:42 | utils:61 - cse_restore_session | DEBUG :: Restoring client from profile.
22-05-21 07:23:42 | cluster_commands:615 - apply | DEBUG :: Executing command: vcd cse cluster apply

==> /opt/cse/.cse-logs/cse-server-info.log <==
22-05-21 07:23:42 | ERROR :: '2.1.0'

==> /opt/cse/.cse-logs/cse-server-debug.log <==
22-05-21 07:23:42 | entity_service:61 - exception_handler_wrapper | ERROR :: '2.1.0'

@lzichong


Hi @jleavers, can you make a copy of the current server and client logs and attach them to the SR? Were you using CSE 3.1.3 prior to 3.1.2? If so, can you try using CSE 3.1.3 and updating the apiVersion in the yaml file to v2.1?
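
For example (the exact apiVersion string is an assumption based on the CSE docs; regenerating the sample spec with the 3.1.3 client shows the value it expects):

# regenerate a sample spec with the 3.1.3 client and compare the apiVersion lines
vcd cse cluster apply -s -t -o sample.yaml
grep apiVersion sample.yaml file.yaml
# expected to be something like: apiVersion: cse.vmware.com/v2.1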

@jleavers

Hi @lzichong - I did initially try CSE 3.1.3, but I removed it and reinstalled 3.1.2 because the interop matrix indicates that only vCD 10.3.3 is supported and I am on vCD 10.3.2.1, which can't be upgraded since 10.3.3 is not yet supported by the NSX-T migration tool.

I will upload some logs.

@goelaashima
Collaborator

@jleavers Apologies for the VCD/CSE interop matrix; it wasn't showing all the possible supported version combinations. We have fixed it on our end. CSE 3.1.3 supports VCD 10.3.2.

@jleavers

Hi @goelaashima - thanks for the confirmation. I updated to CSE 3.1.3, but was seeing some strange errors, so I uninstalled, amended the venv to Python 3.8, and reinstalled. I am now back to the original problem of buggy behaviour on cluster creation or resize :-)

Current status:

  • I can trigger the build via the UI or via the CLI (in order to set rollbackOnFailure: false)
  • I am building clusters with 1x worker, and resizing to 2x workers if the initial creation is successful

Attempt 1: cluster build successful. Resize failed:

==> /opt/cse/.cse-logs/cse-server-debug.log <==
22-05-25 09:23:16 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: a15e2caf-ef4c-4ffa-8a57-3c39f7654c5b | INFO :: Post guest customization phase guestinfo.postcustomization.proxy.setting.status is None

==> /opt/cse/.cse-logs/cse-server-info.log <==
22-05-25 09:23:16 | Request Id: a15e2caf-ef4c-4ffa-8a57-3c39f7654c5b | ERROR :: VM Post guest customization failed due to timeout(600 sec)

Attempt 2: cluster build failed adding worker node:

==> /opt/cse/.cse-logs/cse-server-debug.log <==
22-05-25 18:23:19 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: 17d077a8-d8d9-4069-9ba4-0d40657f7f4a | INFO :: Post guest customization phase guestinfo.postcustomization.proxy.setting.status is None

==> /opt/cse/.cse-logs/cse-server-info.log <==
22-05-25 18:23:19 | Request Id: 17d077a8-d8d9-4069-9ba4-0d40657f7f4a | ERROR :: VM Post guest customization failed due to timeout(600 sec)

As the SSH key had been added, I could SSH in and run /root/node.sh; however, this failed at a later stage:

+ kubectl wait --for=condition=Ready pod/kapp-controller-76767677c9-tp5l7 -n kapp-controller --timeout=10m
error: timed out waiting for the condition on pods/kapp-controller-76767677c9-tp5l7

Attempt 3: cluster build failed adding worker node:

==> /opt/cse/.cse-logs/cse-server-debug.log <==
22-05-25 18:50:23 | pyvcloud_utils:761 - wait_for_completion_of_post_customization_procedure | Request Id: 38762d31-7d8d-48b4-84fe-be8706dfc6d9 | INFO :: Post guest customization phase guestinfo.postcustomization.networkconfiguration.status is successful

==> /opt/cse/.cse-logs/cse-server-info.log <==
22-05-25 18:50:23 | Request Id: 38762d31-7d8d-48b4-84fe-be8706dfc6d9 | INFO :: Post guest customization phase guestinfo.postcustomization.store.sshkey.status is None

With this last attempt, I looked at the VM's console, and it said [2022-05-25 18:50:27] The system is ready after 5.42 seconds - so perhaps it got stuck because the VM hadn't finished booting when it started the next step of adding the SSH key at 18:50:23.

@jleavers

Deletions are now failing with CSE 3.1.3 - they get stuck on DELETE:IN_PROGRESS. I have tried force-deleting with vcd cse cluster delete cluster_name --force, but get an error: Missing role-rights, ACL: To force delete cluster, Full Access ACL for entity cse:nativeCluster:2.0.0 must be granted to either the user or the org.

@jramsgaard

@jleavers I also see stuck clusters that can't be deleted, running CSE 3.1.3. I tried running vcd cse cluster delete <cluster> --force with a newly created local VCD user that has the System Administrator role and still hit the same issue, so rights should not be the problem?

In cse-server-debug.log it says: pyvcloud_utils:824 - get_missing_rights_for_cluster_force_delete | Request Id: 527733fb-4752-4acf-a60c-38f58019954b | DEBUG :: missing rights of role:System Administrator--set()

@rvvliet78
Author

Has a solution or workaround been found for the issues described in the opening post?

I also have a bunch of clusters I can't delete, with the same errors as described above.

@jleavers

jleavers commented Jun 21, 2022

After some screen-shares on the VMware SR, we found that the VMs were rebooting and the last command that ran prior to the reboot was /var/lib/cloud/instance/scripts/runcmd. This contained the following:

#!/bin/sh
[ ! -f /root/kubeadm-defaults.conf ] && cloud-init clean && sudo reboot
[ ! -f /root/vcloud-basic-auth.yaml ] && cloud-init clean && sudo reboot
[ ! -f /root/control_plane.sh ] && cloud-init clean && sudo reboot
bash /root/control_plane.sh

The files it checks for did seem to exist, but they may not have existed yet at the moment the script ran. As a temporary workaround we removed these checks by editing the cloud-init files. I am running CSE in a Python 3.8 venv, so for me the paths are as below, although they may vary on other setups.

  • venv/lib/python3.8/site-packages/cluster_scripts/v2_x_tkgm/cloud_init_control_plane.yaml
  • venv/lib/python3.8/site-packages/cluster_scripts/v2_x_tkgm/cloud_init_node.yaml

If you scroll to the end you will see the runcmd: section, and there you can delete the lines that contain cloud-init clean && sudo reboot. After that, restart the CSE service.
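
After that edit, the runcmd script that cloud-init writes reduces to just the bootstrap call (a sketch of the control plane variant; cloud_init_node.yaml calls /root/node.sh instead):

#!/bin/sh
bash /root/control_plane.sh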

I am seeing fewer cluster deployment failures after this change. Some still get stuck, but if I log into the node and run /var/lib/cloud/instance/scripts/runcmd it will generally succeed.

For the deletions, some extra permissions had to be granted to the vCD tenant user via API, using the CSE service account credentials - this was a POST request to https://vcd-url/cloudapi/1.0.0/entityTypes/urn:vcloud:type:cse:nativeCluster:2.1.0/accessControls/ with the following payload:

{
    "grantType": "MembershipAccessControlGrant",
    "accessLevelId": "urn:vcloud:accessLevel:FullControl",
    "memberId": "urn:vcloud:user:uuid"
}

You can get the user's UUID via the vCD UI by clicking on the username in Administration -> Access Control -> Users - it will be in the URL.

NB: as shown, this API call was for nativeCluster version 2.1.0. This may differ in other environments, so you would need to check the entity types with a GET call to https://vcd-url/cloudapi/1.0.0/entityTypes/ as well as the version of each deployed cluster with a GET call to https://vcd-url/cloudapi/1.0.0/entities/interfaces/vmware/k8s/1.0.0.
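
A hedged sketch of those calls with curl (the bearer token, user UUID, and entity type version are placeholders; adjust them to what your environment returns):

# list the registered entity types to find the right nativeCluster version
curl -sk -H "Authorization: Bearer $VCD_API_TOKEN" \
  "https://vcd-url/cloudapi/1.0.0/entityTypes/"
# grant the tenant user Full Control on that entity type
curl -sk -X POST \
  -H "Authorization: Bearer $VCD_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"grantType": "MembershipAccessControlGrant", "accessLevelId": "urn:vcloud:accessLevel:FullControl", "memberId": "urn:vcloud:user:uuid"}' \
  "https://vcd-url/cloudapi/1.0.0/entityTypes/urn:vcloud:type:cse:nativeCluster:2.1.0/accessControls/"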

@mnspodrska

I got similar advice to remove those sudo reboot commands from cloud_init_control_plane.yaml and cloud_init_node.yaml, but it seems to me that the reboot is not caused by the cloud-init runcmd but by the VMware IMC guest customization, since /var/log/vmware-imc/toolsDeployPkg.log contains these lines:

[2022-05-30T19:40:08.292Z] [ info] sSkipReboot: 'false', forceSkipReboot 'false'.

[2022-05-30T19:40:08.292Z] [ debug] Ran DeployPkg_DeployPackageFromFile successfully
[2022-05-30T19:40:08.292Z] [ debug] ## Closing log
[2022-05-30T19:40:08.292Z] [ info] Trigger reboot.

My impression is that cloud-init is somehow triggered before the VMware IMC customization is done, so it gets interrupted when IMC reboots the VM after the network configuration.

Our idea is to override the cloud-init-local systemd service so that it waits for the IMC customization to finish, with something like:
ExecStartPre=/bin/bash -c "sleep 1; while [ -e /tmp/.vmware-deploy.INPROGRESS ]; do sleep 1; done"
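
A minimal sketch of how that override could be applied inside the template (the unit name and drop-in path follow the standard systemd convention; untested):

mkdir -p /etc/systemd/system/cloud-init-local.service.d
cat > /etc/systemd/system/cloud-init-local.service.d/10-wait-for-imc.conf <<'EOF'
[Service]
ExecStartPre=/bin/bash -c "sleep 1; while [ -e /tmp/.vmware-deploy.INPROGRESS ]; do sleep 1; done"
EOF
systemctl daemon-reload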

Will report what we find...

@jleavers

jleavers commented Jun 29, 2022

I see a sleep statement being added here: [VCDA-3786] Fix for ubuntu 20 cluster creation
