Deploying EKS control plane to mgmt VPC or app VPC #106
-
While following the guide for EKS deployment, it was initially unclear whether the control plane was deployed to the app VPC or the management VPC, because the management VPC peering had just been set up. The management VPC is not really used in the guide, which I think contributed to my confusion, but I've since placed a bastion there. The docs are also missing a step for establishing peering prior to adding the DNS resolver: establishing peering is omitted entirely, and without it the DNS resolver cannot be created because the security groups are in different VPCs.
I had an issue initially where the control plane Terraform never seemed to finish. I needed to install …
The issue here is that the guide says to make the API endpoint private, but public access is required by the templates to terraform the cluster. I have completed the guide, but our nodes are not registered with the cluster. We're a little disappointed by how much the documentation and guides have diverged from the recent modules; while we've been able to figure things out, it's taken a significant time investment to get everything working properly. The above are just some of the issues we've run into. It'd be very helpful to keep the docs and guides in sync with the modules.
Do we need to provision additional IAM roles and set the mapping in the cluster in order for the nodes to be registered? Do we need to run some script? Did the registration script that is invoked in the terraform-aws-eks …
-
Hello, apologies for the frustration and challenges with using the guide. We are aware of how out of date the guide is and are intending on overhauling both the guide contents and process to ensure that they stay up to date.
Regarding the issues with node registration, you should not need to do anything beyond making sure the worker ASG IAM roles are included in the eks_worker_iam_role_arns attribute for the call to the eks-k8s-role-mapping module. I suspect there were some issues with the IAM role mapping creation when you ran into issues with the private API endpoint setup. I would check the following things to troubleshoot this issue:
- Introspect the aws-auth ConfigMap to make sure it has the worker IAM role in the configuration. You can use kubectl to retrieve the ConfigMap directly from the cluster: kubectl describe configmap aws-auth -n kube-system.
- If the ConfigMap is correct, then SSH into the running nodes and introspect the kubelet logs for more info. You should be able to find the error logs in either syslog or /var/log/messages (e.g., try running sudo tail /var/log/messages | grep kubelet). This should give you some insights into what might be causing the issue.
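For reference, a minimal sketch of what the eks-k8s-role-mapping call mentioned above might look like; the source ref, role ARNs, and resource names here are placeholders rather than values from your environment, so double check the input names against the version of terraform-aws-eks you are using:

# Minimal sketch only: the ref, ARNs, and wiring below are placeholders.
# The kubernetes provider must already be pointed at your EKS cluster, since
# this module manages the aws-auth ConfigMap inside it.
module "eks_k8s_role_mapping" {
  source = "git::git@github.com:gruntwork-io/terraform-aws-eks.git//modules/eks-k8s-role-mapping?ref=v0.50.0"

  # The worker ASG IAM roles MUST be listed here, or the kubelets on those
  # nodes will never be authorized to register with the control plane.
  eks_worker_iam_role_arns = [
    "arn:aws:iam::444444444444:role/eks-stage-workers",
  ]

  # Map human IAM roles to Kubernetes RBAC groups in the same ConfigMap.
  iam_role_to_rbac_group_mappings = {
    "arn:aws:iam::444444444444:role/allow-full-access-from-other-accounts" = ["system:masters"]
  }
}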
If you still have issues with deploying using the guide, you can try provisioning the cluster using an alternative approach. A recommended alternative to the guide is using our Service Catalog module (https://github.com/gruntwork-io/terraform-aws-service-catalog/tree/master/modules/services/eks-cluster). The Service Catalog module has less configuration freedom, as you are relying on prebuilt infrastructure-modules modules, but it may work better as a starting point.
You can deploy using the Service Catalog by doing the following:
1. Build the AMI using the provided packer template (https://github.com/gruntwork-io/terraform-aws-service-catalog/blob/master/modules/services/eks-workers/eks-node-al2.pkr.hcl). To do so, git clone the service catalog repo and run cd modules/services/eks-workers && packer build -var="version_tag=v0.68.7" -var="service_catalog_ref=v0.68.7" -var="aws_region=YOUR_AWS_REGION" eks-node-al2.pkr.hcl. Note that you may want to pass in additional -var inputs depending on your needs.
2. Use the following updated terragrunt config, with all the <> variables updated to the real values for your environment:

terraform {
  source = "git::git@github.com:gruntwork-io/terraform-aws-service-catalog.git//modules/services/eks-cluster?ref=v0.68.7"
}

include {
  path = find_in_parent_folders()
}

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "<YOUR_AWS_REGION>"
}
EOF
}

inputs = {
  cluster_name                  = "eks-stage"
  cluster_instance_keypair_name = "stage-services-us-east-1-v1"

  vpc_id                       = "<APP_VPC_ID>"
  control_plane_vpc_subnet_ids = ["<LIST_OF_PRIVATE_APP_SUBNET_IDS>"]

  allow_inbound_api_access_from_cidr_blocks = ["0.0.0.0/0"]
  allow_private_api_access_from_cidr_blocks = [
    "<CIDR_BLOCK_OF_APP_VPC>",
    "<CIDR_BLOCK_OF_MGMT_VPC>",
  ]
  endpoint_public_access = true # Set to false for private API

  # Fill in the ID of the AMI you built from your Packer template
  cluster_instance_ami = "<AMI_ID>"

  # Set the max size to double the min size so the extra capacity can be used to do a zero-downtime deployment of updates
  # to the EKS Cluster Nodes (e.g. when you update the AMI). For docs on how to roll out updates to the cluster, see:
  # https://github.com/gruntwork-io/terraform-aws-eks/tree/master/modules/eks-cluster-workers#how-do-i-roll-out-an-update-to-the-instances
  autoscaling_group_configurations = {
    asg = {
      min_size          = 3
      max_size          = 6
      asg_instance_type = "t2.small"
      subnet_ids        = ["<LIST_OF_PRIVATE_APP_SUBNET_IDS>"]
    }
  }

  # If your IAM users are defined in a separate AWS account (e.g., in a security account), pass in the ARN of an IAM
  # role in that account that ssh-grunt on the worker nodes can assume to look up IAM group membership and public SSH
  # keys
  external_account_ssh_grunt_role_arn = "arn:aws:iam::1111222233333:role/allow-ssh-grunt-access-from-other-accounts"

  # Configure your role mappings
  iam_role_to_rbac_group_mappings = {
    # Give anyone using the full-access IAM role admin permissions
    "arn:aws:iam::444444444444:role/allow-full-access-from-other-accounts" = ["system:masters"]
    # Give anyone using the developers IAM role developer permissions. Kubernetes will automatically create this group
    # if it doesn't exist already, but you're still responsible for binding permissions to it!
    "arn:aws:iam::444444444444:role/allow-dev-access-from-other-accounts" = ["developers"]
  }
}
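One assumption baked into the config above: the include block expects a parent terragrunt.hcl higher up in your live repo that configures remote state. If you are deploying inside an existing terragrunt live repo you likely already have one; otherwise, a minimal sketch of such a root file (bucket, lock table, and region are placeholders) could look like this:

remote_state {
  backend = "s3"

  # Generate a backend.tf in each child module so Terraform picks up the S3 backend.
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }

  config = {
    bucket         = "<YOUR_TERRAFORM_STATE_BUCKET>"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "<YOUR_AWS_REGION>"
    encrypt        = true
    dynamodb_table = "<YOUR_TERRAFORM_LOCK_TABLE>"
  }
}

With that in place, find_in_parent_folders() in the child config resolves to this file, and you can run terragrunt plan and terragrunt apply from the eks-cluster folder as usual.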
Note that, like the guide, you will want to deploy with endpoint_public_access = true first, and then switch to endpoint_public_access = false, due to the network access issues you ran into. Alternatively, you can deploy through a VPN connection that allows you to VPN into the mgmt VPC.
Side note: I believe you can deploy the VPC without the DNS resolvers now. This used to be a requirement for accessing the Kubernetes API endpoint on EKS clusters with private access over a VPC peer, but as far as I know, AWS has since updated the networking infrastructure to no longer need it. The only reason I mention it is because the DNS resolvers can add up to be quite pricey (approximately $500/month), so you may want to consider removing them if you are tight on budget.
Alternatively, you can consider omitting the mgmt VPC altogether and deploying the bastion/VPN server into the app VPC in the public network space. The mgmt VPC architecture is most useful/recommended if you intend on having more than one VPC for your applications. Otherwise, it can be unnecessary overhead. It is fairly straightforward to introduce one after the fact as well, so you may want to consider a single VPC architecture if you don't have the networking needs.
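To make the two-phase rollout concrete, the only change in phase two is flipping the public endpoint off once the cluster (and the IAM role mapping) has applied successfully; a sketch of just the relevant inputs, using the same placeholder CIDRs as in the config above:

inputs = {
  # ... all other inputs stay exactly as in the config above ...

  # Phase 1: endpoint_public_access = true so terragrunt can reach the Kubernetes
  # API from your workstation while the cluster and role mapping are being created.
  # Phase 2: flip to false once everything is applied, leaving only private access.
  endpoint_public_access = false

  # Private access still needs to be allowed from the networks you operate from,
  # e.g. the app VPC and the mgmt VPC (via the peering connection or a VPN).
  allow_private_api_access_from_cidr_blocks = [
    "<CIDR_BLOCK_OF_APP_VPC>",
    "<CIDR_BLOCK_OF_MGMT_VPC>",
  ]
}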
-
Thank you very much for the detailed response. I will try reprovisioning the cluster using the Service Catalog. In the meantime, can you offer help with one other critical issue we face? Happy to hop on a call if it's easier (I'm in the EST timezone, and I'm flexible with times).
ssh-grunt is not running properly on any server we're running it on. I have a brief window of about 90 seconds on each box to connect as root before even root can't connect. What's the status of this? I saw some GitHub issues advocating that customers switch to ec2-instance-connect. Can you help us understand how to modify the AMIs built by Packer for this, and how to proceed generally? Short of disabling ssh-grunt on the boxes, we're effectively locked out until this is resolved, and debugging on the client end with 90-second windows before needing to reprovision the infra is just not a great experience.
-
Can you elaborate on what you mean when you say "I have a brief window of about 90 seconds on each box to connect as root before even root can't connect"? Do you mean accessing the default user (e.g., …