Deploying EKS control plane to mgmt VPC or app VPC #106
-
While following the guide for EKS deployment, it was initially unclear whether the control plane was deployed to the app VPC or the management VPC, because the management VPC peering had just been set up. The management VPC is not really used in the guide, which I think contributed to my confusion, but I've since placed a bastion there. The docs are also missing a step for establishing peering prior to adding the DNS resolver: establishing peering is omitted entirely, and without it the DNS resolver cannot be created because the security groups are in different VPCs.
I had an issue initially where the control plane Terraform never seemed to finish. I needed to install …
The issue here is that the guide says to make the API endpoint private, but public access is required by the templates to terraform the cluster. I have completed the guide, but our nodes are not registered with the cluster. We're a little disappointed by how much the documentation and guides have diverged from the recent modules; while we've been able to figure things out, it's taken a significant time investment to get everything working properly. The above are just some of the issues we've run into. It'd be very helpful to keep the docs and guides in sync with the modules.
Do we need to provision additional IAM roles and set the mapping in the cluster in order for the nodes to be registered? Do we need to run some script? Did the registration script that is invoked in the terraform-aws-eks …
-
Hello, apologies for the frustration and challenges with using the guide. We are aware of how out of date the guide is and are intending on overhauling both the guide contents and process to ensure that they stay up to date.
Regarding the issues with node registration, you should not need to do anything beyond making sure the worker ASG IAM roles are included in the eks_worker_iam_role_arns attribute for the call to the eks-k8s-role-mapping module. I suspect there were some issues with the IAM role mapping creation when you ran into issues with the private API endpoint setup. I would check the following things to troubleshoot this issue:
- Introspect the aws-auth ConfigMap to make sure it has the worker IAM role in the configuration. You can use kubectl to retrieve the ConfigMap directly from the cluster: kubectl describe configmap aws-auth -n kube-system.
- If the ConfigMap is correct, then SSH into the running nodes and introspect the kubelet logs for more info. You should be able to find the error logs in either syslog or /var/log/messages (e.g., try running sudo tail /var/log/messages | grep kubelet). This should give you some insights into what might be causing the issue.
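For reference, a minimal sketch of what the eks-k8s-role-mapping call mentioned above might look like; the source ref, role ARNs, and resource names here are placeholders rather than values from your environment, so double check the input names against the version of terraform-aws-eks you are using:

# Minimal sketch only: the ref, ARNs, and wiring below are placeholders.
# The kubernetes provider must already be pointed at your EKS cluster, since
# this module manages the aws-auth ConfigMap inside it.
module "eks_k8s_role_mapping" {
  source = "git::git@github.com:gruntwork-io/terraform-aws-eks.git//modules/eks-k8s-role-mapping?ref=v0.50.0"

  # The worker ASG IAM roles MUST be listed here, or the kubelets on those
  # nodes will never be authorized to register with the control plane.
  eks_worker_iam_role_arns = [
    "arn:aws:iam::444444444444:role/eks-stage-workers",
  ]

  # Map human IAM roles to Kubernetes RBAC groups in the same ConfigMap.
  iam_role_to_rbac_group_mappings = {
    "arn:aws:iam::444444444444:role/allow-full-access-from-other-accounts" = ["system:masters"]
  }
}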
If you still have issues with deploying using the guide, you can try provisioning the cluster using an alternative approach. A recommended alternative to the guide is using our Service Catalog module (https://github.com/gruntwork-io/terraform-aws-service-catalog/tree/master/modules/services/eks-cluster). The Service Catalog module has less configuration freedom, as you are relying on prebuilt infrastructure-modules modules, but it may work better as a starting point.
You can deploy using the Service Catalog by doing the following:
1. Build the AMI using the provided packer template (https://github.com/gruntwork-io/terraform-aws-service-catalog/blob/master/modules/services/eks-workers/eks-node-al2.pkr.hcl). To do so, git clone the service catalog repo and run cd modules/services/eks-workers && packer build -var="version_tag=v0.68.7" -var="service_catalog_ref=v0.68.7" -var="aws_region=YOUR_AWS_REGION" eks-node-al2.pkr.hcl. Note that you may want to pass in additional -var inputs depending on your needs.
2. Use the following updated terragrunt config, with all the <> variables updated to the real values for your environment:

terraform {
  source = "git::git@github.com:gruntwork-io/terraform-aws-service-catalog.git//modules/services/eks-cluster?ref=v0.68.7"
}

include {
  path = find_in_parent_folders()
}

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "<YOUR_AWS_REGION>"
}
EOF
}

inputs = {
  cluster_name                  = "eks-stage"
  cluster_instance_keypair_name = "stage-services-us-east-1-v1"

  vpc_id                       = "<APP_VPC_ID>"
  control_plane_vpc_subnet_ids = ["<LIST_OF_PRIVATE_APP_SUBNET_IDS>"]

  allow_inbound_api_access_from_cidr_blocks = ["0.0.0.0/0"]
  allow_private_api_access_from_cidr_blocks = [
    "<CIDR_BLOCK_OF_APP_VPC>",
    "<CIDR_BLOCK_OF_MGMT_VPC>",
  ]
  endpoint_public_access = true # Set to false for private API

  # Fill in the ID of the AMI you built from your Packer template
  cluster_instance_ami = "<AMI_ID>"

  # Set the max size to double the min size so the extra capacity can be used to do a zero-downtime deployment of updates
  # to the EKS Cluster Nodes (e.g. when you update the AMI). For docs on how to roll out updates to the cluster, see:
  # https://github.com/gruntwork-io/terraform-aws-eks/tree/master/modules/eks-cluster-workers#how-do-i-roll-out-an-update-to-the-instances
  autoscaling_group_configurations = {
    asg = {
      min_size          = 3
      max_size          = 6
      asg_instance_type = "t2.small"
      subnet_ids        = ["<LIST_OF_PRIVATE_APP_SUBNET_IDS>"]
    }
  }

  # If your IAM users are defined in a separate AWS account (e.g., in a security account), pass in the ARN of an IAM
  # role in that account that ssh-grunt on the worker nodes can assume to look up IAM group membership and public SSH
  # keys
  external_account_ssh_grunt_role_arn = "arn:aws:iam::1111222233333:role/allow-ssh-grunt-access-from-other-accounts"

  # Configure your role mappings
  iam_role_to_rbac_group_mappings = {
    # Give anyone using the full-access IAM role admin permissions
    "arn:aws:iam::444444444444:role/allow-full-access-from-other-accounts" = ["system:masters"]
    # Give anyone using the developers IAM role developer permissions. Kubernetes will automatically create this group
    # if it doesn't exist already, but you're still responsible for binding permissions to it!
    "arn:aws:iam::444444444444:role/allow-dev-access-from-other-accounts" = ["developers"]
  }
}
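One assumption baked into the config above: the include block expects a parent terragrunt.hcl higher up in your live repo that configures remote state. If you are deploying inside an existing terragrunt live repo you likely already have one; otherwise, a minimal sketch of such a root file (bucket, lock table, and region are placeholders) could look like this:

remote_state {
  backend = "s3"

  # Generate a backend.tf in each child module so Terraform picks up the S3 backend.
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }

  config = {
    bucket         = "<YOUR_TERRAFORM_STATE_BUCKET>"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "<YOUR_AWS_REGION>"
    encrypt        = true
    dynamodb_table = "<YOUR_TERRAFORM_LOCK_TABLE>"
  }
}

With that in place, find_in_parent_folders() in the child config resolves to this file, and you can run terragrunt plan and terragrunt apply from the eks-cluster folder as usual.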
Note that, like the guide, you will want to deploy with endpoint_public_access = true first, and then switch to endpoint_public_access = false, due to the network access issues you ran into. Alternatively, you can deploy through a VPN connection that allows you to VPN into the mgmt VPC.
Side note: I believe you can deploy the VPC without the DNS resolvers now. This used to be a requirement for accessing the Kubernetes API endpoint on EKS clusters with private access over a VPC peer, but as far as I know, AWS has since updated the networking infrastructure to no longer need it. The only reason I mention it is because the DNS resolvers can add up to be quite pricey (approximately $500/month), so you may want to consider removing them if you are tight on budget.
Alternatively, you can consider omitting the mgmt VPC altogether and deploying the bastion/VPN server into the app VPC in the public network space. The mgmt VPC architecture is most useful/recommended if you intend on having more than one VPC for your applications. Otherwise, it can be unnecessary overhead. It is fairly straightforward to introduce one after the fact as well, so you may want to consider a single VPC architecture if you don't have the networking needs.
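To make the two-phase rollout concrete, the only change in phase two is flipping the public endpoint off once the cluster (and the IAM role mapping) has applied successfully; a sketch of just the relevant inputs, using the same placeholder CIDRs as in the config above:

inputs = {
  # ... all other inputs stay exactly as in the config above ...

  # Phase 1: endpoint_public_access = true so terragrunt can reach the Kubernetes
  # API from your workstation while the cluster and role mapping are being created.
  # Phase 2: flip to false once everything is applied, leaving only private access.
  endpoint_public_access = false

  # Private access still needs to be allowed from the networks you operate from,
  # e.g. the app VPC and the mgmt VPC (via the peering connection or a VPN).
  allow_private_api_access_from_cidr_blocks = [
    "<CIDR_BLOCK_OF_APP_VPC>",
    "<CIDR_BLOCK_OF_MGMT_VPC>",
  ]
}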
-
Thank you very much for the detailed response. I will try reprovisioning the cluster using the Service Catalog. In the meantime, can you offer help with one other critical issue we face? Happy to hop on a call if it's easier (I'm in the EST timezone, and I'm flexible with times).
ssh-grunt is not running properly on any server we're running it on. I have a brief window of about 90 seconds on each box to connect as root before even root can't connect. What's the status of this? I saw some GitHub issues advocating that customers switch to ec2-instance-connect. Can you help us understand how to modify the AMIs built by Packer for this, and how to proceed generally? Short of disabling ssh-grunt on the boxes, we're effectively locked out until this is resolved, and debugging on the client end with 90-second windows before needing to reprovision the infra is just not a great experience.
-
Can you elaborate on what you mean when you say "I have a brief window of about 90 seconds on each box to connect as root before even root can't connect"? Do you mean accessing the default user (e.g., …