calico vpp failed to install on k8s cluster with dpdk as vpp_uplink_driver #465
Comments
Hey guys, any workaround suggestions or fixes for this issue?
Hi @dylan2intel, could you try it on Ubuntu 20.04?
Currently I can work around this issue with a static IP configured as below. However, if DHCP is enabled, it always fails. Why? My internal network is under the control of a physical switch with IP/MAC address mapping. cat /etc/netplan/00-installer-config.yaml
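The actual contents of the netplan file aren't shown above; purely for illustration, a static configuration of this kind looks roughly like the sketch below (interface name, addresses, gateway and DNS are placeholders, not the real values from this cluster):

```yaml
# Illustrative sketch only -- placeholder interface name and addresses.
network:
  version: 2
  renderer: networkd
  ethernets:
    ens817f0:
      dhcp4: false                      # static addressing instead of DHCP
      addresses: [192.168.126.10/23]
      gateway4: 192.168.126.254
      nameservers:
        addresses: [192.168.126.1]
```

Applied with `sudo netplan apply`.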
I think this workaround is a huge limitation; the root cause still needs to be addressed.
Another new finding: once I deploy successfully using the above workaround and then deploy a pod whose service is exposed via an external IP, the whole Calico VPP k8s cluster fails as soon as the external IP is set to the node IP address (kubectl get node -o wide). However, if I specify another IP, or expose the pod service without an external IP, it works fine.
Enabling DHCP by itself should not cause the deploy failure. For example, in cloud environments DHCP is enabled and instances/nodes obtain their DNS/IP addresses etc. using DHCP. However, the point to note is that the IP address remains the same for the lifetime of the instance/node. How is DHCP configured in your case? Are the IP addresses handed out by the DHCP server to the BM node/system in question dynamic or static? What's the lease period? What is the network management backend and how does it handle DHCP leases? Could you check the logs to see anything peculiar wrt DHCP?
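For example, something like the following would show how the lease is being handled (a sketch assuming Ubuntu with netplan rendering to systemd-networkd; the interface name is a placeholder, and the commands differ for NetworkManager-managed systems):

```sh
networkctl status ens817f0                       # lease, DNS and gateway obtained for the port
ip -4 addr show ens817f0                         # current address and its valid/preferred lifetimes
journalctl -u systemd-networkd | grep -i dhcp    # lease renewals, address changes, NAKs
```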
Could you share how you are creating the service and could you share the logs when this failure scenario happens?
Here you can see that the
New finding from the route table: it seems VPP created a route on the incorrect interface.
10.244.0.0/16 via 192.168.126.254 dev ens817f1 proto static mtu 1440 // this route is incorrect, the expected interface should be ens817f0
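As a quick check/workaround (a sketch, not an official fix; it assumes ens817f0 is the intended uplink, and the device name is a placeholder):

```sh
ip route show 10.244.0.0/16      # the pod-CIDR route that was installed
ip route get 192.168.126.254     # which device the kernel resolves the nexthop through

# Repoint the route at the intended interface
sudo ip route replace 10.244.0.0/16 via 192.168.126.254 dev ens817f0 mtu 1440
```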
Continuing to update with new findings.
@onong Could you please double-check whether this is a bug? Thanks.
Hi @dylan2intel, thank you for sharing the detailed logs and for all the digging! Looking more closely, the following might explain some of the anomalies:
These are conflicting routes. Which of the two interfaces, ens817f0/ens817f1, do packets destined for 192.168.126.0/23 go to? This would also explain why things work when only one port is used or when different subnets are used.
The route in question is just config magic that we do so that the pods are reachable from Linux as well as VPP. The interface chosen, ens817f1, is incorrect, but I believe it is caused by the fact that the nexthop 192.168.126.254 is reachable via both ens817f0 and ens817f1 and the kernel chose ens817f1.
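One way to see this conflict (a sketch, using the interface names from this thread): with both ports addressed in 192.168.126.0/23 there are two connected routes to the same prefix, and the kernel picks one of them when resolving the nexthop.

```sh
ip -br -4 addr show              # both ports carry an address in the /23
ip route show 192.168.126.0/23   # expect one connected route per port
ip route get 192.168.126.254     # shows which port the kernel actually chose
```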
BTW, with continued observation I found that even when the IP route is correct, the install may still get stuck on the coredns pod creating, but if I re-run it, it succeeds. Very strange; the good news is that the failure frequency has become lower. The next hop really is magic: suppose the networking is switched to DHCP and the two ports happen to be allocated IPs in the same subnet, how can it work?
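When it gets stuck at coredns, a few generic checks help narrow down whether it is a scheduling, CNI, or dataplane problem (a sketch; the calico-vpp-node pod name is the one from this thread and will differ elsewhere):

```sh
kubectl -n kube-system describe pod -l k8s-app=kube-dns   # events usually show CNI/sandbox errors
kubectl -n calico-vpp-dataplane get pods -o wide          # is the dataplane pod itself Ready?
kubectl -n calico-vpp-dataplane exec -it calico-vpp-node-lgklc -c vpp -- \
  vppctl show int addr                                    # interfaces/addresses as seen by VPP
```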
This is strange. Could you share the
Not sure I understood. Does your use case require both ports to be in the same subnet? You may run into
Hi @onong, bad luck: it reproduced today even with the correct route. Here is some information to help address the root cause. kubectl get pods -A -o wide
ip route
kubectl exec -it -n calico-vpp-dataplane calico-vpp-node-lgklc -c vpp -- bash
vppctl show cnat translation
vppctl sh ip fib
It's not that my use case requires both ports; it's just that Intel's NIC physically has two ports.
New finding: it always happens in the scenario where the created … Here is a workaround: when I change the physical MAC address back, the pods fall back to ready.
Can we fix it by always specifying the physical mac address to create the |
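For reference, the current and permanent MAC of the uplink can be compared and restored like this (a generic iproute2/ethtool sketch, not a Calico/VPP-specific fix; interface name and MAC are placeholders):

```sh
ip link show ens817f0        # MAC currently in use
ethtool -P ens817f0          # permanent (burned-in) MAC of the port

# Restore the physical MAC if something rewrote it
sudo ip link set dev ens817f0 down
sudo ip link set dev ens817f0 address 00:11:22:33:44:55
sudo ip link set dev ens817f0 up
```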
I kind of had the same problems with the coredns and calico-kube-controllers pods, but doing the following saved my life!
Finally, I'm glad that your problem was resolved, though I don't understand the details much, being a newbie and all :)
Environment
Issue description
Following the official getting-started guide to install Calico VPP fails to bring up an on-premise k8s cluster.
How does it work when the k8s cluster node IP is bound to the DPDK driver by Calico VPP while the coredns pod is initializing and not yet ready? coredns reaches the DNS server over UDP via the kernel driver.
To Reproduce
Steps to reproduce the behavior:
Two network interfaces: one 10G NIC for the public IP, another 100G NIC for the private IP.
k8s will use the 100G NIC to set up the cluster; its IP will be the node IP (kubectl get node -o wide).
Load the vfio-pci kernel driver:
echo "vfio-pci" > /etc/modules-load.d/95-vpp.conf
modprobe vfio-pci
Set up hugepages and restart the runtime and kubelet:
echo "vm.nr_hugepages = 512" >> /etc/sysctl.conf
sysctl -p
sudo systemctl restart containerd.service
sudo systemctl restart kubelet.service
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.24.5/manifests/tigera-operator.yaml
curl -o installation-default.yaml https://raw.githubusercontent.com/projectcalico/vpp-dataplane/v3.24.0/yaml/calico/installation-default.yaml
edit installation-default.yaml, specifying cidr as 10.244.0.0/16 (see the sketch after these steps)
kubectl apply -f installation-default.yaml
curl -o calico-vpp.yaml https://raw.githubusercontent.com/projectcalico/vpp-dataplane/v3.24.0/yaml/generated/calico-vpp.yaml
edit calico-vpp.yaml, specifying vpp_uplink_driver as dpdk and vpp_dataplane_interface as the interface holding the node IP (kubectl get nodes -o wide); see the sketch after these steps
kubectl apply -f calico-vpp.yaml
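For the two edit steps above, the relevant snippets look roughly like the following. This is a sketch based on the v3.24 manifests; the ConfigMap name and any value not quoted in this issue (e.g. the interface name) are assumptions to be checked against the downloaded files.

```yaml
# installation-default.yaml -- set the pod CIDR on the operator Installation resource
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    ipPools:
      - cidr: 10.244.0.0/16          # pod CIDR used in this issue
---
# calico-vpp.yaml -- ConfigMap consumed by the calico-vpp-node daemonset
apiVersion: v1
kind: ConfigMap
metadata:
  name: calico-vpp-config            # name assumed from the generated manifest
  namespace: calico-vpp-dataplane
data:
  vpp_dataplane_interface: ens817f0  # the interface that holds the node IP (placeholder)
  vpp_uplink_driver: "dpdk"          # uplink driver, per the issue title
```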
Expected behavior
The k8s cluster installs successfully with Calico VPP as the CNI.
Additional context