VPP-dataplane Setup Failure, CoreDNS and calico-kube-controllers fail to reach kubeAPI Server #685

Open
umarfarooq-git opened this issue Apr 1, 2024 · 18 comments

@umarfarooq-git

Environment

  • Calico/VPP version: v3.27.0
  • all pods in calico-system namespace: v3.27.2
  • tigera-operator: v1.32.5
  • Kubernetes version:
    Client Version: v1.28.8
    Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
    Server Version: v1.28.8
  • Deployment type: Vagrant VM with following details.
    IP_NW = "192.168.56."
    MASTER_IP_START = 1
    NUM_MASTER_NODE = 1  # assumed: not shown in the original snippet; one master matches node IP 192.168.56.2

    Vagrant.configure("2") do |config|
      config.vm.box = "ubuntu/bionic64"
      config.vm.box_check_update = false

      # Provision master nodes
      (1..NUM_MASTER_NODE).each do |i|
        config.vm.define "kubemaster" do |node|
          # Name shown in the GUI
          node.vm.provider "virtualbox" do |vb|
            vb.name = "kubemaster"
            vb.memory = 4096
            vb.cpus = 4
          end
          node.vm.hostname = "kubemaster"
          # First private network: 192.168.56.2 when i = 1
          node.vm.network :private_network, ip: IP_NW + "#{MASTER_IP_START + i}"
          node.vm.network "forwarded_port", guest: 22, host: "#{2710 + i}"
          # Second private network, in the same 192.168.56.0/24 range
          node.vm.network "private_network", ip: "192.168.56.10", virtualbox__hostonly: true
          node.vm.provision "setup-hosts", :type => "shell", :path => "ubuntu/vagrant/setup-hosts.sh" do |s|
            s.args = ["enp0s8"]
          end
          node.vm.provision "setup-dns", type: "shell", :path => "ubuntu/update-dns.sh"
        end
      end
    end
  • Network configuration: (screenshot)

Issue description
CoreDNS and calico-kube-controllers fail to run and end up in CrashLoopBackOff. Both pods are probably trying to reach the API server at the wrong IP. Logs are provided below.

To Reproduce
Steps to reproduce the behavior:

  1. Bring up the VM according to the given Vagrant settings.
  2. Disabled the firewall on all machines (as root): ufw disable
  3. Disable swap
    sudo swapoff -a
  4. Forwarding IPv4 and letting iptables see bridged traffic
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF

sudo modprobe overlay
sudo modprobe br_netfilter

# sysctl params required by setup, params persist across reboots
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF

# Apply sysctl params without reboot
sudo sysctl --system
  5. Installed the containerd runtime and configured systemd as the cgroup driver by putting the following in /etc/containerd/config.toml:
version = 2
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true
  6. Installed kubeadm, kubelet and kubectl from the apt repository.
  7. Initialized the K8s cluster as follows:
    kubeadm init --pod-network-cidr=192.168.0.0/16 --apiserver-advertise-address=192.168.56.2

To Install Calico with the VPP dataplane

I followed the instructions here: https://docs.tigera.io/calico/latest/getting-started/kubernetes/vpp/getting-started

  1. Loaded the vfio_pci module and allocated hugepages:
    echo "vfio_pci" > /etc/modules-load.d/95-vpp.conf
    modprobe vfio_pci
    echo "vm.nr_hugepages = 512" >> /etc/sysctl.conf
    sysctl -p
    systemctl restart kubelet

  2. kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.2/manifests/tigera-operator.yaml

  3. kubectl apply -f https://raw.githubusercontent.com/projectcalico/vpp-dataplane/v3.27.0/yaml/calico/installation-default.yaml

  4. curl -o calico-vpp.yaml https://raw.githubusercontent.com/projectcalico/vpp-dataplane/v3.27.0/yaml/generated/calico-vpp.yaml
    I changed only vpp_dataplane_interface in the ConfigMap. Everything else is unchanged, including service_prefix, whose default value of 10.96.0.0/12 matches my cluster.
    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: calico-config
      namespace: calico-vpp-dataplane
    data:
      service_prefix: 10.96.0.0/12
      vpp_dataplane_interface: enp0s8
      vpp_uplink_driver: ""

I have spent days on this 😞
I want to find out whether there is an issue with this version of Calico or whether I am doing something wrong, because it works like a charm if I simply configure the Calico CNI using the following link and don't set up the VPP dataplane at all.
kubectl create -f https://docs.projectcalico.org/v3.15/manifests/calico.yaml

Expected behavior
I want to set up the Calico VPP dataplane to test the different available VPP drivers, such as DPDK, for traffic acceleration.

Additional context
Logs of the failing pods.

k get pods -A -o wide
(screenshot)

calico-kube-controllers
(screenshot)

(screenshot)

Similar issues
#217
projectcalico/calico#6227

@lwr20 @AloysAugustin @Josh-Tigera Would be grateful for your help everyone.

@onong
Collaborator

onong commented Apr 1, 2024

Hi @umarfarooq-git, sorry to hear about the troubles you have been having. Could you share the calico/vpp ds logs?

kubectl logs -n calico-vpp-dataplane calico-vpp-node-XYZ -c vpp

@umarfarooq-git
Author

@onong, thank you for your response. The calico-vpp-node logs are below.

vagrant@kubemaster:$ sudo kubectl logs calico-vpp-node-2p6kj -n calico-vpp-dataplane -c vpp
time="2024-04-01T02:51:37Z" level=info msg="Version info\nImage tag : ab81a77\nVPP-dataplane version : ab81a77 Release v3.27.0\nVPP Version : 24.02-rc0~8-g9db45f6ae\nBinapi-generator version : v0.8.0\nVPP Base commit : 06efd532e gerrit:34726/3 interface: add buffer stats api\n------------------ Cherry picked commits --------------------\ncapo: Calico Policies plugin\nacl: acl-plugin custom policies\ncnat: [WIP] no k8s maglev from pods\npbl: Port based balancer\ngerrit:40078/3 vnet: allow format deleted swifidx\ngerrit:40090/3 cnat: undo fib_entry_contribute_forwarding\ngerrit:39507/13 cnat: add flow hash config to cnat translation\ngerrit:34726/3 interface: add buffer stats api\n-------------------------------------------------------------\n"
time="2024-04-01T02:51:37Z" level=info msg="Config:NODENAME=kubemaster"
time="2024-04-01T02:51:37Z" level=info msg="Config:SERVICE_PREFIX=[10.96.0.0/12]"
time="2024-04-01T02:51:37Z" level=info msg="Config:CALICOVPP_NATIVE_DRIVER="
time="2024-04-01T02:51:37Z" level=info msg="Config:CALICOVPP_INIT_SCRIPT_TEMPLATE="
time="2024-04-01T02:51:37Z" level=info msg="Config:CALICOVPP_HOOK_BEFORE_IF_READ=#!/bin/sh\n\nHOOK="$0"\nchroot /host /bin/sh <<EOSCRIPT\n\nfix_dns () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo "default_hook: system is using NetworkManager; fixing dns..."\n sed -i "s/\[main\]/\[main\]\ndns=none/" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nundo_dns_fix () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo "default_hook: system is using NetworkManager; undoing dns fix..."\n sed -i "0,/dns=none/{/dns=none/d;}" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nrestart_network () {\n if systemctl status systemd-networkd > /dev/null 2>&1; then\n echo "default_hook: system is using systemd-networkd; restarting..."\n systemctl restart systemd-networkd\n elif systemctl status NetworkManager > /dev/null 2>&1; then\n echo "default_hook: system is using NetworkManager; restarting..."\n systemctl restart NetworkManager\n elif systemctl status networking > /dev/null 2>&1; then\n echo "default_hook: system is using networking service; restarting..."\n systemctl restart networking\n elif systemctl status network > /dev/null 2>&1; then\n echo "default_hook: system is using network service; restarting..."\n systemctl restart network\n else\n echo "default_hook: Networking backend not detected, network configuration may fail"\n fi\n}\n\nif which systemctl > /dev/null; then\n echo "default_hook: using systemctl..."\nelse\n echo "default_hook: Init system not supported, network configuration may fail"\n exit 1\nfi\n\nif [ "$HOOK" = "BEFORE_VPP_RUN" ]; then\n fix_dns\nelif [ "$HOOK" = "VPP_RUNNING" ]; then\n restart_network\nelif [ "$HOOK" = "VPP_DONE_OK" ]; then\n undo_dns_fix\n restart_network\nelif [ "$HOOK" = "VPP_ERRORED" ]; then\n undo_dns_fix\n restart_network\nfi\n\nEOSCRIPT\n"
time="2024-04-01T02:51:37Z" level=info msg="Config:CALICOVPP_FEATURE_GATES={}"
time="2024-04-01T02:51:37Z" level=info msg="Config:CALICOVPP_LOG_FORMAT="
time="2024-04-01T02:51:37Z" level=info msg="Config:CALICOVPP_CONFIG_TEMPLATE=unix {\n nodaemon\n full-coredump\n cli-listen /var/run/vpp/cli.sock\n pidfile /run/vpp/vpp.pid\n exec /etc/vpp/startup.exec\n}\napi-trace { on }\ncpu {\n workers 0\n}\nsocksvr {\n socket-name /var/run/vpp/vpp-api.sock\n}\nplugins {\n plugin default { enable }\n plugin dpdk_plugin.so { disable }\n plugin calico_plugin.so { enable }\n plugin ping_plugin.so { disable }\n plugin dispatch_trace_plugin.so { enable }\n}\nbuffers {\n buffers-per-numa 131072\n}"
time="2024-04-01T02:51:37Z" level=info msg="Config:CALICOVPP_HOOK_VPP_ERRORED=#!/bin/sh\n\nHOOK="$0"\nchroot /host /bin/sh <<EOSCRIPT\n\nfix_dns () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo "default_hook: system is using NetworkManager; fixing dns..."\n sed -i "s/\[main\]/\[main\]\ndns=none/" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nundo_dns_fix () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo "default_hook: system is using NetworkManager; undoing dns fix..."\n sed -i "0,/dns=none/{/dns=none/d;}" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nrestart_network () {\n if systemctl status systemd-networkd > /dev/null 2>&1; then\n echo "default_hook: system is using systemd-networkd; restarting..."\n systemctl restart systemd-networkd\n elif systemctl status NetworkManager > /dev/null 2>&1; then\n echo "default_hook: system is using NetworkManager; restarting..."\n systemctl restart NetworkManager\n elif systemctl status networking > /dev/null 2>&1; then\n echo "default_hook: system is using networking service; restarting..."\n systemctl restart networking\n elif systemctl status network > /dev/null 2>&1; then\n echo "default_hook: system is using network service; restarting..."\n systemctl restart network\n else\n echo "default_hook: Networking backend not detected, network configuration may fail"\n fi\n}\n\nif which systemctl > /dev/null; then\n echo "default_hook: using systemctl..."\nelse\n echo "default_hook: Init system not supported, network configuration may fail"\n exit 1\nfi\n\nif [ "$HOOK" = "BEFORE_VPP_RUN" ]; then\n fix_dns\nelif [ "$HOOK" = "VPP_RUNNING" ]; then\n restart_network\nelif [ "$HOOK" = "VPP_DONE_OK" ]; then\n undo_dns_fix\n restart_network\nelif [ "$HOOK" = "VPP_ERRORED" ]; then\n undo_dns_fix\n restart_network\nfi\n\nEOSCRIPT\n"
time="2024-04-01T02:51:37Z" level=info msg="Config:CALICOVPP_SWAP_DRIVER="
time="2024-04-01T02:51:37Z" level=info msg="Config:CALICOVPP_CONFIG_EXEC_TEMPLATE="
time="2024-04-01T02:51:37Z" level=info msg="Config:CALICOVPP_IPSEC_IKEV2_PSK="
time="2024-04-01T02:51:37Z" level=info msg="Config:CALICOVPP_DEBUG={}"
time="2024-04-01T02:51:37Z" level=info msg="Config:CALICOVPP_INTERFACES={\n "defaultPodIfSpec": {\n "rx": 1,\n "tx": 1,\n "rxqsz": 0,\n "txqsz": 0,\n "isl3": true,\n "rxMode": 0\n },\n "maxPodIfSpec": {\n "rx": 10,\n "tx": 10,\n "rxqsz": 1024,\n "txqsz": 1024,\n "isl3": null,\n "rxMode": 0\n },\n "vppHostTapSpec": {\n "rx": 1,\n "tx": 1,\n "rxqsz": 1024,\n "txqsz": 1024,\n "isl3": false,\n "rxMode": 0\n },\n "uplinkInterfaces": [\n {\n "rx": 0,\n "tx": 0,\n "rxqsz": 0,\n "txqsz": 0,\n "isl3": null,\n "rxMode": 0,\n "isMain": false,\n "physicalNetworkName": "",\n "interfaceName": "enp0s8",\n "vppDriver": "",\n "newDriver": "",\n "annotations": null,\n "mtu": 0\n }\n ]\n}"
time="2024-04-01T02:51:37Z" level=info msg="Config:CALICOVPP_IPSEC={\n "nbAsyncCryptoThreads": 0,\n "extraAddresses": 0\n}"
time="2024-04-01T02:51:37Z" level=info msg="Config:CALICOVPP_SRV6={\n "localsidPool": "",\n "policyPool": ""\n}"
time="2024-04-01T02:51:37Z" level=info msg="Config:CALICOVPP_INITIAL_CONFIG={\n "vppStartupSleepSeconds": 1,\n "corePattern": "/var/lib/vpp/vppcore.%e.%p",\n "extraAddrCount": 0,\n "ifConfigSavePath": "",\n "defaultGWs": "",\n "redirectToHostRules": null\n}"
time="2024-04-01T02:51:37Z" level=info msg="Config:CALICOVPP_HOOK_VPP_RUNNING=#!/bin/sh\n\nHOOK="$0"\nchroot /host /bin/sh <<EOSCRIPT\n\nfix_dns () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo "default_hook: system is using NetworkManager; fixing dns..."\n sed -i "s/\[main\]/\[main\]\ndns=none/" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nundo_dns_fix () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo "default_hook: system is using NetworkManager; undoing dns fix..."\n sed -i "0,/dns=none/{/dns=none/d;}" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nrestart_network () {\n if systemctl status systemd-networkd > /dev/null 2>&1; then\n echo "default_hook: system is using systemd-networkd; restarting..."\n systemctl restart systemd-networkd\n elif systemctl status NetworkManager > /dev/null 2>&1; then\n echo "default_hook: system is using NetworkManager; restarting..."\n systemctl restart NetworkManager\n elif systemctl status networking > /dev/null 2>&1; then\n echo "default_hook: system is using networking service; restarting..."\n systemctl restart networking\n elif systemctl status network > /dev/null 2>&1; then\n echo "default_hook: system is using network service; restarting..."\n systemctl restart network\n else\n echo "default_hook: Networking backend not detected, network configuration may fail"\n fi\n}\n\nif which systemctl > /dev/null; then\n echo "default_hook: using systemctl..."\nelse\n echo "default_hook: Init system not supported, network configuration may fail"\n exit 1\nfi\n\nif [ "$HOOK" = "BEFORE_VPP_RUN" ]; then\n fix_dns\nelif [ "$HOOK" = "VPP_RUNNING" ]; then\n restart_network\nelif [ "$HOOK" = "VPP_DONE_OK" ]; then\n undo_dns_fix\n restart_network\nelif [ "$HOOK" = "VPP_ERRORED" ]; then\n undo_dns_fix\n restart_network\nfi\n\nEOSCRIPT\n"
time="2024-04-01T02:51:37Z" level=info msg="Config:CALICOVPP_HOOK_VPP_DONE_OK=#!/bin/sh\n\nHOOK="$0"\nchroot /host /bin/sh <<EOSCRIPT\n\nfix_dns () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo "default_hook: system is using NetworkManager; fixing dns..."\n sed -i "s/\[main\]/\[main\]\ndns=none/" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nundo_dns_fix () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo "default_hook: system is using NetworkManager; undoing dns fix..."\n sed -i "0,/dns=none/{/dns=none/d;}" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nrestart_network () {\n if systemctl status systemd-networkd > /dev/null 2>&1; then\n echo "default_hook: system is using systemd-networkd; restarting..."\n systemctl restart systemd-networkd\n elif systemctl status NetworkManager > /dev/null 2>&1; then\n echo "default_hook: system is using NetworkManager; restarting..."\n systemctl restart NetworkManager\n elif systemctl status networking > /dev/null 2>&1; then\n echo "default_hook: system is using networking service; restarting..."\n systemctl restart networking\n elif systemctl status network > /dev/null 2>&1; then\n echo "default_hook: system is using network service; restarting..."\n systemctl restart network\n else\n echo "default_hook: Networking backend not detected, network configuration may fail"\n fi\n}\n\nif which systemctl > /dev/null; then\n echo "default_hook: using systemctl..."\nelse\n echo "default_hook: Init system not supported, network configuration may fail"\n exit 1\nfi\n\nif [ "$HOOK" = "BEFORE_VPP_RUN" ]; then\n fix_dns\nelif [ "$HOOK" = "VPP_RUNNING" ]; then\n restart_network\nelif [ "$HOOK" = "VPP_DONE_OK" ]; then\n undo_dns_fix\n restart_network\nelif [ "$HOOK" = "VPP_ERRORED" ]; then\n undo_dns_fix\n restart_network\nfi\n\nEOSCRIPT\n"
time="2024-04-01T02:51:37Z" level=info msg="Config:CALICOVPP_LOG_LEVEL=info"
time="2024-04-01T02:51:37Z" level=info msg="Config:CALICOVPP_BGP_LOG_LEVEL=INFO"
time="2024-04-01T02:51:37Z" level=info msg="Config:CALICOVPP_GRACEFUL_SHUTDOWN_TIMEOUT=10s"
time="2024-04-01T02:51:37Z" level=info msg="Config:CALICOVPP_INTERFACE="
time="2024-04-01T02:51:37Z" level=info msg="Config:CALICOVPP_HOOK_BEFORE_VPP_RUN=#!/bin/sh\n\nHOOK="$0"\nchroot /host /bin/sh <<EOSCRIPT\n\nfix_dns () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo "default_hook: system is using NetworkManager; fixing dns..."\n sed -i "s/\[main\]/\[main\]\ndns=none/" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nundo_dns_fix () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo "default_hook: system is using NetworkManager; undoing dns fix..."\n sed -i "0,/dns=none/{/dns=none/d;}" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nrestart_network () {\n if systemctl status systemd-networkd > /dev/null 2>&1; then\n echo "default_hook: system is using systemd-networkd; restarting..."\n systemctl restart systemd-networkd\n elif systemctl status NetworkManager > /dev/null 2>&1; then\n echo "default_hook: system is using NetworkManager; restarting..."\n systemctl restart NetworkManager\n elif systemctl status networking > /dev/null 2>&1; then\n echo "default_hook: system is using networking service; restarting..."\n systemctl restart networking\n elif systemctl status network > /dev/null 2>&1; then\n echo "default_hook: system is using network service; restarting..."\n systemctl restart network\n else\n echo "default_hook: Networking backend not detected, network configuration may fail"\n fi\n}\n\nif which systemctl > /dev/null; then\n echo "default_hook: using systemctl..."\nelse\n echo "default_hook: Init system not supported, network configuration may fail"\n exit 1\nfi\n\nif [ "$HOOK" = "BEFORE_VPP_RUN" ]; then\n fix_dns\nelif [ "$HOOK" = "VPP_RUNNING" ]; then\n restart_network\nelif [ "$HOOK" = "VPP_DONE_OK" ]; then\n undo_dns_fix\n restart_network\nelif [ "$HOOK" = "VPP_ERRORED" ]; then\n undo_dns_fix\n restart_network\nfi\n\nEOSCRIPT\n"
time="2024-04-01T02:51:37Z" level=info msg="-- Environment --"
time="2024-04-01T02:51:37Z" level=info msg="Hugepages 512"
time="2024-04-01T02:51:37Z" level=info msg="KernelVersion 4.15.0-212"
time="2024-04-01T02:51:37Z" level=info msg="Drivers map[uio_pci_generic:false vfio-pci:true]"
time="2024-04-01T02:51:37Z" level=info msg="initial iommu status N"
time="2024-04-01T02:51:37Z" level=info msg="-- Interface Spec --"
time="2024-04-01T02:51:37Z" level=info msg="Interface Name: enp0s8"
time="2024-04-01T02:51:37Z" level=info msg="Native Driver: "
time="2024-04-01T02:51:37Z" level=info msg="New Drive Name: "
time="2024-04-01T02:51:37Z" level=info msg="PHY target #Queues rx:0 tx:0"
time="2024-04-01T02:51:37Z" level=info msg="Tap MTU: 0"
time="2024-04-01T02:51:37Z" level=info msg="-- Interface config --"
time="2024-04-01T02:51:37Z" level=info msg="Node IP4: 192.168.56.2/24"
time="2024-04-01T02:51:37Z" level=info msg="Node IP6: "
time="2024-04-01T02:51:37Z" level=info msg="PciId: 0000:00:08.0"
time="2024-04-01T02:51:37Z" level=info msg="Driver: e1000"
time="2024-04-01T02:51:37Z" level=info msg="Linux IF was up ? true"
time="2024-04-01T02:51:37Z" level=info msg="Promisc was on ? true"
time="2024-04-01T02:51:37Z" level=info msg="DoSwapDriver: false"
time="2024-04-01T02:51:37Z" level=info msg="Mac: 08:00:27:10:1e:ec"
time="2024-04-01T02:51:37Z" level=info msg="Addresses: [192.168.56.2/24 enp0s8,fe80::a00:27ff:fe10:1eec/64]"
time="2024-04-01T02:51:37Z" level=info msg="Routes: [{Ifindex: 3 Dst: fe80::/64 Src: Gw: Flags: [] Table: 254 Realm: 0}, {Ifindex: 3 Dst: 192.168.56.0/24 Src: 192.168.56.2 Gw: Flags: [] Table: 254 Realm: 0}]"
time="2024-04-01T02:51:37Z" level=info msg="PHY original #Queues rx:1 tx:1"
time="2024-04-01T02:51:37Z" level=info msg="MTU 1500"
time="2024-04-01T02:51:37Z" level=info msg="isTunTap false"
time="2024-04-01T02:51:37Z" level=info msg="isVeth false"
time="2024-04-01T02:51:37Z" level=info msg="Running with uplink af_packet"
default_hook: using systemctl...
default_hook: using systemctl...
time="2024-04-01T02:51:37Z" level=info msg="VPP started [PID 28433]"
time="2024-04-01T02:51:38Z" level=info msg="Waiting for VPP... [0/10]"
vpp[28433]: perfmon: skipping source 'intel-uncore' - intel_uncore_init: no uncore units found
vpp[28433]: tls_init_ca_chain:1086: Could not initialize TLS CA certificates
vpp[28433]: tls_openssl_init:1209: failed to initialize TLS CA chain
vpp[28433]: vat-plug/load: vat_plugin_register: idpf plugin not loaded...
vpp[28433]: vat-plug/load: vat_plugin_register: oddbuf plugin not loaded...
time="2024-04-01T02:51:41Z" level=info msg="Created AF_PACKET interface 1"
time="2024-04-01T02:51:41Z" level=info msg="tagging interface [1] with: main-enp0s8"
time="2024-04-01T02:51:41Z" level=info msg="Adding address 192.168.56.2/24 enp0s8 to uplink interface"
time="2024-04-01T02:51:41Z" level=info msg="Not adding address fe80::a00:27ff:fe10:1eec/64 to uplink interface (vpp requires /128 link-local)"
time="2024-04-01T02:51:41Z" level=info msg="Creating Linux side interface"
time="2024-04-01T02:51:41Z" level=info msg="Adding address 192.168.56.2/24 enp0s8 to tap interface"
time="2024-04-01T02:51:41Z" level=info msg="Not adding address fe80::a00:27ff:fe10:1eec/64 to data interface (vpp requires /128 link-local)"
time="2024-04-01T02:51:41Z" level=info msg="Adding ND proxy for address fe80::a00:27ff:fe10:1eec"
time="2024-04-01T02:51:41Z" level=info msg="Adding address 192.168.56.2/24 enp0s8 to tap interface"
time="2024-04-01T02:51:41Z" level=info msg="Adding address fe80::a00:27ff:fe10:1eec/64 to tap interface"
time="2024-04-01T02:51:41Z" level=warning msg="add addr fe80::a00:27ff:fe10:1eec/64 via vpp EEXIST, file exists"
time="2024-04-01T02:51:41Z" level=info msg="Adding route {Ifindex: 19 Dst: fe80::/64 Src: Gw: Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-01T02:51:41Z" level=info msg="add route via vpp : {Ifindex: 19 Dst: fe80::/64 Src: Gw: Flags: [] Table: 254 Realm: 0} already exists"
time="2024-04-01T02:51:41Z" level=info msg="Adding route {Ifindex: 19 Dst: 192.168.56.0/24 Src: 192.168.56.2 Gw: Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-01T02:51:41Z" level=info msg="add route via vpp : {Ifindex: 19 Dst: 192.168.56.0/24 Src: 192.168.56.2 Gw: Flags: [] Table: 254 Realm: 0} already exists"
time="2024-04-01T02:51:41Z" level=info msg="Using 192.168.56.254 as next hop for cluster IPv4 routes"
time="2024-04-01T02:51:41Z" level=info msg="Setting BGP nodeIP 192.168.56.2/24"
time="2024-04-01T02:51:41Z" level=info msg="Updating node, version = 1103, metaversion = 1103"
default_hook: using systemctl...
default_hook: system is using systemd-networkd; restarting...
time="2024-04-01T02:51:42Z" level=info msg="Received signal child exited, vpp index 1"
time="2024-04-01T02:51:42Z" level=info msg="Ignoring SIGCHLD for pid 0"
time="2024-04-01T02:51:42Z" level=info msg="Done with signal child exited"
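
With logs this long, it can help to drop the routine level=info entries before reading. A minimal Python sketch, assuming every structured entry follows the time="..." level=... msg="..." layout seen above; the function name non_info_lines is made up for illustration. Lines without a level field (raw vpp[PID] output, default_hook messages) are kept, since they may carry the real errors.

```python
import re

# Matches the level field of entries shaped like: time="..." level=info msg="..."
LEVEL_RE = re.compile(r'^time="[^"]*" level=(\w+) ')

def non_info_lines(lines):
    """Yield every line that is not a level=info entry.

    Lines that do not match the structured layout are passed through
    unchanged (minus the trailing newline).
    """
    for line in lines:
        match = LEVEL_RE.match(line)
        if match is None or match.group(1) != "info":
            yield line.rstrip("\n")

# Three lines copied from the log above.
sample = [
    'time="2024-04-01T02:51:41Z" level=info msg="Creating Linux side interface"',
    'time="2024-04-01T02:51:41Z" level=warning msg="add addr fe80::a00:27ff:fe10:1eec/64 via vpp EEXIST, file exists"',
    'vpp[28433]: tls_openssl_init:1209: failed to initialize TLS CA chain',
]

for kept in non_info_lines(sample):
    print(kept)
```

On the sample, only the warning entry and the raw vpp line are printed.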

@onong
Collaborator

onong commented Apr 1, 2024

You have enp0s8 and enp0s9 configured with ip addrs in the same subnet, 192.168.56.0/24. That might be causing confusion in the routing. Bringing down enp0s9 might be a good idea among other things.

Secondly, --pod-network-cidr=192.168.0.0/16 overlaps with the subnet used by enp0s8/9. Please use a different CIDR.
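
Both points can be checked with Python's standard ipaddress module. This is only a sketch using the values from this thread. It assumes that 192.168.56.10 from the Vagrantfile is the address on enp0s9 and that it uses a /24 mask; neither is confirmed by the logs.

```python
import ipaddress
from collections import defaultdict

# Values from this thread. The enp0s9 address and mask are assumptions
# (the Vagrantfile gives 192.168.56.10 without naming the interface).
interfaces = {
    "enp0s8": "192.168.56.2/24",
    "enp0s9": "192.168.56.10/24",
}
pod_cidr = ipaddress.ip_network("192.168.0.0/16")  # --pod-network-cidr

# 1. Group interfaces by the subnet their address belongs to.
by_subnet = defaultdict(list)
for name, addr in interfaces.items():
    by_subnet[ipaddress.ip_interface(addr).network].append(name)

shared = {net: names for net, names in by_subnet.items() if len(names) > 1}
for net, names in shared.items():
    print(f"{', '.join(names)} share subnet {net}")

# 2. Check whether the pod CIDR overlaps any node subnet.
overlaps = [net for net in by_subnet if pod_cidr.overlaps(net)]
for net in overlaps:
    print(f"pod CIDR {pod_cidr} overlaps node subnet {net}")
```

With these inputs both checks fire: the two interfaces land in 192.168.56.0/24, and that subnet sits inside 192.168.0.0/16.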

@umarfarooq-git
Author

@onong After spending hours, I got it working. Thank you very much.

Solution:

  • brought down enp0s9
  • changed the pod CIDR to --pod-network-cidr=10.244.0.0/16

I have another question that isn't directly related to this issue; I would be grateful for your response.

Does Calico VPP (in particular VPP's DPDK driver) work with KubeVirt if we want to accelerate the network traffic of VMs inside a K8s cluster?
Unfortunately, I was unable to find any related docs.

@onong
Collaborator

onong commented Apr 4, 2024

Could you elaborate on what you mean by "we want to accelerate network traffic of VM's inside K8S cluster."?

@ivansharamok

ivansharamok commented Apr 5, 2024

I hit a similar issue in my K8s cluster that uses Azure Compute instances and CentOS 8. I have NetworkManager managing networking on the hosts.

Environment:

  • Calico/VPP version: operator v3.27.2 / VPP v3.27.0
  • Kubernetes version: v1.28.8
  • networking on the hosts managed by NetworkManager
  • host OS: CentOS 8
  • Infra: Azure Compute instances with a single eth0 interface

When I deploy a netshoot pod right after applying the Calico installation and VPP manifests, the pod gets networked once calico-node comes up, and I get full DNS resolution from within that pod. However, any pod that I deploy after Calico VPP is fully initialized can't seem to reach the kube DNS service.

Here's what I get from a netshoot pod that was deployed before Calico VPP was fully initialized:

<<K9s-Shell>> Pod: default/netshoot | Container: netshoot
netshoot:~# nslookup kuberenetes
Server:		10.96.0.10
Address:	10.96.0.10#53

** server can't find kuberenetes: NXDOMAIN

netshoot:~# nslookup nginx-svc.uat
Server:		10.96.0.10
Address:	10.96.0.10#53

Name:	nginx-svc.uat.svc.cluster.local
Address: 10.103.29.82

netshoot:~# nslookup google.com
Server:		10.96.0.10
Address:	10.96.0.10#53

Non-authoritative answer:
Name:	google.com
Address: 142.251.211.238
Name:	google.com
Address: 2607:f8b0:400a:80b::200e

netshoot:~# nc -zvw2 10.96.0.10 53
Connection to 10.96.0.10 53 port [tcp/domain] succeeded!

Here's what I get from a netshoot2 pod that was deployed after Calico VPP was fully initialized:

<<K9s-Shell>> Pod: default/netshoot2 | Container: netshoot2
netshoot2:~# nslookup kuberentes
;; communications error to 10.96.0.10#53: timed out

netshoot2:~# nslookup nginx-svc.uat
;; communications error to 10.96.0.10#53: timed out

netshoot2:~# nslookup google.com
;; communications error to 10.96.0.10#53: timed out

netshoot2:~# nc -zvw2 10.96.0.10 53
nc: connect to 10.96.0.10 port 53 (tcp) timed out: Operation in progress
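
The nc -zvw2 check can also be scripted, for example to compare several pods. A minimal Python sketch; the helper name tcp_reachable is made up for illustration. It only tests a TCP connect, as nc -z does, so it says nothing about DNS over UDP.

```python
import socket

def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable networks.
        return False

# Example call from inside a pod (address taken from the output above):
#   tcp_reachable("10.96.0.10", 53)
```

Run from inside a pod, a False result for 10.96.0.10 port 53 corresponds to the timeout seen in netshoot2.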

I restarted one of my coredns pods, and I see these logs in it:

[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/ready: Still waiting on: "kubernetes"
[WARNING] plugin/kubernetes: starting server with unsynced Kubernetes API
.:53
[INFO] plugin/reload: Running configuration SHA512 = 591cf328cccc12bc490481273e738df59329c62c0b729d94e8b61db9961c2fa5f046dd37f1cf888b953814040d180f52594972691cd6ff41be96639138a43908
CoreDNS-1.10.1
linux/amd64, go1.20, 055b2c3
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://10.96.0.1:443/version": dial tcp 10.96.0.1:443: i/o timeout

Here are the NetworkManager logs from one of the hosts:

-- Logs begin at Fri 2024-04-05 17:04:39 UTC, end at Fri 2024-04-05 18:01:59 UTC. --
Apr 05 17:04:59 master systemd[1]: Starting Network Manager...
Apr 05 17:04:59 master NetworkManager[922]: <info>  [1712336699.9329] NetworkManager (version 1.32.10-4.el8) is starting... (for the first time)
Apr 05 17:04:59 master NetworkManager[922]: <info>  [1712336699.9335] Read config: /etc/NetworkManager/NetworkManager.conf (etc: 99-dhcp-timeout.conf)
Apr 05 17:04:59 master systemd[1]: Started Network Manager.
Apr 05 17:04:59 master NetworkManager[922]: <info>  [1712336699.9658] bus-manager: acquired D-Bus service "org.freedesktop.NetworkManager"
Apr 05 17:04:59 master NetworkManager[922]: <info>  [1712336699.9984] manager[0x5585b1af5040]: monitoring kernel firmware directory '/lib/firmware'.
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.0032] hostname: hostname: using hostnamed
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.0032] hostname: hostname changed from (none) to "master"
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.0037] dns-mgr[0x5585b1ad8250]: init: dns=default,systemd-resolved rc-manager=symlink
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.0149] Loaded device plugin: NMTeamFactory (/usr/lib64/NetworkManager/1.32.10-4.el8/libnm-device-plugin-team.so)
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.0150] manager: rfkill: Wi-Fi enabled by radio killswitch; enabled by state file
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.0152] manager: rfkill: WWAN enabled by radio killswitch; enabled by state file
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.0153] manager: Networking is enabled by state file
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.0162] dhcp-init: Using DHCP client 'internal'
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.0176] settings: Loaded settings plugin: ifcfg-rh ("/usr/lib64/NetworkManager/1.32.10-4.el8/libnm-settings-plugin-ifcfg-rh.so")
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.0176] settings: Loaded settings plugin: keyfile (internal)
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.0279] device (lo): carrier: link connected
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.0326] manager: (lo): new Generic device (/org/freedesktop/NetworkManager/Devices/1)
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.0416] manager: (eth0): new Ethernet device (/org/freedesktop/NetworkManager/Devices/2)
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.0480] device (eth0): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.0986] device (eth0): carrier: link connected
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.1125] device (eth0): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'managed')
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.1135] policy: auto-activating connection 'System eth0' (5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03)
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.1142] device (eth0): Activation: starting connection 'System eth0' (5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03)
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.1154] device (eth0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.1171] manager: NetworkManager state is now CONNECTING
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.1180] device (eth0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.1199] device (eth0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.1214] dhcp4 (eth0): activation: beginning transaction (timeout in 300 seconds)
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.1560] dhcp4 (eth0): state changed unknown -> bound, address=172.10.1.5
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.1577] device (eth0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.1742] device (eth0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.1746] device (eth0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.1752] manager: NetworkManager state is now CONNECTED_LOCAL
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.1757] manager: NetworkManager state is now CONNECTED_SITE
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.1758] policy: set 'System eth0' (eth0) as default for IPv4 routing and DNS
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.1771] device (eth0): Activation: successful, device activated.
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.1780] manager: NetworkManager state is now CONNECTED_GLOBAL
Apr 05 17:05:00 master NetworkManager[922]: <info>  [1712336700.1849] manager: startup complete
Apr 05 17:05:16 master systemd[1]: Reloading Network Manager.
Apr 05 17:05:17 master NetworkManager[922]: <info>  [1712336717.0440] audit: op="reload" arg="0" pid=1862 uid=0 result="success"
Apr 05 17:05:17 master NetworkManager[922]: <info>  [1712336717.0449] config: signal: SIGHUP (no changes from disk)
Apr 05 17:05:17 master systemd[1]: Reloaded Network Manager.
Apr 05 17:57:07 master systemd[1]: Stopping Network Manager...
Apr 05 17:57:07 master NetworkManager[922]: <info>  [1712339827.2388] caught SIGTERM, shutting down normally.
Apr 05 17:57:07 master NetworkManager[922]: <info>  [1712339827.2464] dhcp4 (eth0): canceled DHCP transaction
Apr 05 17:57:07 master NetworkManager[922]: <info>  [1712339827.2465] dhcp4 (eth0): state changed bound -> terminated
Apr 05 17:57:07 master NetworkManager[922]: <info>  [1712339827.2467] manager: NetworkManager state is now CONNECTED_SITE
Apr 05 17:57:07 master NetworkManager[922]: <info>  [1712339827.2768] exiting (success)
Apr 05 17:57:07 master systemd[1]: NetworkManager.service: Succeeded.
Apr 05 17:57:07 master systemd[1]: Stopped Network Manager.
Apr 05 17:57:07 master systemd[1]: Starting Network Manager...
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.3291] NetworkManager (version 1.32.10-4.el8) is starting... (after a restart)
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.3292] Read config: /etc/NetworkManager/NetworkManager.conf (etc: 99-dhcp-timeout.conf)
Apr 05 17:57:07 master systemd[1]: Started Network Manager.
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.3312] bus-manager: acquired D-Bus service "org.freedesktop.NetworkManager"
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.3422] manager[0x5571c321d0a0]: monitoring kernel firmware directory '/lib/firmware'.
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.5734] hostname: hostname: using hostnamed
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.5737] hostname: hostname changed from (none) to "master"
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.5741] dns-mgr[0x5571c31ff250]: init: dns=none,systemd-resolved rc-manager=unmanaged
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.5784] Loaded device plugin: NMTeamFactory (/usr/lib64/NetworkManager/1.32.10-4.el8/libnm-device-plugin-team.so)
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.5785] manager: rfkill: Wi-Fi enabled by radio killswitch; enabled by state file
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.5785] manager: rfkill: WWAN enabled by radio killswitch; enabled by state file
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.5787] manager: Networking is enabled by state file
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.5788] dhcp-init: Using DHCP client 'internal'
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.5795] settings: Loaded settings plugin: ifcfg-rh ("/usr/lib64/NetworkManager/1.32.10-4.el8/libnm-settings-plugin-ifcfg-rh.so")
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.5824] settings: Loaded settings plugin: keyfile (internal)
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.5856] device (lo): carrier: link connected
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.5860] manager: (lo): new Generic device (/org/freedesktop/NetworkManager/Devices/1)
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.5871] device (eth0): carrier: link connected
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.5879] manager: (eth0): new Ethernet device (/org/freedesktop/NetworkManager/Devices/2)
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.5896] manager: (eth0): assume: will attempt to assume matching connection 'System eth0' (5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03) (indicated)
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.5897] device (eth0): state change: unmanaged -> unavailable (reason 'connection-assumed', sys-iface-state: 'assume')
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.5905] device (eth0): state change: unavailable -> disconnected (reason 'connection-assumed', sys-iface-state: 'assume')
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.5938] device (eth0): Activation: starting connection 'System eth0' (5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03)
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.5960] device (eth0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'assume')
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.5965] device (eth0): state change: prepare -> config (reason 'none', sys-iface-state: 'assume')
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.5968] device (eth0): state change: config -> ip-config (reason 'none', sys-iface-state: 'assume')
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.5972] dhcp4 (eth0): activation: beginning transaction (timeout in 300 seconds)
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.6368] dhcp4 (eth0): state changed unknown -> bound, address=172.10.1.5
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.6422] device (eth0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'assume')
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.6470] device (eth0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'assume')
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.6473] device (eth0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'assume')
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.6479] manager: NetworkManager state is now CONNECTED_LOCAL
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.6487] manager: NetworkManager state is now CONNECTED_SITE
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.6489] policy: set 'System eth0' (eth0) as default for IPv4 routing and DNS
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.6497] device (eth0): Activation: successful, device activated.
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.6506] manager: NetworkManager state is now CONNECTED_GLOBAL
Apr 05 17:57:07 master NetworkManager[40556]: <info>  [1712339827.6511] manager: startup complete
Apr 05 17:57:10 master NetworkManager[40556]: <info>  [1712339830.3861] device (eth0): state change: activated -> unmanaged (reason 'removed', sys-iface-state: 'removed')
Apr 05 17:57:10 master NetworkManager[40556]: <info>  [1712339830.3871] dhcp4 (eth0): canceled DHCP transaction
Apr 05 17:57:10 master NetworkManager[40556]: <info>  [1712339830.3871] dhcp4 (eth0): state changed bound -> terminated
Apr 05 17:57:10 master NetworkManager[40556]: <info>  [1712339830.3889] manager: NetworkManager state is now DISCONNECTED
Apr 05 17:57:10 master NetworkManager[40556]: <warn>  [1712339830.3907] dns-sd-resolved[986b7b74fdcc1af0]: send-updates SetLinkDomains@2 failed: GDBus.Error:org.freedesktop.resolve1.NoSuchLink: Link 2 not known
Apr 05 17:57:10 master NetworkManager[40556]: <info>  [1712339830.5153] manager: (eth0): new Tun device (/org/freedesktop/NetworkManager/Devices/3)
Apr 05 17:57:10 master NetworkManager[40556]: <info>  [1712339830.6193] device (eth0): state change: unmanaged -> unavailable (reason 'connection-assumed', sys-iface-state: 'external')
Apr 05 17:57:10 master NetworkManager[40556]: <info>  [1712339830.6240] device (eth0): state change: unavailable -> disconnected (reason 'connection-assumed', sys-iface-state: 'external')
Apr 05 17:57:10 master NetworkManager[40556]: <info>  [1712339830.6249] device (eth0): Activation: starting connection 'eth0' (40c97394-9ec0-43b9-9948-67cc8534ed18)
Apr 05 17:57:10 master NetworkManager[40556]: <info>  [1712339830.6271] device (eth0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'external')
Apr 05 17:57:10 master NetworkManager[40556]: <info>  [1712339830.6274] device (eth0): state change: prepare -> config (reason 'none', sys-iface-state: 'external')
Apr 05 17:57:10 master NetworkManager[40556]: <info>  [1712339830.6277] device (eth0): state change: config -> ip-config (reason 'none', sys-iface-state: 'external')
Apr 05 17:57:10 master NetworkManager[40556]: <info>  [1712339830.6278] device (eth0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'external')
Apr 05 17:57:10 master NetworkManager[40556]: <info>  [1712339830.6317] device (eth0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'external')
Apr 05 17:57:10 master NetworkManager[40556]: <info>  [1712339830.6321] device (eth0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'external')
Apr 05 17:57:10 master NetworkManager[40556]: <info>  [1712339830.6330] manager: NetworkManager state is now CONNECTED_LOCAL
Apr 05 17:57:10 master NetworkManager[40556]: <info>  [1712339830.6335] device (eth0): Activation: successful, device activated.
Apr 05 17:57:10 master NetworkManager[40556]: <info>  [1712339830.6342] manager: NetworkManager state is now CONNECTED_GLOBAL
Apr 05 17:57:10 master systemd[1]: Stopping Network Manager...
Apr 05 17:57:10 master NetworkManager[40556]: <info>  [1712339830.7324] caught SIGTERM, shutting down normally.
Apr 05 17:57:10 master NetworkManager[40556]: <info>  [1712339830.7339] manager: NetworkManager state is now CONNECTED_LOCAL
Apr 05 17:57:11 master NetworkManager[40556]: <info>  [1712339831.0554] exiting (success)
Apr 05 17:57:11 master systemd[1]: NetworkManager.service: Succeeded.
Apr 05 17:57:11 master systemd[1]: Stopped Network Manager.
Apr 05 17:57:11 master systemd[1]: Starting Network Manager...
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1293] NetworkManager (version 1.32.10-4.el8) is starting... (after a restart)
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1294] Read config: /etc/NetworkManager/NetworkManager.conf (etc: 99-dhcp-timeout.conf)
Apr 05 17:57:11 master systemd[1]: Started Network Manager.
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1349] bus-manager: acquired D-Bus service "org.freedesktop.NetworkManager"
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1476] manager[0x5621ca5c4040]: monitoring kernel firmware directory '/lib/firmware'.
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1502] hostname: hostname: using hostnamed
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1502] hostname: hostname changed from (none) to "master"
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1507] dns-mgr[0x5621ca5a9250]: init: dns=none,systemd-resolved rc-manager=unmanaged
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1571] Loaded device plugin: NMTeamFactory (/usr/lib64/NetworkManager/1.32.10-4.el8/libnm-device-plugin-team.so)
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1572] manager: rfkill: Wi-Fi enabled by radio killswitch; enabled by state file
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1573] manager: rfkill: WWAN enabled by radio killswitch; enabled by state file
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1574] manager: Networking is enabled by state file
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1575] dhcp-init: Using DHCP client 'internal'
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1582] settings: Loaded settings plugin: ifcfg-rh ("/usr/lib64/NetworkManager/1.32.10-4.el8/libnm-settings-plugin-ifcfg-rh.so")
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1583] settings: Loaded settings plugin: keyfile (internal)
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1617] device (lo): carrier: link connected
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1621] manager: (lo): new Generic device (/org/freedesktop/NetworkManager/Devices/1)
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1660] manager: (eth0): new Tun device (/org/freedesktop/NetworkManager/Devices/2)
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1677] manager: (eth0): assume: will attempt to assume matching connection 'eth0' (40c97394-9ec0-43b9-9948-67cc8534ed18) (indicated)
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1679] device (eth0): state change: unmanaged -> unavailable (reason 'connection-assumed', sys-iface-state: 'assume')
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1688] device (eth0): state change: unavailable -> disconnected (reason 'connection-assumed', sys-iface-state: 'assume')
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1717] device (eth0): Activation: starting connection 'eth0' (40c97394-9ec0-43b9-9948-67cc8534ed18)
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1742] device (eth0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'assume')
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1746] device (eth0): state change: prepare -> config (reason 'none', sys-iface-state: 'assume')
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1749] device (eth0): state change: config -> ip-config (reason 'none', sys-iface-state: 'assume')
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1836] device (eth0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'assume')
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1849] device (eth0): ipv6: duplicate address check failed for the fe80::222:48ff:febc:738e/64 lft forever pref forever lifetime 1-0[4294967295,4294967295] dev 3 flags permanent,tentative src kernel address
Apr 05 17:57:11 master NetworkManager[40674]: <warn>  [1712339831.1922] acd[0x5621ca6580f0,3]: conflict for address 172.10.1.5 detected with host 02:CA:11:C0:FD:00 on interface 'eth0'
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1924] device (eth0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'assume')
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1928] device (eth0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'assume')
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1934] manager: NetworkManager state is now CONNECTED_LOCAL
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1941] manager: NetworkManager state is now CONNECTED_SITE
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1942] policy: set 'eth0' (eth0) as default for IPv4 routing and DNS
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1947] device (eth0): Activation: successful, device activated.
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1957] manager: NetworkManager state is now CONNECTED_GLOBAL
Apr 05 17:57:11 master NetworkManager[40674]: <info>  [1712339831.1969] manager: startup complete

I tried restarting NetworkManager on each host and then created a netshoot3 pod, but it shows the same behavior as netshoot2, i.e. no DNS resolution from within the pod.

I noticed that both of my cluster hosts (master, worker1) have these entries in the NetworkManager logs:

# on master host
Apr 05 17:57:11 master NetworkManager[40674]: <warn>  [1712339831.1922] acd[0x5621ca6580f0,3]: conflict for address 172.10.1.5 detected with host 02:CA:11:C0:FD:00 on interface 'eth0'

# on worker1 host
Apr 05 22:40:36 worker1 NetworkManager[64135]: <warn>  [1712356836.6727] acd[0x55a38f87a590,3]: conflict for address 172.10.1.4 detected with host 02:CA:11:C0:FD:00 on interface 'eth0'

Is this a typical message when VPP takes over the interface, or could it indicate some other problem?
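
To check whether the same conflicting MAC shows up on every node, the warnings can be pulled out of a longer log. This is a minimal sketch; the regular expression is written against the two log lines quoted above and is an assumption about the message format in other NetworkManager versions. The function name `find_acd_conflicts` is hypothetical.

```python
import re
from collections import defaultdict

# Matches NetworkManager address-conflict-detection (acd) warnings
# in the form quoted above. Other NetworkManager versions may word
# the message differently (assumption).
ACD_RE = re.compile(
    r"conflict for address (?P<ip>\d+\.\d+\.\d+\.\d+) "
    r"detected with host (?P<mac>[0-9A-Fa-f:]{17}) "
    r"on interface '(?P<iface>[^']+)'"
)


def find_acd_conflicts(lines):
    """Group (ip, interface) pairs by the MAC that answered for them."""
    by_mac = defaultdict(set)
    for line in lines:
        match = ACD_RE.search(line)
        if match:
            mac = match.group("mac").lower()
            by_mac[mac].add((match.group("ip"), match.group("iface")))
    return dict(by_mac)


# The two warnings from this thread, one per host.
sample = [
    "Apr 05 17:57:11 master NetworkManager[40674]: <warn>  [1712339831.1922] "
    "acd[0x5621ca6580f0,3]: conflict for address 172.10.1.5 detected with "
    "host 02:CA:11:C0:FD:00 on interface 'eth0'",
    "Apr 05 22:40:36 worker1 NetworkManager[64135]: <warn>  [1712356836.6727] "
    "acd[0x55a38f87a590,3]: conflict for address 172.10.1.4 detected with "
    "host 02:CA:11:C0:FD:00 on interface 'eth0'",
]

for mac, hits in find_acd_conflicts(sample).items():
    print(mac, sorted(hits))
```

The same function can be fed the lines of `journalctl -u NetworkManager` output saved to a file. A single MAC that conflicts with a different address on each host points away from an ordinary duplicate-IP situation between two machines.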

@umarfarooq-git
Author

@onong
I want to deploy Virtual Network Functions (VNFs) on VMs inside a k8s cluster using kubevirt. These VNFs have dependencies such as DPDK and SR-IOV. I am interested in providing DPDK support to the VMs through the Calico VPP plugin, but I am not sure whether that is really possible.

@umarfarooq-git
Author

umarfarooq-git commented Apr 6, 2024

@onong Could you please look into the following issue? I am stuck again while configuring Calico VPP on the same platform, this time with 2 nodes.

Setup:
1 master node
1 worker node

Details:
master node
image

worker node
image

Pod status
image

Problem1
At first I faced the same trouble as before: the CoreDNS pods and calico-kube-controllers were stuck in the ContainerCreating state.
Solution:
Configured NetworkManager

Problem2
Now the Calico pods are running, but control plane pods such as kube-controller-manager-kubemaster and kube-scheduler-kubemaster, as well as tigera-operator-6bfc79cb9c-v2qcx, run only momentarily and then go into CrashLoopBackOff.

Attempted solution (unsuccessful):
Updated the scheduler and controller-manager static pod manifests with the following.
command:

  • kube-controller-manager
  • --leader-elect=true
  • --leader-elect-lease-duration=30s
  • --leader-elect-renew-deadline=20s
    but the problem is still there 😞
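
As a sanity check on the flags above, the sketch below tests the relationship that leader election expects between its timing values: lease duration greater than renew deadline, and renew deadline greater than retry period times a jitter factor. The jitter factor of 1.2 and the 15s/10s/2s defaults are stated here as assumptions based on upstream Kubernetes behavior and should be confirmed against the version in use. The class and function names are hypothetical.

```python
from dataclasses import dataclass

# Assumption: jitter factor used by the leader election client.
JITTER_FACTOR = 1.2


@dataclass
class LeaderElectionTimings:
    # Defaults are assumed upstream defaults, in seconds.
    lease_duration_s: float = 15.0  # --leader-elect-lease-duration
    renew_deadline_s: float = 10.0  # --leader-elect-renew-deadline
    retry_period_s: float = 2.0     # --leader-elect-retry-period


def validate(timings):
    """Return a list of human-readable problems; empty means consistent."""
    problems = []
    if timings.lease_duration_s <= timings.renew_deadline_s:
        problems.append("lease duration must be greater than renew deadline")
    if timings.renew_deadline_s <= JITTER_FACTOR * timings.retry_period_s:
        problems.append(
            "renew deadline must be greater than retry period * jitter factor"
        )
    return problems


# Values tried in this comment: 30s lease, 20s renew deadline,
# retry period left at its assumed default.
tried = LeaderElectionTimings(lease_duration_s=30, renew_deadline_s=20)
print(validate(tried))
```

The tried values pass this check, so they are internally consistent. Leader election also runs on a single-master cluster when `--leader-elect=true`: each component still has to renew its own lease through the API server, so a slow or unresponsive API server can produce leader election errors even with only one candidate.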

Logs:
kube Manager
image

kube Manager
image

tigera-operator
image

  1. All other details of the K8s cluster and Calico VPP are exactly the same as described in the main issue.
  2. It works perfectly fine with a single-node cluster (no worker node), and with two nodes it works without Calico VPP.

I would be grateful for any clue about the trouble I am facing. I am really confused by the 'Leader Election' error, since I don't even have two master nodes.

@onong
Collaborator

onong commented Apr 8, 2024

@umarfarooq-git The logs seem to indicate that the apiserver is not responding. Could you share the apiserver logs?

@onong
Collaborator

onong commented Apr 8, 2024

@onong I want to deploy Virtual Network Functions (VNFs) on VMs inside k8s cluster by using kubevirt. These VNFs have some dependencies like DPDK and SR-IOV. I am interested to provide DPDK support to VMs using Calico VPP plugin. But not sure, If it's really possible or not...

Just so we are on the same page, in a Calico VPP cluster, the main/uplink interface is consumed by VPP using one of the supported uplink drivers (af_packet, DPDK etc) and the pods are presented with a tuntap interface.

CalicoVPP

By configuring Calico VPP to use DPDK (assuming that the NIC is DPDK supported), your pods/VNFs are indirectly using DPDK. But I guess that's not what you have in mind :)

So, could you describe your usage scenario with the VNF, DPDK and SR-IOV in the above framework? Are you looking to set up another NIC for the VNF to be consumed by DPDK?

@onong
Collaborator

onong commented Apr 8, 2024

on master host

Apr 05 17:57:11 master NetworkManager[40674]: [1712339831.1922] acd[0x5621ca6580f0,3]: conflict for address 172.10.1.5 detected with host 02:CA:11:C0:FD:00 on interface 'eth0'

on worker1 host

Apr 05 22:40:36 worker1 NetworkManager[64135]: [1712356836.6727] acd[0x55a38f87a590,3]: conflict for address 172.10.1.4 detected with host 02:CA:11:C0:FD:00 on interface 'eth0'

@ivansharamok, host with MAC 02:CA:11:C0:FD:00 seems to be causing the conflict on both the master and worker. DHCP misconfiguration? Maybe find the culprit host 02:CA:11:C0:FD:00 and shut it down?

@ivansharamok

host with MAC 02:CA:11:C0:FD:00 seems to be causing the conflict on both the master and worker. DHCP misconfiguration? Maybe find the culprit host 02:CA:11:C0:FD:00 and shut it down?

There is no host with MAC 02:CA:11:C0:FD:00 in my setup. Below is ifconfig output from 2 hosts which I'm using to test Calico VPP in my cluster.

# master host ifconfig output
[azureuser@master ~]$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.10.1.4  netmask 255.255.255.0  broadcast 172.10.1.255
        ether 00:22:48:b7:a1:9f  txqueuelen 1000  (Ethernet)
        RX packets 85927  bytes 647465868 (617.4 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 66197  bytes 57170711 (54.5 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 259210  bytes 147938219 (141.0 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 259210  bytes 147938219 (141.0 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

# worker1 host ifconfig output
[azureuser@worker1 ~]$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.10.1.5  netmask 255.255.255.0  broadcast 172.10.1.255
        inet6 fe80::20d:3aff:fef6:4fbe  prefixlen 64  scopeid 0x20<link>
        ether 00:0d:3a:f6:4f:be  txqueuelen 1000  (Ethernet)
        RX packets 170488  bytes 805979391 (768.6 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 152307  bytes 161211067 (153.7 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 775  bytes 138868 (135.6 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 775  bytes 138868 (135.6 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
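
One way to compare the addresses above with the one in the conflict warning is to look at the flag bits in the first octet of each MAC. The sketch below checks the locally administered bit (0x02) and the multicast bit (0x01); the helper name `mac_flags` is hypothetical. A locally administered address is typically assigned by software rather than burned into a NIC, which would be consistent with (though not proof of) the conflicting address belonging to a virtual interface. The NetworkManager log earlier in this thread shows eth0 reappearing as a Tun device after the takeover.

```python
def mac_flags(mac):
    """Decode the two flag bits carried in the first octet of a MAC address."""
    first_octet = int(mac.split(":")[0], 16)
    return {
        # Bit 0x02 set: locally administered (software-assigned) address.
        "locally_administered": bool(first_octet & 0x02),
        # Bit 0x01 set: group (multicast) address.
        "multicast": bool(first_octet & 0x01),
    }


# Addresses taken from the warning and the ifconfig output in this thread.
macs = {
    "conflict warning": "02:CA:11:C0:FD:00",
    "master eth0": "00:22:48:b7:a1:9f",
    "worker1 eth0": "00:0d:3a:f6:4f:be",
}

for label, mac in macs.items():
    print(label, mac, mac_flags(mac))
```

Only the address from the warning has the locally administered bit set; both eth0 addresses reported by ifconfig are universally administered.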

I noticed today that on worker1 host, I get this warn log entry in the NetworkManager log

Apr 08 23:30:32 worker1 NetworkManager[9055]: <warn>  [1712619032.7995] dns-sd-resolved[43db7271bfe3a803]: send-updates SetLinkDomains@2 failed: GDBus.Error:org.freedesktop.resolve1.NoSuchLink: Link 2 not known

Do you know if Calico VPP was ever tested on Azure VMs? I'm starting to suspect that Azure VMs of Standard_D4s_v3 size may not like it when VPP tries to take over the Azure-managed primary interface on the VM.

@onong
Collaborator

onong commented Apr 10, 2024

@ivansharamok, we have not tested with Azure VMs afaik.

There is no host with MAC 02:CA:11:C0:FD:00 in my setup. Below is ifconfig output from 2 hosts which I'm using to test Calico VPP in my cluster.

What I meant was that there might be another node (with MAC 02:CA:11:C0:FD:00) in the subnet that has been assigned the IP address of the worker/master node. Try arping and see if you get a response:

arping -I eth0 <master/worker IP addr>

@ivansharamok

I don't have any other nodes in the subnet. In my test environment I'm building all the resources from scratch with terraform. I have a dedicated VPC with only 2 Azure Compute instances within the VPC.

# arping on master node
[azureuser@master ~]$ arping -c2 -I eth0 172.10.1.5
ARPING 172.10.1.5 from 172.10.1.4 eth0
Unicast reply from 172.10.1.5 [12:34:56:78:9A:BC]  0.864ms
Unicast reply from 172.10.1.5 [12:34:56:78:9A:BC]  0.964ms
Sent 2 probes (1 broadcast(s))
Received 2 response(s)

# arping on worker1 node
[azureuser@worker1 ~]$ arping -c2 -I eth0 172.10.1.4
ARPING 172.10.1.4 from 172.10.1.5 eth0
Unicast reply from 172.10.1.4 [12:34:56:78:9A:BC]  1.031ms
Unicast reply from 172.10.1.4 [12:34:56:78:9A:BC]  1.139ms
Sent 2 probes (1 broadcast(s))
Received 2 response(s)
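
In the output above, both IP addresses are answered by the same MAC (12:34:56:78:9A:BC), which matches neither host's eth0 address from the earlier ifconfig output. The sketch below flags that pattern automatically; the function names are hypothetical and the regular expression assumes the reply format shown above. A plausible reading, stated as an assumption, is that the platform's virtual network answers ARP on behalf of the VMs, so arping cannot identify the real owner of an address in this environment.

```python
import re
from collections import defaultdict

# Matches reply lines in the arping output format shown above.
REPLY_RE = re.compile(
    r"reply from (?P<ip>\d+\.\d+\.\d+\.\d+) \[(?P<mac>[0-9A-Fa-f:]{17})\]"
)


def macs_by_ip(text):
    """Map each probed IP to the set of MACs that replied for it."""
    seen = defaultdict(set)
    for match in REPLY_RE.finditer(text):
        seen[match.group("ip")].add(match.group("mac").lower())
    return dict(seen)


def shared_macs(ip_to_macs):
    """Return the MACs that answered for more than one IP address."""
    owners = defaultdict(set)
    for ip, macs in ip_to_macs.items():
        for mac in macs:
            owners[mac].add(ip)
    return {mac: ips for mac, ips in owners.items() if len(ips) > 1}


# Reply lines copied from the two arping runs above.
output = """\
Unicast reply from 172.10.1.5 [12:34:56:78:9A:BC]  0.864ms
Unicast reply from 172.10.1.5 [12:34:56:78:9A:BC]  0.964ms
Unicast reply from 172.10.1.4 [12:34:56:78:9A:BC]  1.031ms
Unicast reply from 172.10.1.4 [12:34:56:78:9A:BC]  1.139ms
"""

print(shared_macs(macs_by_ip(output)))
```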

@umarfarooq-git
Author

@onong Thank you for responding.
The issue I was facing while setting up Calico VPP is resolved; the problem was a shortage of memory on the master node. I got the clue from the following question.
https://stackoverflow.com/questions/75148975/leaderelections-failing-lease-unable-to-be-renewed-automatically

@umarfarooq-git
Author

Regarding Calico VPP support for VMs inside a k8s cluster: my goal is exactly what you mentioned. I have a device with more than one NIC (with DPDK support). I want to run a VNF (VM-based network function) on that device using kubevirt inside a K8s cluster, and I want to use Calico VPP with DPDK drivers to accelerate network traffic for that VNF through one of the available NICs.

@onong
Collaborator

onong commented Apr 25, 2024

@umarfarooq-git you might want to go through the multinet doc and see if it matches what you are looking for?

https://github.com/projectcalico/vpp-dataplane/blob/master/docs/multinet.md

@onong
Collaborator

onong commented Apr 25, 2024

@ivansharamok, sorry for the delayed response. The warning log msg around address conflict in the NM logs is probably ok given that azure networking differs from conventional networking somewhat. Sorry for making you chase that lead :(

However, if the network connectivity between the two nodes is ok then things should work fine. But like I mentioned earlier, we have not tested on azure so can't say for sure. The issue you are seeing is probably due to some quirk in azure networking.

@onong onong self-assigned this Apr 25, 2024