Problem with k0s on bare metal

Hello,

I’m struggling with the installation of k0s on bare metal servers. I did a few tests on VPSes without any issue. On bare metal I’m running into the following set of issues (which I guess share the same root cause):

  • I cannot connect to the CoreDNS pods through the service (tested with nc -v -z -w 3 10.96.0.10 53; whether it works seems to depend on which pod I get redirected to)
  • the metrics server cannot scrape some nodes (I can see a lot of messages like E1225 22:05:32.032237 1 scraper.go:140] "Failed to scrape node" err="Get \"https://135.125.x9.x3:10250/metrics/resource\": context deadline exceeded" node="fra1", and I’m not able to reach that port from the node where the metrics-server pod is running)

I cannot see any errors in the konnectivity pods, k0scontroller or k0sworker. I’m running out of ideas on how to resolve it :frowning:

Thank you for any advice.

Here is my k0sctl config file (some sensitive data has been obfuscated):

apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: k0s-ovh
spec:
  hosts:
    - ssh:
        address: 1.2.3.4
        user: xxxx
        port: 22
        keyPath: xxx
      role: controller
    - ssh:
        address: 1.2.3.4
        user: xxxx
        port: 22
        keyPath: xxx
      role: worker
    - ssh:
        address: 1.2.3.4
        user: xxxx
        port: 22
        keyPath: xxx
      role: worker
    - ssh:
        address: 1.2.3.4
        user: xxxx
        port: 22
        keyPath: xxx
      role: worker
    - ssh:
        address: 1.2.3.4
        user: xxxx
        port: 22
        keyPath: xxx
      role: worker
  k0s:
    version: v1.28.4+k0s.0
    dynamicConfig: true
    config:
      apiVersion: k0s.k0sproject.io/v1beta1
      kind: ClusterConfig
      metadata:
        name: k0s-ovh
      spec:
        api:
          extraArgs:
            service-node-port-range: "80-32767"

So from the worker node, you cannot connect to CoreDNS?

My first check would be firewalls: do you have any firewall running on the nodes?

If you do, you should allow the pod and service CIDRs on the firewall, and also open some common ports; see more at Networking (CNI) - Documentation.
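As a rough sketch, assuming ufw and the default k0s CIDRs (10.244.0.0/16 for pods, 10.96.0.0/12 for services, which match the IPs in your output), something like this on every node would be a first test; adjust if you manage firewalld/nftables directly, and check the linked docs for the full port list:

# allow traffic from the pod and service CIDRs (k0s defaults shown, adjust to your cluster)
ufw allow from 10.244.0.0/16
ufw allow from 10.96.0.0/12
# inter-node ports (not exhaustive): kube-apiserver, kubelet,
# kube-router BGP, konnectivity, k0s controller join API
ufw allow 6443/tcp
ufw allow 10250/tcp
ufw allow 179/tcp
ufw allow 8132/tcp
ufw allow 9443/tcp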

Yes, I can reach neither of the pods (there are 2) nor the svc IP:

# kubectl -n kube-system get pods,svc -o wide -l k8s-app=kube-dns
NAME                           READY   STATUS    RESTARTS   AGE   IP           NODE   NOMINATED NODE   READINESS GATES
pod/coredns-85df575cdb-j4nxt   1/1     Running   0          25h   10.244.3.3   fra2   <none>           <none>
pod/coredns-85df575cdb-pxdl7   1/1     Running   0          25h   10.244.1.6   fra1   <none>           <none>

NAME               TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE   SELECTOR
service/kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   25h   k8s-app=kube-dns

################################################################
# kubectl get pods -o wide
NAME         READY   STATUS    RESTARTS   AGE     IP            NODE   NOMINATED NODE   READINESS GATES
test-797kn   1/1     Running   0          3m34s   10.244.0.20   gra2   <none>           <none>
test-d67c6   1/1     Running   0          3m34s   10.244.1.20   fra1   <none>           <none>
test-ngz67   1/1     Running   0          3m34s   10.244.3.21   fra2   <none>           <none>
test-tkh98   1/1     Running   0          3m34s   10.244.2.22   gra1   <none>           <none>

################################################################
## problematic node:
# kubectl exec -it test-797kn -- sh
/ # nc -v -z -w 2 10.244.3.3 53
nc: 10.244.3.3 (10.244.3.3:53): Operation timed out
/ # nc -v -z -w 2 10.244.1.6 53
nc: 10.244.1.6 (10.244.1.6:53): Operation timed out
/ # nc -v -z -w 2 10.96.0.10 53
nc: 10.96.0.10 (10.96.0.10:53): Operation timed out

################################################################
## working node
# kubectl  exec -it test-d67c6 -- sh
/ # nc -v -z -w 2 10.244.3.3 53
10.244.3.3 (10.244.3.3:53) open
/ # nc -v -z -w 2 10.244.1.6 53
10.244.1.6 (10.244.1.6:53) open
/ # nc -v -z -w 2 10.96.0.10 53
10.96.0.10 (10.96.0.10:53) open

Exactly the same situation directly from the nodes:

vojbarz@fra1 ~> nc -v -z -w 2 10.244.3.3 53
Connection to 10.244.3.3 53 port [tcp/domain] succeeded!
vojbarz@fra1 ~> nc -v -z -w 2 10.244.1.6 53
Connection to 10.244.1.6 53 port [tcp/domain] succeeded!
vojbarz@fra1 ~> nc -v -z -w 2 10.96.0.10 53
Connection to 10.96.0.10 53 port [tcp/domain] succeeded!

###########################################
vojbarz@gra2 ~> nc -v -z -w 2 10.244.3.3 53
nc: connect to 10.244.3.3 port 53 (tcp) timed out: Operation now in progress
vojbarz@gra2 ~ [1]> nc -v -z -w 2 10.244.1.6 53
nc: connect to 10.244.1.6 port 53 (tcp) timed out: Operation now in progress
vojbarz@gra2 ~ [1]> nc -v -z -w 2 10.96.0.10 53
nc: connect to 10.96.0.10 port 53 (tcp) timed out: Operation now in progress

Just curious, does it work from node gra1? Both pods are running on nodes in the “fra” network. Maybe the issues you’re facing are connected to inter-network traffic between the “gra” and “fra” networks. What happens if the CoreDNS pods run on gra1 and gra2? Does this break connectivity from the “fra” nodes?
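One way to run that test (just a sketch, assuming the stock coredns Deployment in kube-system): temporarily cordon the fra workers, restart the deployment so the replicas land on the gra nodes, and uncordon afterwards:

kubectl cordon fra1 fra2
kubectl -n kube-system rollout restart deployment/coredns
kubectl -n kube-system get pods -o wide -l k8s-app=kube-dns   # check the pods moved to gra1/gra2
kubectl uncordon fra1 fra2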

No, the same as from gra2.

Looks like there is some networking issue in the gra zone. I’m guessing this based on the fact that if I move the pod to gra1, I’m able to reach it only from gra1, not from gra2 nor from the fra nodes.

Is there a way to find out what is wrong? Standard networking (TCP/UDP) works fine between the nodes.

/xref Cannot exec,logs/top to pods on some nodes · Issue #3784 · k0sproject/k0s · GitHub

Does anybody know how to find out what is not working?

Hi Vojbarzz,
Can you please verify that the nodes can communicate with the nodes in the other zone on port TCP/179?
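For example, from each node towards the nodes in the other zone (node IPs below are placeholders); 179 is the port kube-router, the default k0s CNI, uses for BGP peering:

nc -v -z -w 2 <fra-node-ip> 179
nc -v -z -w 2 <gra-node-ip> 179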

If they have connectivity, we’ll probably need to do a traceroute to see which way the traffic goes and where it gets lost. This can be tricky without actually acquiring tcpdumps…
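As a starting point, from one of the gra nodes towards the CoreDNS pod on fra2 (pod IP taken from your output above), something like:

traceroute -n 10.244.3.3
ip route get 10.244.3.3   # which next hop / interface the node would use for the fra pod network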

Yes, all nodes can reach the others on TCP/179 (tested using netcat).

I can run tcpdumps. Can you help me with the scenario, i.e. where and what to capture?