My pet Kubernetes cluster had a spontaneous outage yesterday. Well, what else would you expect when the data center has a power failure?
I found time to look into the matter today. First of all, I checked the cluster nodes - hm, OK, alive and happy.
The kube-system namespace was not doing as well: the flannel and coredns deployments were failing, their pods already in crash-loop back-off.
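For reference, the checks above boil down to a few kubectl invocations (assuming kubectl is already configured against the cluster; the pod name is a placeholder):

```shell
# Node health: all nodes should report Ready
kubectl get nodes

# Workloads in kube-system: failing pods show up as CrashLoopBackOff here
kubectl -n kube-system get pods

# Details and recent events for a failing pod (substitute a real pod name)
kubectl -n kube-system describe pod <pod-name>
```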
I managed to uncover a number of issues (systemd mount units for bind mounts getting triggered too late, etc.) but failed to resolve the outage at first - even after completely purging and re-installing the latest flannel (kubectl delete ..., kubectl apply ...).
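As an illustration of the bind-mount ordering problem: a bind mount that kubelet depends on can be ordered explicitly in its systemd mount unit, so it is in place before kubelet starts. A minimal sketch, with hypothetical paths (systemd requires the unit file name to match the mount point, so /var/lib/kubelet becomes var-lib-kubelet.mount):

```ini
# /etc/systemd/system/var-lib-kubelet.mount (hypothetical example)
[Unit]
# Ensure the source filesystem is mounted before the bind mount
RequiresMountsFor=/data/kubelet
# Make kubelet wait for this mount
Before=kubelet.service

[Mount]
What=/data/kubelet
Where=/var/lib/kubelet
Type=none
Options=bind

[Install]
WantedBy=local-fs.target
```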
Finally, the kubelet logs on one of the nodes revealed the actual root cause: systemd-resolved was not running - and apparently it is neither enabled nor started by default!
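The fix itself is a one-liner, plus a sanity check that kubelet points at the resolved-managed resolv.conf (the config path shown is the common kubeadm default and may differ on other setups):

```shell
# Enable systemd-resolved for future boots and start it right now
systemctl enable --now systemd-resolved

# Verify it is actually running
systemctl is-active systemd-resolved

# kubelet should use the stub-free resolv.conf managed by resolved, e.g.:
#   resolvConf: /run/systemd/resolve/resolv.conf
grep resolvConf /var/lib/kubelet/config.yaml
```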
Never mind the greetings from the "how did this ever work!?" department...
After adding the necessary steps to my k8s management fabric and running the task, things automagically started to settle :)
And again: it's always DNS.