
A Roundup of Problems When Building a Kubernetes Cluster with kubeadm

While running the kubeadm init command I hit a number of problems. This post collects them, together with their fixes, for reference.

Problem 1: kubeadm config images pull fails with "pulling image: rpc error: ... dial unix /var/run/containerd/containerd.sock: connect: permission denied"

Problem description

The image repository had already been set to the Aliyun mirror, but pulling images still failed with permission denied.
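The permission denied itself comes from the containerd socket, which by default is owned by root and not accessible to a regular user, so a pull run without sudo fails at the socket before any registry is contacted. A quick check (the sample output below is only illustrative of a typical default):

ls -l /var/run/containerd/containerd.sock
# typically something like: srw-rw---- 1 root root 0 Oct 10 14:50 /var/run/containerd/containerd.sock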

[shirley@master k8s_install]$ kubeadm config images pull --config kubeadm.yaml
failed to pull image "registry.aliyuncs.com/google_containers/kube-apiserver:v1.28.0": "PullImage from image service failed" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /var/run/containerd/containerd.sock: connect: permission denied\"" image="registry.aliyuncs.com/google_containers/kube-apiserver:v1.28.0"
time="2023-10-10T14:56:54+08:00" level=fatal msg="pulling image: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /var/run/containerd/containerd.sock: connect: permission denied\""
, error: exit status 1
To see the stack trace of this error execute with --v=5 or higher

Solution:

I troubleshot following the approach described at techglimpse.com/failed- .

1. Verify that the network is configured correctly, and check whether an HTTP_PROXY is set.
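A quick way to check both the shell environment and the containerd service for proxy settings (assuming containerd runs under systemd):

# proxy variables in the current shell
env | grep -i proxy
# proxy environment configured on the containerd service itself
systemctl show containerd --property=Environment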

2. Kubernetes manages the CRI with the crictl command; check its configuration file /etc/crictl.yaml. The file does not exist by default, and creating it is recommended here, otherwise kubeadm init will fail later with other errors (see Problem 5).

cat > /etc/crictl.yaml <<EOF
runtime-endpoint: unix:///var/run/containerd/containerd.sock
image-endpoint: unix:///var/run/containerd/containerd.sock
timeout: 0
debug: false
pull-image-on-create: false
EOF

3. Check the configuration file /etc/containerd/config.toml and comment out the line disabled_plugins = ["cri"]. As the name says, this setting disables the CRI plugin, which is exactly what the pull error is tripping over.

# disabled_plugins = ["cri"]
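To make the same change non-interactively, a sed one-liner can be used (a sketch assuming the line appears uncommented at the start of a line, as in the default file):

sudo sed -i 's/^disabled_plugins = \["cri"\]/# disabled_plugins = ["cri"]/' /etc/containerd/config.toml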

If /etc/containerd/config.toml does not exist, generate a default one with:

containerd config default > /etc/containerd/config.toml

Restart containerd:

sudo systemctl restart containerd
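After the restart the pull should go through. As a quick check, verify that crictl can reach containerd and retry the pull (kubeadm.yaml is the same config file used above):

sudo crictl info | head
sudo kubeadm config images pull --config kubeadm.yaml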

Problem solved.

Problem 2: kubelet fails with "failed to run Kubelet: running with swap on is not supported"

Problem description

Running sudo kubeadm init --config kubeadm.yaml failed with a kubelet error. The kubelet log shows: failed to run Kubelet: running with swap on is not supported, please disable swap!

[root@master ~]# journalctl -f -ukubelet
... ...
Oct 10 16:18:35 master.k8s kubelet[2079]: I1010 16:18:35.432021    2079 server.go:725] "--cgroups-per-qos enabled, but --cgroup-root was not specified.  defaulting to /"
Oct 10 16:18:35 master.k8s kubelet[2079]: E1010 16:18:35.432363    2079 run.go:74] "command failed" err="failed to run Kubelet: running with swap on is not supported, please disable swap! or set --fail-swap-on flag to false. /proc/swaps contained: [Filename\t\t\t\tType\t\tSize\tUsed\tPriority /dev/dm-1                               partition\t2097148\t0\t-2]"
Oct 10 16:18:35 master.k8s systemd[1]: kubelet.service: main process exited, code=exited, status=1/FAILURE
Oct 10 16:18:35 master.k8s systemd[1]: Unit kubelet.service entered failed state.
Oct 10 16:18:35 master.k8s systemd[1]: kubelet.service failed.
... ...

Solution

# Disable swap for the current boot
sudo swapoff -a
# Permanently prevent swap from being mounted at boot
sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
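To confirm swap is actually off (swapon --show prints nothing when no swap is active):

swapon --show
free -h | grep -i swap

kubelet under kubeadm is normally set up to restart automatically, so it should come back on its own; if not, restart it manually with sudo systemctl restart kubelet.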

Problem solved. Checking kubelet's status now shows the service running.

[root@master ~]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Tue 2023-10-10 16:26:16 CST; 1min 4s ago
     Docs: https://kubernetes.io/docs/
 Main PID: 2532 (kubelet)
    Tasks: 11
   Memory: 32.1M
   CGroup: /system.slice/kubelet.service
           └─2532 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/l...
Oct 10 16:27:14 master.k8s kubelet[2532]: E1010 16:27:14.927100    2532 kuberuntime_manager.go:1166] "CreatePodSandbox for pod failed" err="rpc e....k8s.io/p

Problem 3: kubeadm init reports that some configuration files already exist

Problem description

[shirley@master k8s_install]$ sudo kubeadm init --config kubeadm.yaml
[sudo] password for shirley:
[init] Using Kubernetes version: v1.28.0
[preflight] Running pre-flight checks
        [WARNING Hostname]: hostname "node" could not be reached
        [WARNING Hostname]: hostname "node": lookup node on 192.168.246.2:53: server misbehaving
error execution phase preflight: [preflight] Some fatal errors occurred:
        [ERROR FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml]: /etc/kubernetes/manifests/kube-apiserver.yaml already exists
        [ERROR FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml]: /etc/kubernetes/manifests/kube-controller-manager.yaml already exists
        [ERROR FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml]: /etc/kubernetes/manifests/kube-scheduler.yaml already exists
        [ERROR FileAvailable--etc-kubernetes-manifests-etcd.yaml]: /etc/kubernetes/manifests/etcd.yaml already exists
        [ERROR Port-10250]: Port 10250 is in use
        [ERROR FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables does not exist
        [ERROR FileContent--proc-sys-net-ipv4-ip_forward]: /proc/sys/net/ipv4/ip_forward contents are not set to 1

Solution: kubeadm reset

As the log shows, several configuration files already exist: an earlier kubeadm init run had errored out partway through, after these files had already been generated. Undo the previous run with kubeadm reset, as follows:

[shirley@master k8s_install]$ sudo kubeadm reset
[reset] Reading configuration from the cluster...
[reset] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
W1010 16:34:16.187161    2705 reset.go:120] [reset] Unable to fetch the kubeadm-config ConfigMap from cluster: failed to get config map: Get "https://192.168.246.133:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config?timeout=10s": dial tcp 192.168.246.133:6443: connect: connection refused
W1010 16:34:16.187828    2705 preflight.go:56] [reset] WARNING: Changes made to this host by 'kubeadm init' or 'kubeadm join' will be reverted.
[reset] Are you sure you want to proceed? [y/N]: y
[preflight] Running pre-flight checks
W1010 16:34:41.266029    2705 removeetcdmember.go:106] [reset] No kubeadm config, using etcd pod spec to get data directory
[reset] Stopping the kubelet service
[reset] Unmounting mounted directories in "/var/lib/kubelet"
[reset] Deleting contents of directories: [/etc/kubernetes/manifests /var/lib/kubelet /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]
The reset process does not clean CNI configuration. To do so, you must remove /etc/cni/net.d
The reset process does not reset or clean up iptables rules or IPVS tables.
If you wish to reset iptables, you must do so manually by using the "iptables" command.
If your cluster was setup to utilize IPVS, run ipvsadm --clear (or similar)
to reset your system's IPVS tables.
The reset process does not clean your kubeconfig files and you must remove them manually.
Please, check the contents of the $HOME/.kube/config file.
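As the reset output itself notes, kubeadm reset does not clean up the CNI configuration, iptables/IPVS rules, or kubeconfig files. A sketch of the manual cleanup it suggests (run only what applies to your setup; the iptables flush removes all rules on the host):

# CNI configuration
sudo rm -rf /etc/cni/net.d
# iptables rules
sudo iptables -F && sudo iptables -t nat -F && sudo iptables -t mangle -F && sudo iptables -X
# IPVS tables, only if IPVS was in use
sudo ipvsadm --clear
# leftover kubeconfig
rm -f $HOME/.kube/config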

Problem 4: kubeadm init reports IPv4-related errors

[shirley@master k8s_install]$ sudo kubeadm init --config kubeadm.yaml
[init] Using Kubernetes version: v1.28.0
[preflight] Running pre-flight checks
        [WARNING Hostname]: hostname "node" could not be reached
        [WARNING Hostname]: hostname "node": lookup node on 192.168.246.2:53: server misbehaving
error execution phase preflight: [preflight] Some fatal errors occurred:
        [ERROR FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables does not exist
        [ERROR FileContent--proc-sys-net-ipv4-ip_forward]: /proc/sys/net/ipv4/ip_forward contents are not set to 1
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher

Solution

The fix is to load the br_netfilter module (plus the IPVS modules that kube-proxy can use) and set the required kernel parameters, as follows. Root privileges are needed.

# Load the br_netfilter and IPVS kernel modules
modprobe br_netfilter
modprobe -- ip_vs
modprobe -- ip_vs_sh
modprobe -- ip_vs_rr
modprobe -- ip_vs_wrr
modprobe -- nf_conntrack_ipv4
# Verify the ip_vs modules are loaded
lsmod |grep ip_vs
ip_vs_wrr              12697  0 
ip_vs_rr               12600  0 
ip_vs_sh               12688  0 
ip_vs                 145458  6 ip_vs_rr,ip_vs_sh,ip_vs_wrr
nf_conntrack          139264  2 ip_vs,nf_conntrack_ipv4
libcrc32c              12644  3 xfs,ip_vs,nf_conntrack
# Write the kernel parameters
cat <<EOF >  /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
vm.max_map_count = 262144
EOF
# Apply and verify the kernel parameters
sysctl -p /etc/sysctl.d/k8s.conf
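Note that modprobe only loads modules for the current boot. To have them loaded automatically after a reboot (assuming a systemd-based distribution, which reads /etc/modules-load.d/):

cat <<EOF > /etc/modules-load.d/k8s.conf
br_netfilter
ip_vs
ip_vs_sh
ip_vs_rr
ip_vs_wrr
nf_conntrack_ipv4
EOF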

Problem 5: during kubeadm init, kubelet fails because crictl's runtime endpoint is misconfigured

kubeadm init reports the following error:

... ...
This error is likely caused by:
        - The kubelet is not running
        - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
        - 'systemctl status kubelet'
        - 'journalctl -xeu kubelet'
Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all running Kubernetes containers by using crictl:
        - 'crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
        Once you have found the failing container, you can inspect its logs with:
        - 'crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock logs CONTAINERID'
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
To see the stack trace of this error execute with --v=5 or higher

From the log you can tell something is wrong on the crictl side. Running a crictl command directly reproduces the error:

[root@master k8s_install]# crictl images list
WARN[0000] image connect using default endpoints: [unix:///var/run/dockershim.sock unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
E1010 17:19:18.816289    3832 remote_image.go:119] "ListImages with filter from image service failed" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /var/run/dockershim.sock: connect: no such file or directory\"" filter="&ImageFilter{Image:&ImageSpec{Image:list,Annotations:map[string]string{},},}"
FATA[0000] listing images: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/dockershim.sock: connect: no such file or directory"

The cause is that crictl falls back to its default endpoints [unix:///var/run/dockershim.sock unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. These defaults are deprecated, so the containerd.sock endpoint has to be configured explicitly; the fatal error above is crictl failing to find dockershim.sock, the first endpoint in that list.

Solution: set the crictl configuration file

cat > /etc/crictl.yaml <<EOF
runtime-endpoint: unix:///var/run/containerd/containerd.sock
image-endpoint: unix:///var/run/containerd/containerd.sock
timeout: 0
debug: false
pull-image-on-create: false
EOF
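As an alternative to the config file, the endpoints can also be passed per invocation, using the same flags shown in the kubeadm hint:

crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock --image-endpoint unix:///var/run/containerd/containerd.sock images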

Running crictl images list again no longer errors out:

[root@master ~]# crictl images list
IMAGE                                                             TAG                 IMAGE ID            SIZE
registry.aliyuncs.com/google_containers/coredns                   v1.10.1             ead0a4a

Problem 6: failure to pull the pause image

Problem description and troubleshooting

kubeadm init fails with "The kubelet is not running":

Unfortunately, an error has occurred:
        timed out waiting for the condition
This error is likely caused by:
        - The kubelet is not running
        - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
        - 'systemctl status kubelet'
        - 'journalctl -xeu kubelet'
Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all running Kubernetes containers by using crictl:
        - 'crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
        Once you have found the failing container, you can inspect its logs with:
        - 'crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock logs CONTAINERID'
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
To see the stack trace of this error execute with --v=5 or higher

Following the hint in the log and running crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock ps -a shows that no containers are running at all. The containerd log contains the following error:

[root@master ~]# journalctl -fu containerd
Oct 11 08:35:16 master.k8s containerd[1903]: time="2023-10-11T08:35:16.760026536+08:00" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:kube-apiserver-node,Uid:a5a7c15a42701ab6c9dca630e6523936,Namespace:kube-system,Attempt:0,} failed, error" error="failed to get sandbox image \"registry.k8s.io/pause:3.6\": failed to pull image \"registry.k8s.io/pause:3.6\": failed to pull and unpack image \"registry.k8s.io/pause:3.6\": failed to resolve reference \"registry.k8s.io/pause:3.6\": failed to do request: Head \"https://asia-east1-docker.pkg.dev/v2/k8s-artifacts-prod/images/pause/manifests/3.6\": dial tcp 108.177.125.82:443: connect: connection refused"
Oct 11 08:35:18 master.k8s containerd[1903]: time="2023-10-11T08:35:18.606581001+08:00" level=info msg="trying next host" error="failed to do request: Head \"https://asia-east1-docker.pkg.dev/v2/k8s-artifacts-prod/images/pause/manifests/3.6\": dial tcp 108.177.125.82:443: connect: connection refused" host=registry.k8s.io
...

The error shows that containerd failed to pull the sandbox image: error="failed to get sandbox image \"registry.k8s.io/pause:3.6\""

Solution: modify the containerd configuration

Inspect the current containerd configuration with containerd config dump:

[root@master k8s_install]# containerd config dump
... ...
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    sandbox_image = "registry.k8s.io/pause:3.6"
    selinux_category_range = 1024
... ...

The configuration sets the pause image repo to registry.k8s.io/pause:3.6, while the image actually present locally is registry.aliyuncs.com/google_containers/pause:3.9:

[root@master k8s_install]# crictl images list | grep pause
registry.aliyuncs.com/google_containers/pause                     3.9                 e6f1816883972       322kB

Run containerd config dump > /etc/containerd/config.toml to export the current configuration to the config file, then change the sandbox_image setting:

## Export the current configuration to the config file
containerd config dump > /etc/containerd/config.toml
## Edit /etc/containerd/config.toml and change the sandbox_image setting
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.9"
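After editing the file, restart containerd, confirm the new sandbox image setting took effect, and rerun kubeadm init:

sudo systemctl restart containerd
containerd config dump | grep sandbox_image
# expected: sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.9"
sudo kubeadm init --config kubeadm.yaml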