Troubleshooting a kubeadm-Based Kubernetes Cluster Setup: Problem Summary
While running kubeadm init I ran into a number of problems. Here is a collection of those problems and their fixes, for reference.
Problem 1: kubeadm config images pull fails with pulling image: rpc error: ... dial unix /var/run/containerd/containerd.sock: connect: permission denied
Problem description
The image repository has already been set to the Aliyun mirror address, but pulling the images fails with permission denied.
[shirley@master k8s_install]$ kubeadm config images pull --config kubeadm.yaml
failed to pull image "registry.aliyuncs.com/google_containers/kube-apiserver:..." ... err="rpc error: code = Unavailable desc = ... dial unix /var/run/containerd/containerd.sock: connect: permission denied"
time="2023-10-10T14:56:54+08:00" level=fatal msg="pulling image: rpc error: ... dial unix /var/run/containerd/containerd.sock: connect: permission denied"
, error: exit status 1
To see the stack trace of this error execute with --v=5 or higher
Solution:
Troubleshoot following the approach described at https://techglimpse.com/failed-pull-image-registry-kube-apiserver/.
1. Verify that the network is configured correctly and check whether an HTTP_PROXY is set (a quick check is shown below).
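A quick way to look for proxy settings that can interfere with image pulls (assuming containerd is managed by systemd):
# Check proxy variables in the current shell
env | grep -i proxy
# Check whether containerd itself was started with proxy-related environment variables
systemctl show --property=Environment containerd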
2. Kubernetes manages the CRI through the crictl command; check its configuration file /etc/crictl.yaml. It does not exist by default, and it is recommended to create it now, otherwise kubeadm init will fail later with other errors.
cat > /etc/crictl.yaml <<EOF
runtime-endpoint: unix:///var/run/containerd/containerd.sock
image-endpoint: unix:///var/run/containerd/containerd.sock
timeout: 0
debug: false
pull-image-on-create: false
EOF
3. Check the configuration file /etc/containerd/config.toml and comment out the line disabled_plugins = ["cri"]. As the name suggests, this setting disables the CRI plugin, which is exactly what the error complains about.
# disabled_plugins = ["cri"]
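If you prefer to make the change non-interactively, a one-line sketch (assuming the line appears exactly as shipped in the default package config):
sudo sed -i 's/^disabled_plugins = \["cri"\]/# disabled_plugins = ["cri"]/' /etc/containerd/config.toml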
If /etc/containerd/config.toml does not exist, generate it with:
containerd config default > /etc/containerd/config.toml
Restart containerd:
sudo systemctl restart containerd
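With the CRI plugin enabled and containerd restarted, re-run the pull. Note that the containerd socket is root-owned by default, so running the command as root (or via sudo) also avoids the permission denied error:
sudo kubeadm config images pull --config kubeadm.yaml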
Problem solved.
Problem 2: kubelet fails with failed to run Kubelet: running with swap on is not supported
Problem description
Running sudo kubeadm init --config kubeadm.yaml reports that kubelet has failed. The kubelet log shows:
failed to run Kubelet: running with swap on is not supported, please disable swap!
[root@master ~]# journalctl -f -ukubelet
... ...
Oct 10 16:18:35 master.k8s kubelet[2079]: I1010 16:18:35.432021 2079 server.go:725] "--cgroups-per-qos enabled, but --cgroup-root was not specified. defaulting to /"
Oct 10 16:18:35 master.k8s kubelet[2079]: E1010 16:18:35.432363 2079 run.go:74] "command failed" err="failed to run Kubelet: running with swap on is not supported, please disable swap! or set --fail-swap-on flag to false. /proc/swaps contained: [Filename\t\t\t\tType\t\tSize\tUsed\tPriority /dev/dm-1 partition\t2097148\t0\t-2]"
Oct 10 16:18:35 master.k8s systemd[1]: kubelet.service: main process exited, code=exited, status=1/FAILURE
Oct 10 16:18:35 master.k8s systemd[1]: Unit kubelet.service entered failed state.
Oct 10 16:18:35 master.k8s systemd[1]: kubelet.service failed.
... ...
Solution
# Temporarily disable swap
sudo swapoff -a
# Permanently prevent swap from being mounted at boot
sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
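To confirm that swap is off now and will stay off after a reboot:
# Prints nothing when swap is disabled
swapon --show
# The swap entry in /etc/fstab should now be commented out
grep swap /etc/fstab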
Problem solved. Checking the kubelet status shows that the service is now running:
[root@master ~]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Tue 2023-10-10 16:26:16 CST; 1min 4s ago
Docs: https://kubernetes.io/docs/
Main PID: 2532 (kubelet)
Tasks: 11
Memory: 32.1M
CGroup: /system.slice/kubelet.service
└─2532 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/l...
Oct 10 16:27:14 master.k8s kubelet[2532]: E1010 16:27:14.927100 2532 kuberuntime_manager.go:1166] "CreatePodSandbox for pod failed" err="rpc e....k8s.io/p
Problem 3: kubeadm init reports that some configuration files already exist
Problem description
[shirley@master k8s_install]$ sudo kubeadm init --config kubeadm.yaml
[sudo] password for shirley:
[init] Using Kubernetes version: v1.28.0
[preflight] Running pre-flight checks
[WARNING Hostname]: hostname "node" could not be reached
[WARNING Hostname]: hostname "node": lookup node on 192.168.246.2:53: server misbehaving
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml]: /etc/kubernetes/manifests/kube-apiserver.yaml already exists
[ERROR FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml]: /etc/kubernetes/manifests/kube-controller-manager.yaml already exists
[ERROR FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml]: /etc/kubernetes/manifests/kube-scheduler.yaml already exists
[ERROR FileAvailable--etc-kubernetes-manifests-etcd.yaml]: /etc/kubernetes/manifests/etcd.yaml already exists
[ERROR Port-10250]: Port 10250 is in use
[ERROR FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables does not exist
[ERROR FileContent--proc-sys-net-ipv4-ip_forward]: /proc/sys/net/ipv4/ip_forward contents are not set to 1
Solution: kubeadm reset
The log shows that several configuration files already exist. The earlier kubeadm run failed and exited partway through init, after those files had already been generated. Undo the previous run with kubeadm reset, as follows:
[shirley@master k8s_install]$ sudo kubeadm reset
[reset] Reading configuration from the cluster...
[reset] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
W1010 16:34:16.187161 2705 reset.go:120] [reset] Unable to fetch the kubeadm-config ConfigMap from cluster: failed to get config map: Get "https://192.168.246.133:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config?timeout=10s": dial tcp 192.168.246.133:6443: connect: connection refused
W1010 16:34:16.187828 2705 preflight.go:56] [reset] WARNING: Changes made to this host by 'kubeadm init' or 'kubeadm join' will be reverted.
[reset] Are you sure you want to proceed? [y/N]: y
[preflight] Running pre-flight checks
W1010 16:34:41.266029 2705 removeetcdmember.go:106] [reset] No kubeadm config, using etcd pod spec to get data directory
[reset] Stopping the kubelet service
[reset] Unmounting mounted directories in "/var/lib/kubelet"
[reset] Deleting contents of directories: [/etc/kubernetes/manifests /var/lib/kubelet /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]
The reset process does not clean CNI configuration. To do so, you must remove /etc/cni/net.d
The reset process does not reset or clean up iptables rules or IPVS tables.
If you wish to reset iptables, you must do so manually by using the "iptables" command.
If your cluster was setup to utilize IPVS, run ipvsadm --clear (or similar)
to reset your system's IPVS tables.
The reset process does not clean your kubeconfig files and you must remove them manually.
Please, check the contents of the $HOME/.kube/config file.
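As the output points out, kubeadm reset leaves the CNI configuration and the local kubeconfig behind; if the earlier run created them, a minimal cleanup sketch:
sudo rm -rf /etc/cni/net.d
rm -f $HOME/.kube/config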
Problem 4: kubeadm init reports ipv4-related preflight errors
[shirley@master k8s_install]$ sudo kubeadm init --config kubeadm.yaml
[init] Using Kubernetes version: v1.28.0
[preflight] Running pre-flight checks
[WARNING Hostname]: hostname "node" could not be reached
[WARNING Hostname]: hostname "node": lookup node on 192.168.246.2:53: server misbehaving
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables does not exist
[ERROR FileContent--proc-sys-net-ipv4-ip_forward]: /proc/sys/net/ipv4/ip_forward contents are not set to 1
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
Solution
Load the br_netfilter and ipvs kernel modules and apply the sysctl settings below to fix the errors above; root privileges are required.
# Load the br_netfilter and ipvs modules
modprobe br_netfilter
modprobe -- ip_vs
modprobe -- ip_vs_sh
modprobe -- ip_vs_rr
modprobe -- ip_vs_wrr
modprobe -- nf_conntrack_ipv4
# Verify the ip_vs modules
lsmod |grep ip_vs
ip_vs_wrr 12697 0
ip_vs_rr 12600 0
ip_vs_sh 12688 0
ip_vs 145458 6 ip_vs_rr,ip_vs_sh,ip_vs_wrr
nf_conntrack 139264 2 ip_vs,nf_conntrack_ipv4
libcrc32c 12644 3 xfs,ip_vs,nf_conntrack
# Kernel parameters required by the kubeadm preflight checks
cat <<EOF > /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward=1
vm.max_map_count=262144
EOF
# Apply and verify the kernel settings
sysctl -p /etc/sysctl.d/k8s.conf
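Note that modprobe only loads the modules for the current boot. A sketch for loading the same modules automatically at startup on a systemd system (the file name k8s.conf is arbitrary):
cat <<EOF > /etc/modules-load.d/k8s.conf
br_netfilter
ip_vs
ip_vs_sh
ip_vs_rr
ip_vs_wrr
nf_conntrack_ipv4
EOF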
Problem 5: during kubeadm init, crictl fails because --runtime-endpoint is not configured correctly
kubeadm init reports the following error:
... ...
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'
Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all running Kubernetes containers by using crictl:
- 'crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
- 'crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock logs CONTAINERID'
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
To see the stack trace of this error execute with --v=5 or higher
The log suggests that the crictl command itself is failing because it is not pointed at unix:///var/run/containerd/containerd.sock. Running crictl directly reproduces the error:
[root@master k8s_install]# crictl images list
WARN[0000] image connect using default endpoints: [unix:///var/run/dockershim.sock unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
E1010 17:19:18.816289 3832 remote_image.go:119] "ListImages with filter from image service failed" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /var/run/dockershim.sock: connect: no such file or directory\"" filter="&ImageFilter{Image:&ImageSpec{Image:list,Annotations:map[string]string{},},}"
FATA[0000] listing images: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/dockershim.sock: connect: no such file or directory"
The cause of the error above is that crictl falls back to its default endpoints [unix:///var/run/dockershim.sock unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. These defaults are deprecated, so the containerd.sock endpoint has to be specified explicitly; the final error is simply that dockershim.sock cannot be found.
Solution: edit the crictl configuration file
cat > /etc/crictl.yaml <<EOF
runtime-endpoint: unix:///var/run/containerd/containerd.sock
image-endpoint: unix:///var/run/containerd/containerd.sock
timeout: 0
debug: false
pull-image-on-create: false
EOF
Running crictl images list again no longer reports an error:
[root@master ~]# crictl images list
IMAGE TAG IMAGE ID SIZE
registry.aliyuncs.com/google_containers/coredns v1.10.1 ead0a4a
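As an alternative (or in addition) to /etc/crictl.yaml, the endpoints can be passed on each invocation with --runtime-endpoint/--image-endpoint, or exported via the CONTAINER_RUNTIME_ENDPOINT and IMAGE_SERVICE_ENDPOINT environment variables that crictl reads:
# One-off invocation with explicit endpoints
crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock \
       --image-endpoint unix:///var/run/containerd/containerd.sock images
# Or export the endpoints for the current shell
export CONTAINER_RUNTIME_ENDPOINT=unix:///var/run/containerd/containerd.sock
export IMAGE_SERVICE_ENDPOINT=unix:///var/run/containerd/containerd.sock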
Problem 6: failed to pull the pause image
Problem description and troubleshooting
During kubeadm init, the error The kubelet is not running appears:
Unfortunately, an error has occurred:
timed out waiting for the condition
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'
Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all running Kubernetes containers by using crictl:
- 'crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
- 'crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock logs CONTAINERID'
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
To see the stack trace of this error execute with --v=5 or higher
Following the hint in the log and running crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock ps -a shows that no containers are running. The containerd log contains the following error:
[root@master ~]# journalctl -fu containerd
Oct 11 08:35:16 master.k8s containerd[1903]: time="2023-10-11T08:35:16.760026536+08:00" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:kube-apiserver-node,Uid:a5a7c15a42701ab6c9dca630e6523936,Namespace:kube-system,Attempt:0,} failed, error" error="failed to get sandbox image \"registry.k8s.io/pause:3.6\": failed to pull image \"registry.k8s.io/pause:3.6\": failed to pull and unpack image \"registry.k8s.io/pause:3.6\": failed to resolve reference \"registry.k8s.io/pause:3.6\": failed to do request: Head \"https://asia-east1-docker.pkg.dev/v2/k8s-artifacts-prod/images/pause/manifests/3.6\": dial tcp 108.177.125.82:443: connect: connection refused"
Oct 11 08:35:18 master.k8s containerd[1903]: time="2023-10-11T08:35:18.606581001+08:00" level=info msg="trying next host" error="failed to do request: Head \"https://asia-east1-docker.pkg.dev/v2/k8s-artifacts-prod/images/pause/manifests/3.6\": dial tcp 108.177.125.82:443: connect: connection refused" host=registry.k8s.io
...
The log shows that containerd failed to pull the sandbox image:
error="failed to get sandbox image \"registry.k8s.io/pause:3.6\"
Solution: modify the containerd configuration
Use containerd config dump to inspect the current containerd configuration:
[root@master k8s_install]# containerd config dump
... ...
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "registry.k8s.io/pause:3.6"
selinux_category_range = 1024
... ...
The containerd configuration uses registry.k8s.io/pause:3.6 as the pause (sandbox) image repo, while the image available locally is registry.aliyuncs.com/google_containers/pause:3.9:
[root@master k8s_install]# crictl images list | grep pause
registry.aliyuncs.com/google_containers/pause 3.9 e6f1816883972 322kB
Run containerd config dump > /etc/containerd/config.toml to export the current configuration to the config file, then change the sandbox_image setting.
## Export the current configuration to the config file
containerd config dump > /etc/containerd/config.toml
## Edit /etc/containerd/config.toml and change the sandbox_image setting to the locally available image
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.9"
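After changing sandbox_image, restart containerd so the new setting takes effect (same as in Problem 1):
sudo systemctl restart containerd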