wmenjoy / awesome-knowleges

A collection of useful knowledge

Troubleshooting common Rancher issues #80

Open wmenjoy opened 3 years ago

wmenjoy commented 3 years ago

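etcd component

1. A strange K8S rolling-update failure

1: After a redeploy, the deployment is stuck reporting "deploying": 0 replicas available, 2 newly created. The service itself deployed successfully and kubelet is healthy, but kube-controller-manager reports that the object is not the latest version.

Symptoms

a. Check the kube-controller-manager log: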

I0111 17:41:09.923836       1 deployment_controller.go:484] Error syncing deployment footstone-common/bjca-deployer: Operation cannot be fulfilled on deployments.apps "bjca-deployer": the object has been modified; please apply your changes to the latest version and try again

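b. describe pod shows the MinAvailablePodn status as false.

c. Within the last day, a k8s host that runs etcd had reported a failed status.

Analysis: an etcd host had just gone down and was re-joined through Rancher, so the cause may be inconsistent etcd data. After kube-controller-manager was stopped, control automatically moved to another machine and the status recovered.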
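If inconsistent etcd data is the suspected cause, one way to look for evidence is to compare the key-value hash that each etcd member reports. The following is only a sketch under several assumptions, not the procedure used in this incident: it assumes an RKE-provisioned node where etcd runs in a container named etcd, an etcdctl that supports `endpoint hashkv --cluster`, and the default output of one `endpoint, hash` line per member. The helper name hashes_agree is hypothetical.

```shell
#!/bin/sh
# hashes_agree (hypothetical helper): reads "endpoint, hash" lines on stdin
# and succeeds only when every line carries the same hash value.
hashes_agree() {
    # Take the second comma-separated field (the hash) and count distinct values.
    distinct=$(awk -F', ' 'NF >= 2 { print $2 }' | sort -u | wc -l | tr -d ' ')
    [ "$distinct" -eq 1 ]
}

# Usage on an etcd node (container name "etcd" is an assumption):
#   docker exec etcd etcdctl endpoint hashkv --cluster | hashes_agree \
#       && echo "hashes match" || echo "hashes differ, investigate"
```

Hashes can differ briefly while members are at different revisions, so a mismatch on a busy cluster is a hint to investigate further rather than proof of corruption.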

References

  1. Analysis of a three-year-old etcd3 data inconsistency bug - 腾讯云原生 - 博客园
wmenjoy commented 3 years ago

etcd installation issues

  1. The following error is reported:
    2021-03-15 07:25:06.006959 I | embed: rejected connection from "192.168.214.32:32642" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")

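    Cause: different clients generate different configurations, so the CA certificate may no longer be valid. Delete the configuration under /etc/kubernetes, remove the corresponding Docker service, and regenerate it.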

wmenjoy commented 3 years ago

Changing the Kubernetes IP

Address change

  1. IP address changes in Kubernetes Master Node | by Juniarto Samsudin | Medium
wmenjoy commented 2 years ago

Cattle-System

  1. Fix for the cattle-system namespace staying in Terminating state after removing Rancher (k3s cluster monitoring)

    kubectl patch namespace cattle-system -p '{"metadata":{"finalizers":[]}}' --type='merge' -n cattle-system
    kubectl delete namespace cattle-system --grace-period=0 --force

    kubectl patch namespace cattle-global-data -p '{"metadata":{"finalizers":[]}}' --type='merge' -n cattle-system
    kubectl delete namespace cattle-global-data --grace-period=0 --force

    kubectl patch namespace local -p '{"metadata":{"finalizers":[]}}' --type='merge' -n cattle-system

    # Clear the finalizers on every namespaced resource left in "local"
    for resource in $(kubectl api-resources --verbs=list --namespaced -o name | xargs -n 1 kubectl get -o name -n local); do
      kubectl patch $resource -p '{"metadata": {"finalizers": []}}' --type='merge' -n local
    done

    kubectl delete namespace local --grace-period=0 --force


2. Delete directly through the api-server
- 1. Start the proxy

Pick any machine:

    kubectl proxy --port=8081

- 2. Export the namespace as JSON

    ns=cattle-fleet-system
    kubectl get ns $ns -o json > tmp.json

- 3. Edit the JSON
   Change spec to an empty object, which drops its finalizers:

    "spec": {}

- 4. Call the API

    ns=cattle-fleet-system
    curl -k -H "Content-Type: application/json" -X PUT --data-binary @tmp.json http://127.0.0.1:8081/api/v1/namespaces/$ns/finalize
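The manual JSON edit in step 3 can also be scripted. The sketch below assumes the export looks like the usual `kubectl get ns -o json` output, where the key is written as `"finalizers": [` with a single space after the colon. It empties the array instead of removing it, which should have the same effect for the finalize call. The helper name clear_finalizers is hypothetical.

```shell
#!/bin/sh
# clear_finalizers (hypothetical helper): reads a namespace JSON document on
# stdin, joins it into one line, and empties every "finalizers" array.
clear_finalizers() {
    tr -d '\n' | sed 's/"finalizers": \[[^]]*]/"finalizers": []/g'
}

# Usage, replacing the manual edit in step 3:
#   ns=cattle-fleet-system
#   kubectl get ns "$ns" -o json | clear_finalizers > tmp.json
```

The resulting tmp.json can then be sent to the finalize endpoint with the curl command from step 4.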



## References
1. [Fix for the cattle-system namespace staying in Terminating state after removing Rancher (k3s cluster monitoring) - 龍尐's blog - CSDN](https://blog.csdn.net/qq_37279279/article/details/107961464)
2. [Kubernetes cannot delete a namespace stuck in Terminating - 吕楚王's blog - CSDN](https://blog.csdn.net/tongzidane/article/details/88988542?spm=1001.2101.3001.6650.2&utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7Edefault-2-88988542-blog-107213441.pc_relevant_multi_platform_whitelistv2&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7Edefault-2-88988542-blog-107213441.pc_relevant_multi_platform_whitelistv2&utm_relevant_index=5)
wmenjoy commented 2 years ago

Rancher cleanup

    #!/bin/sh

    # Remove all containers and volumes, then the Kubernetes/CNI state directories
    docker rm -f $(docker ps -qa)
    docker volume rm $(docker volume ls -q)
    cleanupdirs="/var/lib/etcd /etc/kubernetes /etc/cni /opt/cni /var/lib/cni /var/run/calico"
    for dir in $cleanupdirs; do
      echo "Removing $dir"
      rm -rf $dir
    done

Second method

    df -h | grep kubelet | awk -F % '{print $2}' | xargs umount

Remove all containers

    sudo docker rm -f $(sudo docker ps -qa)

Remove the /var/etcd directory

    sudo rm -rf /var/etcd

Remove the /var/lib/kubelet/ directory, unmounting everything under it first

    for m in $(sudo tac /proc/mounts | sudo awk '{print $2}' | sudo grep /var/lib/kubelet); do
      sudo umount $m || true
    done
    sudo rm -rf /var/lib/kubelet/

Remove the /var/lib/rancher/ directory, unmounting everything under it first

    for m in $(sudo tac /proc/mounts | sudo awk '{print $2}' | sudo grep /var/lib/rancher); do
      sudo umount $m || true
    done
    sudo rm -rf /var/lib/rancher/

Remove the /run/kubernetes/ directory

    sudo rm -rf /run/kubernetes/

Remove all volumes

    sudo docker volume rm $(sudo docker volume ls -q)

List all containers and volumes again to make sure nothing is left

    sudo docker ps -a
    sudo docker volume ls

    rm -rf /var/lib/kubelet/*
    rm -rf /etc/kubernetes/*
    rm -rf /var/lib/rancher/*
    rm -rf /var/lib/etcd/*
    rm -rf /var/lib/cni/*

    iptables -F && iptables -t nat -F
    ip link del flannel.1

    # NR>1 skips the header row of each listing
    docker ps -a | awk 'NR>1 {print $1}' | xargs docker rm -f
    docker volume ls | awk 'NR>1 {print $2}' | xargs docker volume rm
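The unmount loops in the second method read /proc/mounts in reverse so that the most recently mounted paths are unmounted first. A minimal sketch of that selection step as a reusable function, assuming POSIX sh and awk; the name mounts_under and its optional second argument (an alternative mounts file, handy for a dry run) are hypothetical.

```shell
#!/bin/sh
# mounts_under (hypothetical helper): print the mount points at or below a
# directory, last-mounted first, so they can be unmounted in a safe order.
#   $1  directory prefix, e.g. /var/lib/kubelet
#   $2  mounts file to read (defaults to /proc/mounts)
mounts_under() {
    prefix=$1
    mounts_file=${2:-/proc/mounts}
    awk -v p="$prefix" '
        # Field 2 of each line is the mount point; match the prefix itself
        # or any path below it, but not siblings such as /var/lib/kubelet2.
        $2 == p || index($2, p "/") == 1 { m[n++] = $2 }
        END { for (i = n - 1; i >= 0; i--) print m[i] }
    ' "$mounts_file"
}

# Usage (dry run first, then unmount):
#   mounts_under /var/lib/kubelet
#   for m in $(mounts_under /var/lib/kubelet); do sudo umount "$m" || true; done
```

Unlike a plain grep, the prefix test only matches the directory itself and paths below it, which avoids touching unrelated mounts that merely contain the same string.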

wmenjoy commented 9 months ago

Conflicts

1. IPv6 not supported

canal fails to start

rancher Streaming server stopped unexpectedly: listen tcp [::1]:0: bind: cannot assign requested address

It turned out that /etc/hosts.conf had the localhost entry set to #::1. Older versions have poor IPv6 support; after deleting the entry, things recovered.

2. Network conflict: Calico node '192.168.126.16' is already using the IPv4 address 172.18.0.1.

Error from server (BadRequest): a container name must be specified for pod canal-4dj9f, choose one of: [install-cni flexvol-driver calico-node kube-flannel]
[rke@fs01-192-168-131-240 ~]$ kubectl -n kube-system logs canal-4dj9f calico-node
2024-02-20 09:18:56.465 [INFO][9] startup/startup.go 379: Early log level set to info
2024-02-20 09:18:56.465 [INFO][9] startup/startup.go 395: Using NODENAME environment for node name
2024-02-20 09:18:56.466 [INFO][9] startup/startup.go 407: Determined node name: 192.168.126.5
2024-02-20 09:18:56.467 [INFO][9] startup/startup.go 439: Checking datastore connection
2024-02-20 09:18:56.483 [INFO][9] startup/startup.go 463: Datastore connection verified
2024-02-20 09:18:56.484 [INFO][9] startup/startup.go 112: Datastore is ready
2024-02-20 09:18:56.510 [INFO][9] startup/startup.go 759: Using autodetected IPv4 address on interface br-daa07946aef5: 172.18.0.1/16
2024-02-20 09:18:56.510 [INFO][9] startup/startup.go 576: Node IPv4 changed, will check for conflicts
2024-02-20 09:18:56.518 [WARNING][9] startup/startup.go 1119: Calico node '192.168.126.16' is already using the IPv4 address 172.18.0.1.
2024-02-20 09:18:56.518 [WARNING][9] startup/startup.go 1331: Terminating
Calico node failed to start

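In the log above, Calico autodetected the address 172.18.0.1/16 on the bridge interface br-daa07946aef5, which another node already uses. Do not run other services directly on the hosts of the k8s cluster.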
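To spot this situation on a node, it can help to list the bridge interfaces and their addresses. The sketch below assumes that the interface in the log was created by Docker (interfaces named docker0 or br-...), which the log itself does not confirm, and that input comes from `ip -o -4 addr show`. The helper name bridge_addrs is hypothetical.

```shell
#!/bin/sh
# bridge_addrs (hypothetical helper): reads `ip -o -4 addr show` output on
# stdin and prints "<interface> <address/prefix>" for Docker-style bridges.
bridge_addrs() {
    # In the one-line format, field 2 is the interface and field 4 the address.
    awk '$2 ~ /^(br-|docker)/ { print $2, $4 }'
}

# Usage on each node:
#   ip -o -4 addr show | bridge_addrs
```

If several nodes print the same bridge address, Calico's address autodetection may pick it up on more than one of them, as in the log above.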

3. inotify_add_watch -- failed: "No space left on device"

The inotify watch limit was exceeded. Reference: https://askubuntu.com/questions/1088272/inotify-add-watch-failed-no-space-left-on-device
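Before raising anything, it is worth checking the current inotify limits. A small sketch, assuming a Linux host that exposes them under /proc/sys/fs/inotify; the helper name show_inotify_limits and its optional directory argument are hypothetical, and the value 524288 in the usage comment is only an illustrative choice.

```shell
#!/bin/sh
# show_inotify_limits (hypothetical helper): print the kernel inotify limits
# as name=value pairs, reading from /proc/sys/fs/inotify by default.
show_inotify_limits() {
    dir=${1:-/proc/sys/fs/inotify}
    for name in max_user_watches max_user_instances max_queued_events; do
        if [ -r "$dir/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "$dir/$name")"
        fi
    done
}

# Usage:
#   show_inotify_limits
# Raising the watch limit until the next reboot (illustrative value):
#   sudo sysctl -w fs.inotify.max_user_watches=524288
```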

wmenjoy commented 9 months ago

Rancher issues

1. Failed to get job complete status for job rke-network-plugin-deploy-job in namespace kube-system

The network plugin failed to deploy. Fix: flush the iptables rules.

    iptables -F && iptables -F -t nat

2. x509: cannot validate certificate for x because it doesn't contain any IP SANs (seen when using custom certificates)

Fix: restart Docker.

References

  1. https://github.com/rancher/rke/issues/2730
  2. https://github.com/rancher/rke/issues/2216
  3. https://github.com/kubernetes-sigs/metrics-server/issues/196