yunionio / cloudpods

A cloud-native open-source unified multi-cloud and hybrid-cloud platform.
https://www.cloudpods.org
Apache License 2.0
2.59k stars, 534 forks

[BUG] High-availability deployment with ocboot on CentOS 7 fails with: The error was: 'onecloud_version' is undefined #18357

Closed chenjacken closed 1 year ago

chenjacken commented 1 year ago

1. Versions. Operating system version:

[root@master1 ocboot]# cat /etc/os-release 
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Cloudpods version: v3.10.6

2. Running the high-availability deployment with ./ocboot.py install ./config-k8s-ha.yml fails with:

TASK [utils/kernel-check : Define kernel regex] ********************************
ok: [172.16.1.10]
ok: [172.16.1.9]

TASK [utils/kernel-check : test if nbd is supported] ***************************
changed: [172.16.1.10]
changed: [172.16.1.9]

TASK [utils/kernel-check : nbd facts] ******************************************
skipping: [172.16.1.10]
skipping: [172.16.1.9]

TASK [utils/kernel-check : Is Cloud kernel running] ****************************
ok: [172.16.1.10]
ok: [172.16.1.9]

TASK [utils/kernel-check : Is cloud kernel installed] **************************
ok: [172.16.1.10]
ok: [172.16.1.9]

TASK [utils/kernel-check : install customized kernel] **************************
included: /opt/ocboot/onecloud/roles/utils/kernel-check/tasks/centos-x86_64.yml for 172.16.1.10, 172.16.1.9

TASK [utils/kernel-check : version test] ***************************************
fatal: [172.16.1.10]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'onecloud_version' is undefined\n\nThe error appears to be in '/opt/ocboot/onecloud/roles/utils/kernel-check/tasks/centos-x86_64.yml': line 2, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n# This role contains common plays that will run on all nodes\n- name: version test\n  ^ here\n"}
fatal: [172.16.1.9]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'onecloud_version' is undefined\n\nThe error appears to be in '/opt/ocboot/onecloud/roles/utils/kernel-check/tasks/centos-x86_64.yml': line 2, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n# This role contains common plays that will run on all nodes\n- name: version test\n  ^ here\n"}

PLAY RECAP *********************************************************************
172.16.1.10                : ok=171  changed=30   unreachable=0    failed=1    skipped=61   rescued=0    ignored=0
172.16.1.8                 : ok=123  changed=18   unreachable=0    failed=0    skipped=33   rescued=0    ignored=0
172.16.1.9                 : ok=168  changed=29   unreachable=0    failed=1    skipped=60   rescued=0    ignored=0

[root@master1 ocboot]# 
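
When a run is long, the PLAY RECAP block is the quickest place to see which hosts failed. The following is a minimal sketch, not part of ocboot, that parses recap lines like the ones above and lists hosts with a non-zero `failed` counter. The function names `parse_recap` and `failed_hosts` are made up for illustration.

```python
import re

# Recap lines copied from the log above.
RECAP = """\
172.16.1.10                : ok=171  changed=30   unreachable=0    failed=1    skipped=61   rescued=0    ignored=0
172.16.1.8                 : ok=123  changed=18   unreachable=0    failed=0    skipped=33   rescued=0    ignored=0
172.16.1.9                 : ok=168  changed=29   unreachable=0    failed=1    skipped=60   rescued=0    ignored=0
"""

# "<host> : key=value key=value ..."
LINE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>.*)$")


def parse_recap(text):
    """Return {host: {counter_name: int}} for every recap line in text."""
    result = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # skip headers such as "PLAY RECAP ****" and blank lines
        pairs = (item.split("=") for item in match.group("counters").split())
        result[match.group("host")] = {key: int(value) for key, value in pairs}
    return result


def failed_hosts(text):
    """Return the hosts whose 'failed' counter is greater than zero."""
    recap = parse_recap(text)
    return sorted(host for host, counters in recap.items() if counters.get("failed", 0) > 0)


print(failed_hosts(RECAP))
```

Here the two failing hosts are 172.16.1.9 and 172.16.1.10, while 172.16.1.8 completed without failures.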
zexi commented 1 year ago

@chenjacken I fixed this problem yesterday. Could you pull the latest ocboot release/3.10 code and run it again?

zexi commented 1 year ago

@chenjacken Please paste the contents of ./config-k8s-ha.yml.

chenjacken commented 1 year ago

I have already looked at 18166, and the problem is still there.

Contents of config-k8s-ha.yml, written with reference to https://www.cloudpods.org/zh/docs/setup/ha-ce/:

primary_master_node:
  hostname: 172.16.1.8
  use_local: false
  user: root
  onecloud_version: "v3.10.6"
  db_host: 172.16.1.99
  db_user: "root"
  db_password: "hwyDB_@2024"
  db_port: "3306"
  skip_docker_config: true
  image_repository: registry.cn-guangzhou.aliyuncs.com/createview
  ha_using_local_registry: false
  node_ip: "172.16.1.8"
  ip_autodetection_method: "can-reach=172.16.1.8"
  controlplane_host: 172.16.1.100
  controlplane_port: "6443"
  as_host: true
  high_availability: true
  use_ee: false
  enable_minio: true
  registry_mirrors:
  - https://lje6zxpk.mirror.aliyuncs.com
  insecure_registries:
  - 172.16.1.8:5000
  host_networks: "eno1/br0/172.16.1.8"

master_nodes:
  controlplane_host: 172.16.1.100
  controlplane_port: "6443"
  as_controller: true
  as_host: true
  ntpd_server: "172.16.1.8"
  registry_mirrors:
  - https://lje6zxpk.mirror.aliyuncs.com
  high_availability: true
  hosts:
  - user: root
    hostname: "172.16.1.9"
    host_networks: "eno1/br0/172.16.1.9"
  - user: root
    hostname: "172.16.1.10"
    host_networks: "eno1/br0/172.16.1.10"
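
One observation about this config: `onecloud_version` appears only under `primary_master_node`, and the two hosts that failed (172.16.1.9 and 172.16.1.10) are the ones listed under `master_nodes`, while 172.16.1.8 finished with failed=0. The thread does not say whether `master_nodes` is expected to carry its own `onecloud_version`, and the maintainers fixed the error in ocboot itself, so the config is not necessarily wrong. The sketch below only shows how one might check which top-level sections define a given key. It uses a hand-trimmed dict instead of parsing the YAML file, and `sections_missing` is a hypothetical helper.

```python
# Trimmed copy of the config above as a plain dict (only keys relevant here).
config = {
    "primary_master_node": {
        "hostname": "172.16.1.8",
        "onecloud_version": "v3.10.6",
    },
    "master_nodes": {
        "controlplane_host": "172.16.1.100",
        "hosts": [
            {"hostname": "172.16.1.9"},
            {"hostname": "172.16.1.10"},
        ],
    },
}


def sections_missing(cfg, key):
    """Return the names of top-level sections that do not define key."""
    return [name for name, section in cfg.items() if key not in section]


print(sections_missing(config, "onecloud_version"))
```

For this config the check reports `master_nodes` as the only section without the key.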
chenjacken commented 1 year ago

Re "I fixed this problem yesterday; pull the latest ocboot release/3.10 code and run it again":

I had just pulled the latest version:

# Download the ocboot tool locally
$ git clone -b release/3.10 https://github.com/yunionio/ocboot && cd ./ocboot
hoganlxj commented 1 year ago

@chenjacken Please check your ansible version.

zhasm commented 1 year ago

@chenjacken This looks like a YAML configuration problem. Please paste the contents of ./config-k8s-ha.yml (with IPs, passwords, and so on removed).

chenjacken commented 1 year ago

Re "Please check your ansible version":

I installed it following the official documentation:

# Install ansible and git locally
$ yum install -y epel-release git python3-pip
$ python3 -m pip install --upgrade pip setuptools wheel
$ python3 -m pip install --upgrade ansible

The ansible version:

[root@master1 ~]# ansible --version
[DEPRECATION WARNING]: Ansible will require Python 3.8 or newer on the controller starting with Ansible 2.12. 
Current version: 3.6.8 (default, Jun 20 2023, 11:53:23) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]. This feature 
will be removed from ansible-core in version 2.12. Deprecation warnings can be disabled by setting 
deprecation_warnings=False in ansible.cfg.
/usr/local/lib/python3.6/site-packages/ansible/parsing/vault/__init__.py:44: CryptographyDeprecationWarning: Python 3.6 is no longer supported by the Python core team. Therefore, support for it is deprecated in cryptography. The next release of cryptography will remove support for Python 3.6.
  from cryptography.exceptions import InvalidSignature
ansible [core 2.11.12] 
  config file = None
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.6/site-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/local/bin/ansible
  python version = 3.6.8 (default, Jun 20 2023, 11:53:23) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]
  jinja version = 3.0.3
  libyaml = True
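
The deprecation warning in this output states that Ansible will require Python 3.8 or newer on the controller starting with Ansible 2.12, and the controller here runs Python 3.6.8 with ansible-core 2.11.12. The thread does not link this to the reported error. The following is only a small sketch of checking a controller's Python against that stated minimum. `meets_minimum` is a hypothetical helper.

```python
import sys

# Minimum controller Python quoted in the deprecation warning above
# (applies from Ansible 2.12 onward).
MIN_CONTROLLER_PYTHON = (3, 8)


def meets_minimum(version, minimum=MIN_CONTROLLER_PYTHON):
    """Compare the (major, minor) part of version against the minimum."""
    return tuple(version[:2]) >= minimum


# The controller in this report runs Python 3.6.8.
print(meets_minimum((3, 6, 8)))

# The interpreter running this script.
print(meets_minimum(sys.version_info))
```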
chenjacken commented 1 year ago

Re "This looks like a YAML configuration problem. Please paste the contents of ./config-k8s-ha.yml (with IPs, passwords, and so on removed)": I posted it above. https://github.com/yunionio/cloudpods/issues/18357#issuecomment-1769804759

zexi commented 1 year ago

@chenjacken I just committed a fix. Please test again with the latest code from the ocboot release/3.10 branch.

chenjacken commented 1 year ago

Thanks!

I will test again.

zexi commented 1 year ago

@chenjacken https://github.com/yunionio/ocboot/pull/990/files I also just fixed a syntax problem. If you run into an error, update the code again.

chenjacken commented 1 year ago

I no longer hit this problem.

RUNNING HANDLER [utils/config-network-manager : Reload NetworkManager] *********
changed: [172.16.1.10]
changed: [172.16.1.9]
changed: [172.16.1.8]

RUNNING HANDLER [utils/config-network-manager : Remove immutable flag on /etc/resolv.conf] ***
changed: [172.16.1.10]
changed: [172.16.1.9]
changed: [172.16.1.8]
[WARNING]: Could not match supplied host pattern, ignoring: mariadb_node

PLAY [mariadb_node] ************************************************************
skipping: no hosts matched
[WARNING]: Could not match supplied host pattern, ignoring: mariadb_ha_nodes

PLAY [mariadb_ha_nodes] ********************************************************
skipping: no hosts matched
[WARNING]: Could not match supplied host pattern, ignoring: clickhouse_node

PLAY [clickhouse_node] *********************************************************
skipping: no hosts matched
[WARNING]: Could not match supplied host pattern, ignoring: registry_node

PLAY [registry_node] ***********************************************************
skipping: no hosts matched

PLAY [primary_master_node] *****************************************************

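Can the database be deployed directly by the script? The document at https://www.cloudpods.org/zh/docs/setup/ha-ce/ requires deploying the high-availability database manually, while https://github.com/yunionio/ocboot/blob/release/3.10/README.md says that, once configured, the script can install a high-availability database directly.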

chenjacken commented 1 year ago

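Did the install script skip installing keepalived and nc? After the script finished, the VIP did not work. I installed them manually with yum install -y keepalived nc and rebooted, and then the VIP became reachable.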

zexi commented 1 year ago

https://www.cloudpods.org/zh/docs/setup/ha-ce/

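@chenjacken Yes, the script can deploy a high-availability mariadb, but it currently deploys in dual-master mode, which is no longer the approach MariaDB officially recommends. We plan to switch to a 3-node cluster mode later, so we did not document it on cloudpods.org. We still recommend that users manage and maintain this high-availability database themselves.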

zexi commented 1 year ago

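Re "Did the install script skip installing keepalived and nc? After the script finished, the VIP did not work. I installed them manually with yum install -y keepalived nc and rebooted, and then the VIP became reachable":

@chenjacken keepalived runs inside a container, so it does not need to be installed separately. On each node, find the container with docker ps -a | grep keepalived and check its logs.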
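
The `docker ps -a | grep keepalived` check above can also be done on captured output. This is a minimal sketch, assuming the listing was produced with `docker ps -a --format '{{.Names}}\t{{.Status}}'` so that each line is a tab-separated name and status. The container names in `SAMPLE` are invented placeholders, not real ocboot container names.

```python
def keepalived_containers(ps_output):
    """Mimic `grep keepalived`: return (name, status) for each matching line."""
    rows = []
    for line in ps_output.splitlines():
        if "keepalived" not in line:
            continue
        name, _, status = line.partition("\t")
        rows.append((name, status))
    return rows


# Hypothetical sample output; real names and statuses will differ per node.
SAMPLE = "example_keepalived\tUp 2 hours\nexample_other\tExited (0) 1 hour ago\n"

for name, status in keepalived_containers(SAMPLE):
    print(f"{name}: {status}")
```

A container that shows as exited, or no matching container at all, would be the cue to read that container's logs as suggested above.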

chenjacken commented 1 year ago

Understood. Thanks!

chenjacken commented 1 year ago

After the high-availability deployment finished, adding a compute node failed.

Command: ./ocboot.py add-node 172.16.1.8 172.16.1.5. Error output:

TASK [utils/kernel-check : Is Cloud kernel running] ****************************
ok: [172.16.1.5]

TASK [utils/kernel-check : Is cloud kernel installed] **************************
ok: [172.16.1.5]

TASK [utils/kernel-check : install customized kernel] **************************
included: /opt/hwcloud-ocboot/onecloud/roles/utils/kernel-check/tasks/centos-x86_64.yml for 172.16.1.5

TASK [utils/kernel-check : version test] ***************************************
fatal: [172.16.1.5]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'ansible_python_interpreter' is undefined\n\nThe error appears to be in '/opt/hwcloud-ocboot/onecloud/roles/utils/kernel-check/tasks/centos-x86_64.yml': line 2, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n# This role contains common plays that will run on all nodes\n- name: version test\n  ^ here\n"}

RUNNING HANDLER [utils/config-network-manager : Reload NetworkManager] *********

PLAY RECAP *********************************************************************
172.16.1.5                 : ok=99   changed=45   unreachable=0    failed=1    skipped=27   rescued=0    ignored=0

Thanks! @zexi @zhasm

zhasm commented 1 year ago

@chenjacken Thanks. I will look into the error that occurs when adding a compute node after the high-availability deployment: undefined variable. The error was: 'ansible_python_interpreter' is undefined.

zhasm commented 1 year ago

@chenjacken Fixed. Please pull the latest code.

zexi commented 1 year ago

https://github.com/yunionio/ocboot/pull/999 @chenjacken This is an attempt to fix the problem. Please pull the code and try again.

chenjacken commented 1 year ago

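Re "yunionio/ocboot#999 is an attempt to fix the problem; please pull the code and try again":

I only saw this now. The problem last night occurred in this scenario: I opened several SSH windows and ran the add-node command in all of them at the same time. One node was added successfully, and the other SSH windows reported the error above. My workaround last night was to stop adding compute nodes concurrently and add them one at a time, starting the next one only after the previous one had finished.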

zhasm commented 1 year ago

@chenjacken You can pass multiple IPs to add nodes in one batch:

usage: ocboot.py add-node [-h] [--user SSH_USER] [--key-file SSH_PRIVATE_FILE]
                          [--port SSH_PORT] [--node-port SSH_NODE_PORT]
                          [--enable-host-on-vm]
                          FIRST_MASTER_HOST TARGET_NODE_HOSTS
                          [TARGET_NODE_HOSTS ...]

For example:

python3 ocboot.py add-node <ip1> <ip2> <ip3>
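
Following the usage text above, the first positional argument is FIRST_MASTER_HOST and every later one is a target node. The following is a minimal sketch of building one batched invocation in place of several concurrent ones. `build_add_node_command` is a hypothetical helper, and 172.16.1.6 is a made-up second target used only for illustration.

```python
def build_add_node_command(first_master, targets, script="ocboot.py"):
    """Build a single add-node invocation covering all target hosts."""
    if not targets:
        raise ValueError("at least one target node is required")
    return ["python3", script, "add-node", first_master, *targets]


# 172.16.1.8 is the primary master from this thread; 172.16.1.5 is the node
# added above; 172.16.1.6 is a hypothetical extra node.
cmd = build_add_node_command("172.16.1.8", ["172.16.1.5", "172.16.1.6"])
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the ocboot checkout, which keeps all additions in one run and avoids launching add-node from several SSH windows at once.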
chenjacken commented 1 year ago

Understood, thanks!!