xunleii / terraform-module-k3s

Terraform module to manage a k3s cluster on given machines
https://registry.terraform.io/modules/xunleii/k3s/module
MIT License

:bug: Cannot scale up server nodes #153

Closed: alinalex1392 closed this issue 8 months ago

alinalex1392 commented 11 months ago

:fire: What happened?

  1. Create a k3s cluster with only one server node; the cluster works as intended:

Config

k3s_version = "v1.27.5+k3s1"
k3s_install_env_vars = {}
drain_timeout = "300s"
private_key_path = "/home/coder/.ssh/id_rsa"
cidr = {
    pods = "10.0.0.0/16"
    services = "10.1.0.0/16"
}
managed_fields = ["label", "taint"]
servers = {
    # The node name will be automatically provided by
    # the module using the field name... any usage of
    # --node-name in additional_flags will be ignored
    node-1 = {
        ip = "10.195.64.82" // internal node IP
        connection = {
            host = "10.195.64.82" // public node IP
            user = "ubuntu"
        }
        flags = ["--tls-san cluster.local", "--write-kubeconfig-mode '0644'", "--disable-network-policy"]
        labels = {"node.kubernetes.io/type" = "master"}
    }
}
  2. Try to scale the server nodes to 3 with the following config:
k3s_version = "v1.27.5+k3s1"
k3s_install_env_vars = {}
drain_timeout = "300s"
private_key_path = "/home/coder/.ssh/id_rsa"
cidr = {
    pods = "10.0.0.0/16"
    services = "10.1.0.0/16"
}
managed_fields = ["label", "taint"]
servers = {
    # The node name will be automatically provided by
    # the module using the field name... any usage of
    # --node-name in additional_flags will be ignored
    node-1 = {
        ip = "10.195.64.82" // internal node IP
        connection = {
            host = "10.195.64.82" // public node IP
            user = "ubuntu"
        }
        flags = ["--tls-san cluster.local", "--write-kubeconfig-mode '0644'", "--disable-network-policy"]
        labels = {"node.kubernetes.io/type" = "master"}
    },
    node-2 = {
        ip = "10.195.64.223" // internal node IP
        connection = {
            host = "10.195.64.223" // public node IP
            user = "ubuntu"
        }
        flags = ["--tls-san cluster.local", "--write-kubeconfig-mode '0644'", "--disable-network-policy"]
        labels = {"node.kubernetes.io/type" = "master"}
    },
    node-3 = {
        ip = "10.195.64.208" // internal node IP
        connection = {
            host = "10.195.64.208" // public node IP
            user = "ubuntu"
        }
        flags = ["--tls-san cluster.local", "--write-kubeconfig-mode '0644'", "--disable-network-policy"]
        labels = {"node.kubernetes.io/type" = "master"}
    }
}

The logs on node-2 and node-3 show the following error; the new nodes are unable to join the initial cluster:

journalctl -xef

Nov 14 08:17:29 ip-10-195-64-103 sh[24077]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Nov 14 08:17:29 ip-10-195-64-103 sh[24078]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Nov 14 08:17:30 ip-10-195-64-103 k3s[24081]: time="2023-11-14T08:17:30Z" level=info msg="Starting k3s v1.27.5+k3s1 (8d074ecb)"
Nov 14 08:17:30 ip-10-195-64-103 k3s[24081]: time="2023-11-14T08:17:30Z" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation."
Nov 14 08:17:30 ip-10-195-64-103 k3s[24081]: time="2023-11-14T08:17:30Z" level=info msg="Managed etcd cluster not yet initialized"
Nov 14 08:17:30 ip-10-195-64-103 k3s[24081]: time="2023-11-14T08:17:30Z" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation."
Nov 14 08:17:30 ip-10-195-64-103 k3s[24081]: time="2023-11-14T08:17:30Z" level=fatal msg="starting kubernetes: preparing server: https://10.195.64.124:6443/v1-k3s/server-bootstrap: 400 Bad Request"
Nov 14 08:17:30 ip-10-195-64-103 systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
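
The warning above also hints at a check worth doing: according to the log, the token handed to the joining nodes does not include a CA hash, while the full token in the server's node-token file does. As a quick diagnostic (a sketch only, assuming the default k3s data directory; adjust the path if your installation differs):

# on node-1, the server that already works
sudo cat /var/lib/rancher/k3s/server/node-token

If the token passed to node-2 and node-3 is only the short credential part of that value, the "token does not include a CA hash" warning above is expected, and CA validation of the bootstrap endpoint cannot happen.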

:+1: What did you expect to happen?

The new servers should have joined the existing cluster.

:mag: How can we reproduce the issue?

The reproduction steps are in the first section.

:wrench: Module version

"3.3.0"

:wrench: Terraform version

1.6.3

:wrench: Terraform providers

Providers required by configuration:
.
├── provider[registry.terraform.io/hashicorp/null] 3.2.1
└── module.k3s
    ├── provider[registry.terraform.io/hashicorp/null] ~> 3.0
    ├── provider[registry.terraform.io/hashicorp/random] ~> 3.0
    ├── provider[registry.terraform.io/hashicorp/tls] ~> 4.0
    └── provider[registry.terraform.io/hashicorp/http] ~> 3.0

:clipboard: Additional information

No response

xunleii commented 11 months ago

Hi @alinalex1392, thanks for your issue. I hadn't seen that before, but we need to add the CA hash inside the token definition (https://docs.k3s.io/cli/token#secure).

I need more time to see how I can provide a migration path for this case; it will break any installation created before this feature, and all TLS certificates will have to be generated by this module (they can't be optional).

I'm tagging your issue for the v4 release.
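
For context, the secure token format described in the linked documentation is K10<CA-HASH>::<USERNAME>:<PASSWORD>; the K10<CA-HASH> prefix is what lets a joining node validate the cluster CA before trusting the bootstrap endpoint, which is exactly the validation the warning in the logs says is being skipped. A rough illustration with placeholder values only (not a real token):

# full (secure) token: CA hash plus credentials, CA validation possible
K10a1b2c3...::server:badc0ffee...
# short token: credentials only, no CA validation possible
badc0ffee...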

github-actions[bot] commented 10 months ago

This issue has been automatically marked as stale because it has not had recent activity. If the issue still persists, please leave a comment and it will be reopened.

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. If the issue still persists, please leave a comment and it will be reopened.