zifeo / terraform-openstack-rke2

Easily deploy a high-availability RKE2 Kubernetes cluster on OpenStack providers like Infomaniak.
https://registry.terraform.io/modules/zifeo/rke2/openstack/latest
Mozilla Public License 2.0

Cilium can't connect to k8s api-server #4

Closed: UncleSamSwiss closed this issue 1 year ago

UncleSamSwiss commented 1 year ago

First of all, thanks a lot for this awesome module!

I'm trying to set up a cluster on Infomaniak and it is up and running (I can connect with OpenLens), but Cilium doesn't work. The Cilium DaemonSet pods all report:

level=info msg="Establishing connection to apiserver" host="https://10.43.0.1:443" subsys=k8s
level=info msg="Establishing connection to apiserver" host="https://10.43.0.1:443" subsys=k8s
level=error msg="Unable to contact k8s api-server" error="Get \"https://10.43.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.43.0.1:443: i/o timeout" ipAddr="https://10.43.0.1:443" subsys=k8s
level=fatal msg="Unable to initialize Kubernetes subsystem" error="unable to create k8s client: unable to create k8s client: Get \"https://10.43.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.43.0.1:443: i/o timeout" subsys=daemon

It seems very odd that it is trying to connect to 10.43.0.1:443, which looks like the default in-cluster kubernetes Service address, while the configuration says otherwise: https://github.com/zifeo/terraform-openstack-rke2/blob/eee89780269195a8f77c145fc5756ca4c3183b40/manifests/cilium.yml.tpl#L15

I tried with v1.25.3+rke2r1 and v1.26.0+rke2r2, but both have the exact same issue.

I'm using this module with the following configuration, which is derived from the example. Variable values are of course omitted below as they contain sensitive info.

locals {
  config = <<EOF
# https://docs.rke2.io/install/install_options/server_config/

etcd-snapshot-schedule-cron: "0 */6 * * *"
etcd-snapshot-retention: 20

# control-plane-resource-requests: kube-apiserver-cpu=75m,kube-apiserver-memory=128M,kube-scheduler-cpu=75m,kube-scheduler-memory=128M,kube-controller-manager-cpu=75m,kube-controller-manager-memory=128M,kube-proxy-cpu=75m,kube-proxy-memory=128M,etcd-cpu=75m,etcd-memory=128M,cloud-controller-manager-cpu=75m,cloud-controller-manager-memory=128M
  EOF
}

module "rke2" {
  source = "zifeo/rke2/openstack"

  bootstrap           = true # only on first run (see the note after this block)
  name                = "i8k-test"
  ssh_public_key_file = "${path.module}/../../id_rsa.pub"
  floating_pool       = "ext-floating1"

  # should be restricted to a secure bastion
  rules_ssh_cidr = "0.0.0.0/0"
  rules_k8s_cidr = "0.0.0.0/0"

  servers = [
    for i in range(1, 4) : {
      name = "server-${format("%03d", i)}"

      flavor_name      = "a2-ram4-disk0"
      image_name       = "Ubuntu 22.04 LTS Jammy Jellyfish"
      system_user      = "ubuntu"
      boot_volume_size = 4

      rke2_version     = "v1.25.3+rke2r1"
      rke2_volume_size = 6
      # https://docs.rke2.io/install/install_options/install_options/#configuration-file
      rke2_config = local.config
    }
  ]

  agents = [
    for i in range(1, 4) : {
      name        = "pool-${format("%03d", i)}"
      nodes_count = 1

      flavor_name      = "a4-ram8-disk0"
      image_name       = "Ubuntu 20.04 LTS Focal Fossa"
      system_user      = "ubuntu"
      boot_volume_size = 8

      rke2_version     = "v1.25.3+rke2r1"
      rke2_volume_size = 16
    }
  ]

  # automatically run `kubectl delete node AGENT-NAME` after an agent change
  ff_autoremove_agent = true
  # write the generated kubeconfig locally
  ff_write_kubeconfig = false
  # enable native etcd backups
  ff_native_backup = true

  identity_endpoint     = var.openstack_auth_url
  object_store_endpoint = "s3.pub1.infomaniak.cloud"
}
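
A note on the `bootstrap` input above: if I read its comment right, it is only meant for the very first apply, so later runs keep the same configuration with the flag turned off. A minimal sketch of that follow-up change (everything else stays as posted above):

module "rke2" {
  source = "zifeo/rke2/openstack"

  # after the initial apply has bootstrapped the cluster, disable bootstrap
  bootstrap = false

  # ... all other inputs unchanged from the configuration above
}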
zifeo commented 1 year ago

@UncleSamSwiss Thank you for the feedback. It seems like v1.25.3+rke2r1 was somehow rotten, and the module had a few other issues with single-server clusters (it was originally designed for that setup but later expanded to focus on high availability). Can you please retry with version v1.1.0?
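
For reference, picking up that release only requires a version constraint on the module block (assuming the tag is published as 1.1.0 on the Terraform registry); the rest of the configuration above stays unchanged:

module "rke2" {
  source  = "zifeo/rke2/openstack"
  version = "1.1.0" # release suggested above

  # ... same inputs as in the original configuration
}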

Note that:

UncleSamSwiss commented 1 year ago

This is absolutely amazing, thanks a lot! Not only did you find the errors, but you also fixed them and released a new version - all in less than a day.