zifeo / terraform-openstack-rke2

Easily deploy a high-availability RKE2 Kubernetes cluster on OpenStack providers like Infomaniak.
https://registry.terraform.io/modules/zifeo/rke2/openstack/latest
Mozilla Public License 2.0

OpenStack Caracal - Deployment fails #52

Closed radumalica closed 1 month ago

radumalica commented 1 month ago

Hi,

I cloned the repo and modified the single-server example. After a while the master is created (bootstrap=true), along with the 2 agents, security groups, and everything else, but the k8s VIP port (192.168.42.4) is down. The floating IP is created and associated with 192.168.42.4, but nothing happens next.

I get this indefinitely

module.rke2.null_resource.write_kubeconfig[0]: Still creating... [4m40s elapsed]
module.rke2.null_resource.write_kubeconfig[0] (local-exec): Wait rke2.yaml generation
module.rke2.null_resource.write_kubeconfig[0]: Still creating... [4m50s elapsed]
module.rke2.null_resource.write_kubeconfig[0] (local-exec): Wait rke2.yaml generation
module.rke2.null_resource.write_kubeconfig[0]: Still creating... [5m0s elapsed]
module.rke2.null_resource.write_kubeconfig[0] (local-exec): Wait rke2.yaml generation
module.rke2.null_resource.write_kubeconfig[0]: Still creating... [5m10s elapsed]
module.rke2.null_resource.write_kubeconfig[0] (local-exec): Wait rke2.yaml generation
module.rke2.null_resource.write_kubeconfig[0]: Still creating... [5m20s elapsed]
module.rke2.null_resource.write_kubeconfig[0] (local-exec): Wait rke2.yaml generation
module.rke2.null_resource.write_kubeconfig[0]: Still creating... [5m30s elapsed]
module.rke2.null_resource.write_kubeconfig[0] (local-exec): Wait rke2.yaml generation
module.rke2.null_resource.write_kubeconfig[0]: Still creating... [5m40s elapsed]
module.rke2.null_resource.write_kubeconfig[0] (local-exec): Wait rke2.yaml generation
module.rke2.null_resource.write_kubeconfig[0]: Still creating... [5m50s elapsed]
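
For context, that message suggests the module is polling until the server publishes its kubeconfig; a rough sketch of such a wait loop (an assumption about the pattern, not the module's actual code) would be:

resource "null_resource" "write_kubeconfig" {
  provisioner "local-exec" {
    # poll until the cluster hands back rke2.yaml; if the API server VIP
    # never comes up, this loops forever, matching the output above
    command = <<-EOT
      while [ ! -s rke2.yaml ]; do
        echo "Wait rke2.yaml generation"
        sleep 10
      done
    EOT
  }
}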

Also, no LB gets created with this TF example.

Port list in the newly created network for k8s: (screenshot omitted)

Floating IP:

| created_at          | 2024-09-27T15:29:19Z                 |
| description         | FIP for k8s-vip (used)               |
| dns_domain          |                                      |
| dns_name            |                                      |
| fixed_ip_address    | 192.168.42.4                         |
| floating_ip_address | 10.240.0.69                          |
| floating_network_id | 7bc5d6db-0bdc-4e48-a595-cce7d34664ce |
| id                  | f25e62d4-b9a4-4445-8328-6f15fdb42b9d |
| name                | 10.240.0.69                          |
| port_details        | admin_state_up='False', device_id='', device_owner='', mac_address='fa:16:3e:a1:08:1c', name='k8s-vip', network_id='2a6a7e59-d339-43f6-9f48-f09ac54fc963', status='DOWN' |
| port_id             | 408d2523-03a8-46f1-8564-4909f904776e |

As a side note, the 10.240.0.0/24 floating-IP network (called "external_nat_ipv4" in OpenStack) is reachable from the box I am running Terraform on.

This is a fresh deployment from your GitHub code, not an upgrade.

The OpenStack environment itself works correctly; I have production VMs, load balancers, etc. on it. Here it seems the problem might have to do with creating the dummy port:

resource "openstack_networking_port_v2" "dummy" {

Here is my complete file:

  # source = "zifeo/rke2/openstack"
  # version = ""
  source = "./../.."

  # must be true for single server cluster or
  # only on the first run for high-availability cluster
  bootstrap           = true
  name                = "k8s"
  ssh_authorized_keys = ["~/.ssh/id_rsa.pub"]
  floating_pool       = "external_nat_ipv4"
  # should be restricted to a secure bastion
  rules_ssh_cidr = "0.0.0.0/0"
  rules_k8s_cidr = "0.0.0.0/0"
  # auto-load manifests from a folder (https://docs.rke2.io/advanced#auto-deploying-manifests)
  manifests_folder = "./manifests"

  servers = [{
    name = "server-a"

    flavor_name = "4vcpu-8g-ram"
    image_name  = "ubuntu-jammy-amd64"
    # if you want fixed image version
    # image_uuid       = "UUID"
    image_uuid = "ffbff00a-4e4f-4ad3-b486-c8970b8cf6f9"

    system_user      = "ubuntu"
    boot_volume_size = 6

    rke2_version     = "v1.28.4+rke2r1"
    rke2_volume_size = 10
    # https://docs.rke2.io/install/install_options/server_config/
    rke2_config = <<EOF
# https://docs.rke2.io/install/install_options/server_config/
EOF
    }
  ]

  agents = [
    {
      name        = "pool"
      nodes_count = 2

      flavor_name = "4vcpu-8g-ram"
      image_name  = "ubuntu-jammy-amd64"
      # if you want fixed image version
      # image_uuid       = "UUID"
      image_uuid = "ffbff00a-4e4f-4ad3-b486-c8970b8cf6f9"

      system_user      = "ubuntu"
      boot_volume_size = 6

      rke2_version     = "v1.28.4+rke2r1"
      rke2_volume_size = 10
    }
  ]

  backup_schedule  = "0 6 1 * *" # once a month
  backup_retention = 20

  kube_apiserver_resources = {
    requests = {
      cpu    = "75m"
      memory = "128M"
    }
  }

  kube_scheduler_resources = {
    requests = {
      cpu    = "75m"
      memory = "128M"
    }
  }

  kube_controller_manager_resources = {
    requests = {
      cpu    = "75m"
      memory = "128M"
    }
  }

  etcd_resources = {
    requests = {
      cpu    = "75m"
      memory = "128M"
    }
  }

  # automatically remove agents from the cluster (wait at most 30s)
  ff_autoremove_agent = "30s"
  # write the kubeconfig locally
  ff_write_kubeconfig = true
  # deploy etcd backup
  ff_native_backup = true
  # wait for the cluster to be ready when deploying
  ff_wait_ready = true

  identity_endpoint     = "https://10.35.1.150:5000/v3"
  object_store_endpoint = "s3.swift.masternode.ro"
}

output "cluster" {
  value     = module.rke2
  sensitive = true
}

variable "project" {
  type = string
}

variable "username" {
  type = string
}

variable "password" {
  type = string
}

provider "openstack" {
  tenant_name = var.project
  user_name   = var.username
  # checkov:skip=CKV_OPENSTACK_1
  password = var.password
  auth_url = "https://10.35.1.150:5000/v3"
  region   = "RegionOne"
}

terraform {
  required_version = ">= 0.14.0"

  required_providers {
    openstack = {
      source  = "terraform-provider-openstack/openstack"
      version = "~> 3.0.0"
    }
  }
}
zifeo commented 1 month ago

@radumalica thanks for the report.

  1. Can you check whether your installation supports allowed-address-pairs (https://docs.openstack.org/neutron/latest/admin/archives/introduction.html#allowed-address-pairs)?
  2. If yes, can you assign a free floating IP to the first server and share the output of the following?
    crictl ps -a
    journalctl -f -u rke2-server
    cat /var/lib/rancher/rke2/agent/logs/kubelet.log
    cat /var/lib/rancher/rke2/agent/containerd/containerd.log
radumalica commented 1 month ago

The issue can be closed; there were two problems on my side:

  1. OpenStack upgrades over the months: since Antelope, you need enable-chassis-as-gw enabled in OVN for Neutron to schedule routers onto an OVN chassis. I had it disabled, so my previously deployed routers kept working, but new ones did not.
  2. My OpenStack attaches drives as /dev/vdX, not /dev/sdX, so I had to specify that in my main.tf as well (roughly as sketched below).
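
Roughly like this in my main.tf (variable name paraphrased; check the module's inputs for the exact name):

servers = [{
  name = "server-a"
  # ...
  # hypothetical input: the hypervisor exposes virtio disks as /dev/vdX,
  # so point the module at the right device instead of /dev/sdX
  rke2_volume_device = "/dev/vdb"
}]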

Now everything deploys correctly.

Still, no LBs are deployed; I reckon an LB will get created once an ingress controller is deployed on RKE2?

zifeo commented 1 month ago

@radumalica thanks for providing those useful details! Correct: the load balancer gets deployed when you create one from within Kubernetes (a Service of type LoadBalancer). Alternatively, you can provision one using IaC alongside the module and then set the loadbalancer.openstack.org/load-balancer-id annotation on your Services so they all share that same load balancer.
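
For example, a minimal sketch with the Terraform kubernetes provider (not part of this module; the LB ID and selector are placeholders):

resource "kubernetes_service_v1" "ingress" {
  metadata {
    name = "ingress"
    annotations = {
      # reuse an existing Octavia load balancer instead of creating a new one
      "loadbalancer.openstack.org/load-balancer-id" = "<existing-lb-uuid>"
    }
  }

  spec {
    type = "LoadBalancer"
    selector = {
      app = "ingress-nginx" # placeholder label for your ingress controller pods
    }
    port {
      port        = 443
      target_port = 443
    }
  }
}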