Surface Kubernetes Conditions / Health issue in the UI

LutzLange commented 1 year ago

Every Kubernetes node lists health conditions that can indicate errors. We don't need to see the conditions all the time, but errors need to be available in the UI. I had a situation where a node was flapping between Ready and NotReady, and as a result Flux was no longer reconciling / was stuck in reconciliation.

Screenshot from 2022-11-22 11-11-55

Investigating further, I found that the Flux controllers were not running.

Screenshot from 2022-11-22 11-23-12

Moving to the CLI and inspecting the node events, I found out why:

Events:
  Type     Reason                Age                      From     Message
  ----     ------                ----                     ----     -------
  Warning  EvictionThresholdMet  3m40s (x76037 over 57d)  kubelet  Attempting to reclaim ephemeral-storage

We should make it easier to find out what is going on from our UI. If the Flux controllers are not running, an error should be surfaced when trying to work with applications from that cluster.

Node conditions should be listed on each cluster's page:

Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Mon, 21 Nov 2022 12:16:47 +0100   Mon, 21 Nov 2022 12:16:47 +0100   CiliumIsUp                   Cilium is running on this node
  MemoryPressure       False   Tue, 22 Nov 2022 11:26:47 +0100   Tue, 22 Nov 2022 11:26:47 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 22 Nov 2022 11:26:47 +0100   Tue, 22 Nov 2022 11:26:47 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 22 Nov 2022 11:26:47 +0100   Tue, 22 Nov 2022 11:26:47 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 22 Nov 2022 11:26:47 +0100   Tue, 22 Nov 2022 11:26:47 +0100   KubeletReady                 kubelet is posting ready status

A summary of the status would be enough:

- MemoryPressure:
- DiskPressure:
- PIDPressure:
- kubeletReady:
- StorageInfo:
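A minimal sketch of how such a summary could be derived, assuming the UI already has the node's `status.conditions` list (the same data shown in the table above). The function names and the "ok"/"problem" labels are hypothetical, not part of any existing code in the product. The only rule encoded here is that `Ready` is healthy when `True`, while the other conditions are healthy when `False`.

```python
# Sketch only: helper names and the "ok"/"problem" labels are assumptions.
# Each condition is a dict shaped like an entry of a Node's status.conditions.

# "Ready" is healthy when "True"; every other condition type
# (MemoryPressure, DiskPressure, ...) is treated as healthy when "False".
HEALTHY_STATUS = {"Ready": "True"}
DEFAULT_HEALTHY = "False"


def is_healthy(condition):
    """Return True if the condition reports its healthy value."""
    expected = HEALTHY_STATUS.get(condition["type"], DEFAULT_HEALTHY)
    # Any other value, including "Unknown", counts as unhealthy.
    return condition["status"] == expected


def summarize_conditions(conditions):
    """Map each condition type to "ok" or "problem"."""
    return {c["type"]: ("ok" if is_healthy(c) else "problem") for c in conditions}


def unhealthy_conditions(conditions):
    """Return only the conditions that should be surfaced as errors."""
    return [c for c in conditions if not is_healthy(c)]


if __name__ == "__main__":
    # Values taken from the node conditions listed above.
    node_conditions = [
        {"type": "NetworkUnavailable", "status": "False", "reason": "CiliumIsUp"},
        {"type": "MemoryPressure", "status": "False", "reason": "KubeletHasSufficientMemory"},
        {"type": "DiskPressure", "status": "False", "reason": "KubeletHasNoDiskPressure"},
        {"type": "PIDPressure", "status": "False", "reason": "KubeletHasSufficientPID"},
        {"type": "Ready", "status": "True", "reason": "KubeletReady"},
    ]
    print(summarize_conditions(node_conditions))
    print(unhealthy_conditions(node_conditions))
```

With this split, the cluster page would only need to render the result of `unhealthy_conditions` as errors and could keep the full summary behind a detail view.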

darrylweaver commented 1 year ago

Our UI is not an observability tool with enough features for enterprises. That functionality already exists in tools that most organisations use, such as Prometheus, Grafana and Alertmanager, which we nearly always install but don't configure well on the management cluster. This problem wouldn't exist if we had those observability tools configured to notify us of issues. We should have a Slack channel for:

- Flux notifications
- Alertmanager alerts

We can also populate the metadata with more dashboard links from Grafana, e.g. a node health dashboard, so it is only a click away. We might also want to summarise any errors we see from Alertmanager in the WGE UI, so that they are surfaced and prompt you to click on the Grafana dashboard link.
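As a rough sketch of that summarising step, assuming the alerts have already been fetched as a list of objects in the shape returned by Alertmanager's v2 alerts endpoint (each with `labels` and a `status.state` field). The function name, the output format and the alert names in the example are hypothetical; how the result would be wired into the WGE UI is left open.

```python
from collections import Counter

# Sketch only: summarize_active_alerts and its output format are assumptions.
# Each alert is a dict with "labels" and "status" keys, as in the list
# returned by Alertmanager's v2 alerts API.


def summarize_active_alerts(alerts):
    """Count active alerts per alertname and return one line per name."""
    counts = Counter(
        alert.get("labels", {}).get("alertname", "unknown")
        for alert in alerts
        # Skip alerts that are suppressed (silenced/inhibited) or unprocessed.
        if alert.get("status", {}).get("state") == "active"
    )
    return [f"{name}: {count} firing" for name, count in sorted(counts.items())]


if __name__ == "__main__":
    # Alert names below are made up for illustration.
    example_alerts = [
        {"labels": {"alertname": "NodeDiskPressure", "node": "node-1"},
         "status": {"state": "active"}},
        {"labels": {"alertname": "NodeDiskPressure", "node": "node-2"},
         "status": {"state": "active"}},
        {"labels": {"alertname": "FluxReconcileStuck"},
         "status": {"state": "active"}},
        {"labels": {"alertname": "NodeMemoryPressure"},
         "status": {"state": "suppressed"}},
    ]
    for line in summarize_active_alerts(example_alerts):
        print(line)
```

Each summary line could then sit next to the relevant Grafana dashboard link from the cluster metadata, so the UI only points at the problem and the dashboard carries the detail.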

darrylweaver commented 1 year ago

This is a matter of configuring existing tools and linking to them, rather than changing our product.

darrylweaver commented 1 year ago

Closing this issue, as this is really about deploying a working Prometheus, Grafana and Alertmanager setup that notifies us of issues on Slack.

weaveworks / sa-demos

Surface Kubernetes Conditions / Health issue in the UI #71