LutzLange closed this issue 1 year ago
Our UI is not an observability tool with enough features for enterprises. That functionality already exists in tools most organisations use, such as Prometheus, Grafana and Alertmanager, which we nearly always install but don't configure well on the management cluster. This problem wouldn't exist if those observability tools were configured to notify us of issues. We should have a Slack channel for:
- Flux notifications
- Alertmanager alerts
We can also populate the metadata with more dashboard links from Grafana, e.g. a node health dashboard, so it is only a click away. We might also want to summarise any errors reported by Alertmanager in the WGE UI, so that they are surfaced and prompt you to click the Grafana dashboard link.
This is a matter of configuring existing tools and linking to them, rather than changing our product.
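To illustrate the idea of summarising Alertmanager errors, here is a minimal Python sketch. It assumes the input is the JSON list returned by Alertmanager's `GET /api/v2/alerts` endpoint, and that alerts carry the conventional `alertname` and `severity` labels (a convention, not a guarantee). The sample data and the function name `summarise_alerts` are hypothetical; nothing here reflects how WGE actually works.

```python
from collections import Counter

def summarise_alerts(alerts):
    """Condense a list of Alertmanager alerts into a small summary dict.

    Only alerts whose status.state is "active" are counted; silenced or
    inhibited alerts report the state "suppressed" and are skipped.
    """
    active = [a for a in alerts if a.get("status", {}).get("state") == "active"]
    # The "severity" label is an assumption; fall back to "unknown" if absent.
    by_severity = Counter(
        a.get("labels", {}).get("severity", "unknown") for a in active
    )
    names = sorted({a.get("labels", {}).get("alertname", "unnamed") for a in active})
    return {
        "active": len(active),
        "by_severity": dict(by_severity),
        "alertnames": names,
    }

# Hypothetical sample, shaped like a /api/v2/alerts response.
SAMPLE_ALERTS = [
    {"labels": {"alertname": "KubeNodeNotReady", "severity": "warning"},
     "status": {"state": "active"}},
    {"labels": {"alertname": "KubeNodeReadinessFlapping", "severity": "warning"},
     "status": {"state": "active"}},
    {"labels": {"alertname": "Watchdog"},
     "status": {"state": "suppressed"}},
]

if __name__ == "__main__":
    print(summarise_alerts(SAMPLE_ALERTS))
```

A summary like this could sit next to the Grafana dashboard link, so that a non-zero count prompts the click.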
Closing this issue, as it is really about deploying a working Prometheus, Grafana and Alertmanager setup that notifies us of issues on Slack.
Every Kubernetes node lists health conditions that can indicate errors. We don't need to see the conditions all the time, but errors need to be available in the UI. I had a situation where a node was flapping between Ready and NotReady. As a result, Flux was no longer reconciling; it was stuck in reconciliation.
Investigating further, I found that the Flux controllers were not running.
Moving to the CLI and inspecting the node conditions, I found out why:
We should make it easier to find out what is going on in our UI. If the Flux controllers are not running, an error should be surfaced when trying to work with applications from that cluster.
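The check described above could look roughly like the following Python sketch. It assumes the input is the parsed output of `kubectl get deployments -n flux-system -o json`, and that the cluster runs the four controllers of a default Flux install; both the controller list and the function name `controller_problems` are assumptions made for illustration.

```python
# Controllers of a default Flux install (an assumption; adjust per cluster).
EXPECTED_CONTROLLERS = (
    "source-controller",
    "kustomize-controller",
    "helm-controller",
    "notification-controller",
)

def controller_problems(deployment_list, expected=EXPECTED_CONTROLLERS):
    """Return {controller name: problem description} for unhealthy controllers."""
    seen = {}
    for dep in deployment_list.get("items", []):
        wanted = dep.get("spec", {}).get("replicas", 1)
        # Kubernetes omits status.readyReplicas when it is zero.
        ready = dep.get("status", {}).get("readyReplicas", 0)
        seen[dep["metadata"]["name"]] = (ready, wanted)

    problems = {}
    for name in expected:
        if name not in seen:
            problems[name] = "not found"
            continue
        ready, wanted = seen[name]
        # A controller scaled to zero is also treated as "not running".
        if ready < max(wanted, 1):
            problems[name] = f"{ready}/{wanted} replicas ready"
    return problems

# Hypothetical sample data for illustration only.
SAMPLE_DEPLOYMENTS = {
    "items": [
        {"metadata": {"name": "source-controller"},
         "spec": {"replicas": 1}, "status": {"readyReplicas": 1}},
        {"metadata": {"name": "kustomize-controller"},
         "spec": {"replicas": 1}, "status": {}},
        {"metadata": {"name": "helm-controller"},
         "spec": {"replicas": 1}, "status": {"readyReplicas": 1}},
    ]
}

if __name__ == "__main__":
    for name, problem in controller_problems(SAMPLE_DEPLOYMENTS).items():
        print(f"{name}: {problem}")
```

An empty result means all expected controllers are ready; anything else is a candidate for the error the UI should surface.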
Node conditions should be listed on the cluster page of each cluster:
A summary of the status is OK:
- MemoryPressure
- DiskPressure
- PIDPressure
- kubeletReady
- StorageInfo
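As a rough sketch of how such a summary could be derived, the Python below reduces node objects to their problem conditions only. It assumes the input is the parsed output of `kubectl get nodes -o json`, and treats `Ready` as healthy when `"True"` and every other condition type as healthy when `"False"`. The sample data and function names are hypothetical.

```python
# Condition types for which status "True" means healthy. For all others
# (MemoryPressure, DiskPressure, PIDPressure, ...) "True" signals a problem.
HEALTHY_WHEN_TRUE = {"Ready"}

def unhealthy_conditions(node):
    """Return (type, status, reason) for each condition that signals a problem."""
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        ctype = cond.get("type")
        status = cond.get("status")
        expected = "True" if ctype in HEALTHY_WHEN_TRUE else "False"
        # "Unknown" never matches the expected value, so it is reported too.
        if status != expected:
            problems.append((ctype, status, cond.get("reason", "")))
    return problems

def summarise_nodes(node_list):
    """Map node name to its problem conditions (an empty list means healthy)."""
    return {
        node["metadata"]["name"]: unhealthy_conditions(node)
        for node in node_list.get("items", [])
    }

# Hypothetical sample data for illustration only.
SAMPLE_NODES = {
    "items": [
        {"metadata": {"name": "worker-1"},
         "status": {"conditions": [
             {"type": "MemoryPressure", "status": "False",
              "reason": "KubeletHasSufficientMemory"},
             {"type": "DiskPressure", "status": "True",
              "reason": "KubeletHasDiskPressure"},
             {"type": "Ready", "status": "False", "reason": "KubeletNotReady"},
         ]}},
        {"metadata": {"name": "worker-2"},
         "status": {"conditions": [
             {"type": "DiskPressure", "status": "False",
              "reason": "KubeletHasNoDiskPressure"},
             {"type": "Ready", "status": "True", "reason": "KubeletReady"},
         ]}},
    ]
}

if __name__ == "__main__":
    for name, problems in summarise_nodes(SAMPLE_NODES).items():
        print(name, "OK" if not problems else problems)
```

Showing only the problem conditions keeps the cluster page quiet when everything is healthy, which matches the point that conditions need not be visible all the time.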