weaveworks / weave-gitops-enterprise

This repo provides the enterprise level features for the weave-gitops product, including CAPI cluster creation and team workspaces.
https://docs.gitops.weave.works/
Apache License 2.0

Enhance error messaging when clusters cannot be watched due to RBAC #2786

Closed enekofb closed 1 year ago

enekofb commented 1 year ago

From v0.22.0, the Explorer collector needs RBAC to watch leaf clusters.

When this RBAC is not present, the following messages show up:

{"level":"error","ts":"2023-04-27T16:30:03.991Z","msg":"Failed to get API Group-Resources","error":"unknown","stacktrace":"sigs.k8s.io/controller-runtime/pkg/cluster.New\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/cluster/cluster.go:161\nsigs.k8s.io/controller-runtime/pkg/manager.New\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/manager/manager.go:351\ngithub.com/weaveworks/weave-gitops-enterprise/pkg/query/collector.defaultNewWatcherManager\n\t/app/pkg/query/collector/watcher.go:158\ngithub.com/weaveworks/weave-gitops-enterprise/pkg/query/collector.(*DefaultWatcher).Start\n\t/app/pkg/query/collector/watcher.go:244\ngithub.com/weaveworks/weave-gitops-enterprise/pkg/query/collector.(*watchingCollector).Watch\n\t/app/pkg/query/collector/watching.go:142\ngithub.com/weaveworks/weave-gitops-enterprise/pkg/query/collector.(*watchingCollector).Start\n\t/app/pkg/query/collector/watching.go:20\ngithub.com/weaveworks/weave-gitops-enterprise/pkg/query/rolecollector.(*RoleCollector).Start\n\t/app/pkg/query/rolecollector/rolecollector.go:30\ngithub.com/weaveworks/weave-gitops-enterprise/pkg/query/server.NewServer\n\t/app/pkg/query/server/server.go:173\ngithub.com/weaveworks/weave-gitops-enterprise/pkg/query/server.Hydrate\n\t/app/pkg/query/server/server.go:200\ngithub.com/weaveworks/weave-gitops-enterprise/cmd/clusters-service/app.RunInProcessGateway\n\t/app/cmd/clusters-service/app/server.go:661\ngithub.com/weaveworks/weave-gitops-enterprise/cmd/clusters-service/app.StartServer\n\t/app/cmd/clusters-service/app/server.go:518\ngithub.com/weaveworks/weave-gitops-enterprise/cmd/clusters-service/app.NewAPIServerCommand.func3\n\t/app/cmd/clusters-service/app/server.go:196\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.6.1/command.go:916\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v1.6.1/command.go:1044\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.6.1/command.go:968\nmain.main\n\t/app/cmd/clusters-service/main.go:25\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}
Error: hydrating pipelines server: cannot start access rule collector: could not start role collector: cannot watch cluster: failed to start watcher for cluster default/prod: cannot create watcher manager: cannot create controller manager: unknown

These messages do not help a user understand the cause, the effect, or the fix.

This ticket is to address that.

AC

  1. Enhance error messaging around this scenario so a user can recover.
  2. If required after 1), add some info to the FAQ in the user documentation https://docs.gitops.weave.works/docs/explorer/operations/
  3. Apply changes to the staging cluster.

Notes

Scenarios

Given a leaf cluster connected to WGE without collector RBAC, when upgraded to 0.22 with Explorer enabled.

Current behaviour

Expected behaviour

enekofb commented 1 year ago

Reproduced locally by not allowing the gitopscluster service account to impersonate the collector service account in leaf-cluster-1:

Error: hydrating query server: cannot start access rule collector: could not start role collector: cannot watch cluster: failed to start watcher for cluster flux-system/leaf-cluster-1: cannot create watcher manager: cannot create controller manager: unknown

enekofb commented 1 year ago

Expected behaviour: an error is logged raising the issue, the app does not panic, and the app starts with a degraded Explorer experience.

The following logic will be implemented

Now:

Future -> in the context of reliability

enekofb commented 1 year ago

This error was also just seen in demo2.

It should provide more context:

{"level":"error","ts":"2023-05-02T09:15:13.299Z","msg":"cannot start watcher","error":"failed to wait for role caches to sync: timed out waiting for cache to be synced"}

enekofb commented 1 year ago

Enabled in staging:

https://github.com/weaveworks/weave-gitops-clusters/commit/378f8ccf991ac5094d5a13e96b3de477d96a00b8

And we can see that failures are being logged:

{"level":"error","ts":"2023-05-05T06:52:18.753Z","msg":"Failed to get API Group-Resources","error":"unknown"}
{"level":"error","ts":"2023-05-05T06:52:18.753Z","msg":"cannot watch cluster","cluster":"default/prod","error":"failed to start watcher for cluster default/prod: cannot create watcher manager: cannot create controller manager: unknown"}
{"level":"error","ts":"2023-05-05T06:52:18.763Z","msg":"Failed to get API Group-Resources","error":"unknown"}
{"level":"error","ts":"2023-05-05T06:52:18.763Z","msg":"cannot watch cluster","cluster":"flux-system/dev","error":"failed to start watcher for cluster flux-system/dev: cannot create watcher manager: cannot create controller manager: unknown"}

But the app is up and running
