weaveworks / weave-gitops

Weave GitOps provides insights into your application deployments, and makes continuous delivery with GitOps easier to adopt and scale across your teams.
https://docs.gitops.weave.works/
Apache License 2.0

GitOps Run uses wrong kubectl context when cleaning up #3456

Closed. makkes closed this issue 1 year ago.

makkes commented 1 year ago

Describe the bug

@kingdonb says "I switch my cluster context with kubectx and the gitops run command insists on approval to run on a specific cluster context, but when I've switched the context and Ctrl+C later, it doesn't seem to remember that we did that, it thinks the current context is the correct one, and things don't get cleaned up."
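
A minimal sketch of the sequence being described, with hypothetical context names and app path (the flags are the ones shown in the run output later in this issue):

$ gitops beta run ./app --allow-k8s-context=cluster-a   # approve and start the run against cluster-a
$ kubectx cluster-b                                     # in another terminal, switch the active context
  # Ctrl+C the run: cleanup resolves the currently active context (cluster-b)
  # instead of the approved one (cluster-a), so resources on cluster-a are left behind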

Expected behavior

GitOps Run should remember the context it was started and approved on, and run its cleanup against that context regardless of which kubectl context is currently active.

opudrovs commented 1 year ago

@kingdonb thank you for reporting the issue! Could you please add the following info?

This will help me to investigate the issue.

kingdonb commented 1 year ago

Thanks for looking at this! I'll try to reproduce it now on the latest version, and attach the requested information.

kingdonb commented 1 year ago

I don't want to open a new issue for every UI quirk in gitops run, so while gathering that information I'll chronicle a few things that went wrong along the way; maybe we can close them all as a block. Sorry that this report is a bit disjointed and doesn't line up with the initial notes.

I think I've been unable to reproduce the issue this report was initially about, which is good news, but I ran into more issues along the way. Rather than recycle the issue document, let's try to nail down some of the other UX issues I'm experiencing.

The first thing that went wrong: I'm targeting a vcluster that runs on a vcluster, and having failed to pass --no-session on the first attempt, I wound up triggering #3591, which I aborted with ^C since this vcluster has no nodes of its own and the hostpath mapper will fail to run. At that point some things were not getting cleaned up, and I have a feeling I incorrectly attributed those leftovers to the failed --no-session invocation.

On the second attempt, I noticed that I already had a fluent-bit HelmRelease from one of the prior attempts; the workflow correctly detected it and detected that it was ready, so nothing failed there. I also see a run-main-...-dirty-hostpath-mapper HelmRelease, which I surmise is from the session, and I delete it. (Perhaps I had mixed up session and no-session mode, and this is the stray resource this report was initially created about.)
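
Something like the following locates and removes such a stray release (a sketch; the elided release name and its namespace will differ per run):

$ flux get helmreleases --all-namespaces | grep hostpath-mapper
$ flux delete helmrelease run-main-...-dirty-hostpath-mapper -n <namespace>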

The ww-gitops dashboard Kustomization fails because this cluster was provisioned by WGE, and following the instructions I had already created a wego-admin-cluster-role ClusterRole and ClusterRoleBinding with the clusters-bases-kustomization. Helm can't create these resources without special hand-holding to ensure it takes them over.

I've figured out that I can apply Helm's ownership annotations and labels in the bases and reuse the ClusterRole and ClusterRoleBinding from that base config: https://github.com/kingdon-ci/fleet-infra/commit/affead386b8b2cbae5628afcb633349e4551c48b
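
Concretely, Helm (3.2+) will adopt pre-existing objects that carry its ownership metadata. The imperative equivalent of what that commit does in the bases looks roughly like this, assuming the dashboard release is named ww-gitops in flux-system (as the pod name in the run output below suggests):

$ kubectl annotate clusterrole wego-admin-cluster-role \
    meta.helm.sh/release-name=ww-gitops \
    meta.helm.sh/release-namespace=flux-system
$ kubectl label clusterrole wego-admin-cluster-role \
    app.kubernetes.io/managed-by=Helm
  # repeat the annotate/label for the companion ClusterRoleBinding;
  # Helm then takes ownership instead of refusing to create the objects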

Unsure if that should be documented, is already documented, or if there is a better workaround possible. (Moving on...)

At this point, the path I targeted is syncing in the cluster and the dashboard is coming online. ^C cleans up after the workflow, and I can repeat gitops beta run ./podinfo-simple, which comes online A-OK and cleans up after itself A-OK.

$ kubectl config get-contexts
CURRENT   NAME           CLUSTER        AUTHINFO    NAMESPACE
          cluster-02     cluster-02     kubelogin   podinfo-flux-oci
          hephy-stg      hephy-stg      kubelogin
          howard-space   howard-space   kubelogin   botkube
          howard-stage   howard-stage   kubelogin
*         limnocentral   limnocentral   kubelogin   default
          management     management     kubelogin   vcluster-hephy-stg-turkey-local
          moo            moo            kubelogin   harbor

This is the vcluster within a vcluster that I'm using with gitops beta run --no-session. I pretend that something has gone wrong and, while my no-session run is still running against the limnocentral context, I switch to the parent cluster as if to check on "the reason why it's failing". Never mind that everything is working at this point; I want to see whether I can get it to fail to clean up after itself by changing contexts.

$ kubectx
✔ Switched to context "howard-space".
$ kubectl config get-contexts
CURRENT   NAME           CLUSTER        AUTHINFO    NAMESPACE
          cluster-02     cluster-02     kubelogin   podinfo-flux-oci
          hephy-stg      hephy-stg      kubelogin
*         howard-space   howard-space   kubelogin   botkube
          howard-stage   howard-stage   kubelogin
          limnocentral   limnocentral   kubelogin   default
          management     management     kubelogin   vcluster-hephy-stg-turkey-local
          moo            moo            kubelogin   harbor

This is in a second terminal, while gitops run is still running in the original one. After Ctrl+C, the resources from the --no-session gitops run are cleaned up (fluent-bit, run-dev-ks, run-dev-bucket, and the gitops-run namespace). The dashboard is left installed (I think that's on purpose; I can still port-forward into it and it works as expected), and Flux of course remains installed (thankfully, as it was also installed before gitops run).
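
(For reference, reattaching to the surviving dashboard is just a manual port-forward; the service name is assumed from the ww-gitops release, and 9001 matches the port the run forwards below:)

$ kubectl --context limnocentral -n flux-system port-forward svc/ww-gitops-weave-gitops 9001:9001

after which http://localhost:9001 serves the dashboard as before.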

$ gitops beta run podinfo-simple --no-session --allow-k8s-context=limnocentral
► Checking for a cluster in the kube config ...
► Explicitly allow GitOps Run on limnocentral context
► Checking if Flux is already installed ...
► Getting Flux version ...
✔ Flux &{v0.41.2  flux-system} is already installed
► Checking namespace gitops-run ...
✔ Created namespace gitops-run
► Checking service gitops-run/run-dev-bucket ...
✔ Created service gitops-run/run-dev-bucket
► Checking deployment gitops-run/run-dev-bucket ...
✔ Created deployment gitops-run/run-dev-bucket
► Waiting for deployment run-dev-bucket to be ready ...
► Port forwarding to pod gitops-run/run-dev-bucket-58fb69496c-mc2tr ...
✔ Port forwarding for run-dev-bucket is ready.
► creating HelmRepository flux-system/fluent
► creating HelmRelease flux-system/fluent-bit
► creating HelmRelease flux-system/fluent-bit
► waiting for HelmRelease flux-system/fluent-bit to be ready
✔ HelmRelease flux-system/fluent-bit is ready
► Checking if GitOps Dashboard is already installed ...
✔ GitOps Dashboard is found
► Request reconciliation of dashboard (timeout 5m0s) ...
✔ Dashboard reconciliation is done.
► Port forwarding to pod flux-system/ww-gitops-weave-gitops-748cb58f74-zr52m ...
✔ Port forwarding for dashboard is ready.
► Checking secret run-dev-bucket-credentials ...
✔ Created secret run-dev-bucket-credentials
✔ Secret run-dev-bucket-credentials already existed
► Checking bucket source run-dev-bucket ...
✔ Created source run-dev-bucket
✔ Source run-dev-bucket already existed
► Checking Kustomization run-dev-ks ...
✔ Created Kustomization run-dev-ks
✔ Setup Bucket Source and Kustomization successfully
◎ Press Ctrl+C to stop GitOps Run ...
► 1 change events detected
► Validating files under examples/podinfo-simple/ ...
► Refreshing bucket run-dev-bucket ...
...............
► Uploaded 159 files
► Request reconciliation of GitOps Run resources (timeout 5m0s) ...
✔ Reconciliation is done.

We set up port forwards for you, use the number below to open it in the browser

(1) ww-gitops: http://localhost:9001

► Received interrupt, quitting...
► Removing Fluent Bit HelmRelease flux-system/fluent-bit ...
► Waiting for HelmRelease flux-system/fluent-bit to be deleted...
✔ HelmRelease flux-system/fluent-bit deleted

► Deleting Kustomization run-dev-ks ...
✔ Deleted Kustomization run-dev-ks
► Deleting secret run-dev-bucket-credentials ...
✔ Deleted secret run-dev-bucket-credentials
► Deleting source run-dev-bucket ...
✔ Deleted source run-dev-bucket
✔ Cleanup Bucket Source and Kustomization successfully
► Removing namespace gitops-run ...
► Waiting for namespace gitops-run to be terminated ...
✔ Namespace gitops-run terminated

So, the initial report was a tiny bit overzealous, but there are still some improvements we can hopefully make.

(Should there be similar cleanup routines for a gitops session that fails? The hostpath mapper getting held up accounts for at least 60% of what I'm tripping over in total. Is there any way we could have detected that session mode wasn't going to work for me and defaulted to suggesting --no-session, along with the admonition to provide --allow-k8s-context and some hint about what that means? I understand now what a session is and what it's for, but it will still be quite inscrutable to a new user. A rough sketch of the kind of preflight check I mean follows.)
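
Something like this (a shell sketch; the flag names are from the run above, everything else is hypothetical):

# A node-less vcluster can't run the hostpath mapper, so suggest --no-session up front
if [ "$(kubectl get nodes --no-headers 2>/dev/null | wc -l)" -eq 0 ]; then
  echo "No nodes visible in this cluster; session mode's hostpath mapper cannot run here."
  echo "Hint: retry with gitops beta run <path> --no-session --allow-k8s-context=<context>"
fi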

kingdonb commented 1 year ago

I'm happy to close this since the report is mixed up, but I'll leave it open for now, mainly in case there is something in the details we'd like to follow up on in a separate issue (so it doesn't get lost), or to clarify about the report before it gets closed.

opudrovs commented 1 year ago

Thank you, @kingdonb, very useful notes! ✨

We'll discuss your comment tomorrow at the sync and will decide what new issues should be opened.

There is hope that some of these issues might be fixed by the PRs we are currently working on (improved dashboard detection by me, which will prevent GitOps Run from installing a dashboard if the enterprise dashboard is detected, and a cleanup fix by @chanwit), but we need to look into it in more detail and test it.

If any issues are still remaining, we'll let you know if we need additional information to reproduce them. Thank you again for such a detailed report!

kingdonb commented 1 year ago

Alright, thanks for following these issues up! 🎉

katya-makarevich commented 1 year ago

cannot reproduce

opudrovs commented 1 year ago

We'll re-test whether the additional reported issues can still be reproduced once the current PRs are merged, and will open new issues if needed.