wiz-sec / charts

GNU General Public License v3.0

argo cd application for wiz-kubernetes-integration becomes OutOfSync after a while #273

Open dhorner71 opened 6 months ago

dhorner71 commented 6 months ago

some time after a successful install, argocd reports that the application (wiz-kubernetes-integration helm chart) is out of sync and unable to self-heal.

wiz-kubernetes-integration-wiz-admission-controller: argocd reports a manifest diff on rollme.webhookCert that it is unable to resolve or self-heal.

argocd sync logs: deleting wiz-auto-modify-connector service account

workaround: manually deleting the service account lets the sync resume successfully. this step seems to kick off the integration job, which then properly reinstalls all the respective resources.

environment:

app.kubernetes.io/chartName: wiz-admission-controller
app.kubernetes.io/instance: wiz-kubernetes-integration
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: wiz-admission-controller
app.kubernetes.io/version: '2.4'
helm.sh/chart: wiz-admission-controller-3.4.13

wiz helm chart 0.1.85
AWS EKS 1.27

dhorner71 commented 6 months ago

we currently have three separate clusters, two of which exhibit this behavior. after resyncing those two by manually deleting that service account, all three clusters have stayed healthy and in sync for 24 hours. i will continue to monitor. we will be deploying to 4 more clusters in the near future, so i'll be able to report their status soon.

dhorner71 commented 6 months ago

another 24 hours and no symptoms. closing ticket.

dhorner71 commented 5 months ago

4 out of our 8 clusters are reporting out of sync in argo cd this morning. will research and post relevant logs.

dhorner71 commented 5 months ago

wiz-kubernetes-integration-wiz-admission-controller logs:

{"level":"info","time":"2024-03-18T20:20:51.874568707Z","msg":"Auth data is expired, authenticating client","expiresAt":"2024-03-18T20:05:52.514527851Z","timeSinceExpired":"14m59.359973513s"}
{"level":"info","time":"2024-03-18T20:21:03.547774867Z","msg":"Auth data is expired, authenticating client","expiresAt":"2024-03-18T20:06:03.792072141Z","timeSinceExpired":"14m59.755684293s"}
{"level":"error","time":"2024-03-18T20:21:21.876081997Z","msg":"error posting token request to url=https://auth.app.wiz.io/oauth/token, status=, resp=","error":"Post \"https://auth.app.wiz.io/oauth/token\": dial tcp i/o timeout"}
{"level":"error","time":"2024-03-18T20:21:21.876359448Z","msg":"Failed to reauthenticate client","error":"failed authenticating with credentials: error posting token request: Post \"https://auth.app.wiz.io/oauth/token\": dial tcp i/o timeout"}
{"level":"info","time":"2024-03-18T20:21:21.876521311Z","msg":"Auth data is not cached, authenticating client"}
{"level":"error","time":"2024-03-18T20:21:33.549479388Z","msg":"error posting token request to url=https://auth.app.wiz.io/oauth/token, status=, resp=","error":"Post \"https://auth.app.wiz.io/oauth/token\": dial tcp i/o timeout"}
{"level":"error","time":"2024-03-18T20:21:33.54972353Z","msg":"Failed to reauthenticate client","error":"failed authenticating with credentials: error posting token request: Post \"https://auth.app.wiz.io/oauth/token\": dial tcp i/o timeout"}
{"level":"info","time":"2024-03-18T20:21:33.549820082Z","msg":"Auth data is not cached, authenticating client"}
{"level":"error","time":"2024-03-18T20:21:51.87763835Z","msg":"error posting token request to url=https://auth.app.wiz.io/oauth/token, status=, resp=","error":"Post \"https://auth.app.wiz.io/oauth/token\": dial tcp i/o timeout"}
{"level":"error","time":"2024-03-18T20:21:51.877897953Z","msg":"Failed to reauthenticate client","error":"failed authenticating with credentials: error posting token request: Post \"https://auth.app.wiz.io/oauth/token\": dial tcp i/o timeout"}
{"level":"error","time":"2024-03-18T20:22:03.550501554Z","msg":"error posting token request to url=https://auth.app.wiz.io/oauth/token, status=, resp=","error":"Post \"https://auth.app.wiz.io/oauth/token\": dial tcp i/o timeout"}
{"level":"error","time":"2024-03-18T20:22:03.550656581Z","msg":"Failed to reauthenticate client","error":"failed authenticating with credentials: error posting token request: Post \"https://auth.app.wiz.io/oauth/token\": dial tcp i/o timeout"}
h2_bundle.go:4527: http2: server: error reading preface from client read tcp> read: connection reset by peer
{"level":"info","time":"2024-03-18T21:20:51.875174756Z","msg":"Auth data is not cached, authenticating client"}
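
for anyone comparing their own controller logs against the ones above, here is a minimal python sketch that tallies entries by level and counts the "i/o timeout" errors. the function name summarize_auth_log is hypothetical, and the sample lines are copied from the log above; non-JSON lines (like the h2_bundle.go one) are skipped rather than parsed.

```python
import json
from collections import Counter


def summarize_auth_log(lines):
    """Tally JSON log entries by level and count dial timeouts.

    Hypothetical helper for illustration only; it is not part of
    the wiz charts or the admission controller.
    """
    levels = Counter()
    timeouts = 0
    skipped = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # e.g. the plain-text h2_bundle.go line
            skipped += 1
            continue
        if not isinstance(entry, dict):
            skipped += 1
            continue
        levels[entry.get("level", "unknown")] += 1
        if "i/o timeout" in entry.get("error", ""):
            timeouts += 1
    return {"levels": dict(levels), "timeouts": timeouts, "skipped": skipped}


# Three lines taken from the log excerpt above (raw strings keep the \" escapes).
SAMPLE = [
    r'{"level":"info","time":"2024-03-18T20:20:51.874568707Z","msg":"Auth data is expired, authenticating client","expiresAt":"2024-03-18T20:05:52.514527851Z","timeSinceExpired":"14m59.359973513s"}',
    r'{"level":"error","time":"2024-03-18T20:21:21.876081997Z","msg":"error posting token request to url=https://auth.app.wiz.io/oauth/token, status=, resp=","error":"Post \"https://auth.app.wiz.io/oauth/token\": dial tcp i/o timeout"}',
    "h2_bundle.go:4527: http2: server: error reading preface from client",
]

if __name__ == "__main__":
    print(summarize_auth_log(SAMPLE))
```

a high timeout count relative to the info entries would point at egress to auth.app.wiz.io being blocked rather than at bad credentials, but that is an interpretation, not something the logs state directly.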

looking into any additional network policy modifications needed based on these entries
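
one way to check the network path while adjusting policies is a plain TCP probe run from a pod on the same network as the controller. this is only a sketch: can_connect is a hypothetical helper, and port 443 is an assumption based on the https:// URL in the log errors above. it tests TCP reachability only, not TLS or authentication.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout.

    Hypothetical helper for illustration; DNS failures, refusals and
    timeouts are all reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

calling can_connect("auth.app.wiz.io", 443) from inside the cluster and getting False would be consistent with the "dial tcp i/o timeout" errors; True would suggest the timeouts are intermittent or specific to the controller pod.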

dhorner71 commented 5 months ago

we run an alternate pod ip scheme (calico) and found a comment in the admission controller values file (https://github.com/wiz-sec/charts/blob/master/wiz-admission-controller/values.yaml) about the webhook and the host network flag. i've set it to true and selected a port other than 10250. i guess i need to wait for the webhookCert to renew to see if this works.

Cr0n1c commented 5 months ago

I am experiencing this on all my clusters as well. Any chance we can get wiz team to look at this?

dhorner71 commented 5 months ago

submitted support ticket https://support.wiz.io/hc/en-us/requests/24387

dhorner71 commented 5 months ago

we are still experiencing these issues on several clusters. on most of our nonprod EKS clusters, we scale to 0 nodes overnight and on weekends. these specific clusters cannot successfully hydrate pods in the morning and report an unhealthy state.