weaveworks / weave-gitops-enterprise

This repo provides the enterprise level features for the weave-gitops product, including CAPI cluster creation and team workspaces.
https://docs.gitops.weave.works/
Apache License 2.0

Clusters Service pod grows in memory until it is evicted #3189

Closed jpellizzari closed 1 year ago

jpellizzari commented 1 year ago

https://weaveworks.slack.com/archives/CMDJFD3P0/p1691498803210609

image.png
Screen Shot 2023-08-08 at 8.46.17 AM.png

I suspect this is a result of the recent PolicyAgent events retention logic added here: https://github.com/weaveworks/weave-gitops-enterprise/pull/2973

enekofb commented 1 year ago

We could see memory usage in staging

https://gitops.internal-dev.wego-gke.weave.works/

Screenshot 2023-08-15 at 14 55 16
enekofb commented 1 year ago

Just disabled Explorer https://github.com/weaveworks/weave-gitops-clusters/commit/749cd941f5c6b6b93c7ea25ba06a82fe2c875746

enekofb commented 1 year ago

After disabling Explorer:

Screenshot 2023-08-15 at 15 33 52

Now waiting to see how the picture looks in a few hours.

enekofb commented 1 year ago

With Explorer disabled, we can see the same memory usage trend but with a lower footprint. This means that even though Explorer may contribute to a faster memory leak, judging by the graph it does not seem to be the root cause.

Screenshot 2023-08-16 at 07 36 39
jpellizzari commented 1 year ago

@enekofb Since there appears to be a linear correlation when Explorer is turned on, could it be our metrics? Enabling Explorer would add lots of metrics, which would match a "contributing but not causal" pattern.

Maybe some high cardinality metrics are being added.
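
One quick way to check that (a throwaway helper, not part of WGE; the port and endpoint are assumptions) is to scrape the /metrics endpoint and count the series per metric family:

    package main

    import (
        "fmt"
        "net/http"

        "github.com/prometheus/common/expfmt"
    )

    func main() {
        // Assumes the clusters-service metrics port is reachable locally,
        // e.g. via kubectl port-forward.
        resp, err := http.Get("http://localhost:8000/metrics")
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        var parser expfmt.TextParser
        families, err := parser.TextToMetricFamilies(resp.Body)
        if err != nil {
            panic(err)
        }
        // Unusually large series counts point at high-cardinality labels.
        for name, mf := range families {
            fmt.Printf("%6d %s\n", len(mf.GetMetric()), name)
        }
    }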

jpellizzari commented 1 year ago

Go profiling tools: https://pkg.go.dev/runtime/pprof
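
A minimal sketch of wiring this in (assuming net/http/pprof; this is not the actual clusters-service code, just the shape of it):

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
    )

    func main() {
        // Serve the profiling endpoints on a local-only port, separate from
        // the product API.
        go func() {
            log.Println(http.ListenAndServe("localhost:6060", nil))
        }()
        select {} // stand-in for the rest of the service
    }

A heap profile can then be pulled and browsed with go tool pprof -http=:8080 http://localhost:6060/debug/pprof/heap, which produces the kind of flame graph and top views shown later in this thread.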

enekofb commented 1 year ago

I looked at staging today with this URL http://localhost:3000/d/a164a7f0339f99e89cea5cb47e9be617/kubernetes-compute-resources-workload?orgId=1&var-datasource=default&var-cluster=&var-namespace=flux-system&var-type=deployment&var-workload=weave-gitops-enterprise-mccp-cluster-service&from=1692186893267&to=1692279200178

This signals that WGE with Explorer (in red) not only has a higher memory footprint (which is expected) but is also a major factor in the memory usage increase over time (flagging a potential memory leak, or at least a reason to review memory management).

Screenshot 2023-08-17 at 15 16 44

WGE without Explorer enabled (in green) shows the same trend of increasing memory usage, but at a lower rate.

Screenshot 2023-08-17 at 15 21 43
squaremo commented 1 year ago

Re the hypothesis that recording Event objects makes it leak faster, are you starting the measurement after the retention period has passed? Otherwise you may just be seeing the number of objects being stored increase.

(Alternatively: are there metrics for the number of objects under storage? controller-runtime might expose some for cache sizes, not sure. MBOI: https://github.com/linkerd/linkerd2/pull/6447)
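
One way to get such a metric (a sketch; the metric name and wiring are hypothetical, not existing WGE code) is a gauge that reports the document count of the backing bleve index:

    import (
        "github.com/blevesearch/bleve/v2"
        "github.com/prometheus/client_golang/prometheus"
    )

    // registerObjectCountMetric exposes the number of documents held in the
    // index so its growth can be plotted next to memory usage.
    func registerObjectCountMetric(idx bleve.Index, reg prometheus.Registerer) {
        reg.MustRegister(prometheus.NewGaugeFunc(prometheus.GaugeOpts{
            Name: "explorer_indexed_objects", // hypothetical metric name
            Help: "Number of documents currently stored in the index.",
        }, func() float64 {
            count, err := idx.DocCount()
            if err != nil {
                return -1
            }
            return float64(count)
        }))
    }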

jpellizzari commented 1 year ago

I had a thought that our index was just growing without ever cleaning up, so I attempted to add some brute force cleanup logic here:

https://github.com/weaveworks/weave-gitops-enterprise/compare/jp-bleve-mem
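
For illustration, a sketch of what brute-force cleanup against a bleve index could look like (the "timestamp" field and batch size are assumptions, not necessarily what the branch above does):

    import (
        "time"

        "github.com/blevesearch/bleve/v2"
    )

    // pruneOldDocs deletes documents older than the retention window, one
    // batch per call; it would be run periodically, e.g. from a time.Ticker.
    func pruneOldDocs(idx bleve.Index, retention time.Duration) error {
        cutoff := time.Now().Add(-retention)
        // Range query over an assumed "timestamp" date field.
        q := bleve.NewDateRangeQuery(time.Time{}, cutoff)
        q.SetField("timestamp")
        req := bleve.NewSearchRequest(q)
        req.Size = 1000
        res, err := idx.Search(req)
        if err != nil {
            return err
        }
        batch := idx.NewBatch()
        for _, hit := range res.Hits {
            batch.Delete(hit.ID)
        }
        return idx.Batch(batch)
    }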

enekofb commented 1 year ago

Using the following heap extracted from staging cluster

heap.zip

enekofb commented 1 year ago

Starting with these suspects:

Screenshot 2023-09-01 at 16 21 07
enekofb commented 1 year ago

Issues that might be related

enekofb commented 1 year ago

Looking at the flame graph, we can see that about half of the memory ends up in the following function:

Screenshot 2023-09-01 at 17 32 02

Tracing it up the stack, it seems to originate around the clusters Manager.

Given the issues and the evidence here, I think the suggested fix might help here too:

https://github.com/weaveworks/weave-gitops/pull/3983/files

jpellizzari commented 1 year ago

@enekofb Could you document the troubleshooting process and how you are getting these nice visualizations? Maybe an entry in the eng-handbook on the subject of profiling?

enekofb commented 1 year ago

Created a couple of environments for dev purposes.

enekofb commented 1 year ago

in https://wge-3189-fix.eng-sandbox.weave.works

We could see this memory usage over the past few hours

Screenshot 2023-09-04 at 11 47 57
enekofb commented 1 year ago

in https://wge-3189.eng-sandbox.weave.works

We could see this memory usage over the past few hours

Screenshot 2023-09-04 at 11 50 29
enekofb commented 1 year ago

https://wge-3189-fix.eng-sandbox.weave.works/

the past 24h look stable

Screenshot 2023-09-05 at 08 18 26

with heap.zip

Screenshot 2023-09-05 at 08 24 16

Next is ensuring that there is at least one leaf cluster (to have parity with staging)

With the cluster added, we can see an (expected) increase in memory

Screenshot 2023-09-05 at 08 52 07

and memory allocated to the tlsCache

Screenshot 2023-09-05 at 08 53 16

Leaving it for some time to see the memory usage trend.

jpellizzari commented 1 year ago

@enekofb What changed to stabilize the mem usage?

enekofb commented 1 year ago

End-of-day memory usage snapshot.

Environment: https://wge-3189-fix.eng-sandbox.weave.works/ with a leaf cluster wge-3189.

Current memory usage: we can see that memory still increases, but less steeply than before.

Screenshot 2023-09-05 at 17 38 53

With the following heap.zip

where tlsCache is now 5%, whereas before it was ~50%

Screenshot 2023-09-05 at 17 45 20

and the top memory consumers are no longer clearly related to cluster connections

Screenshot 2023-09-05 at 17 50 02

which means that either a) it is under control, or b) we are not hitting the scenario that generated the leak.

enekofb commented 1 year ago

@enekofb What changed to stabilize the mem usage?

This, for the TLS-related part:

@jpellizzari https://github.com/weaveworks/weave-gitops/pull/3983/files
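
For context, a minimal sketch of the kind of client reuse this is about (illustrative only; the types and key are assumptions, not the PR's code): keep one client per (cluster, user) instead of building a new rest.Config and client on every poll, so client-go's TLS transport cache stops accumulating entries.

    import (
        "sync"

        "k8s.io/client-go/rest"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // clientCache hands out one controller-runtime client per key (assumed
    // to be cluster name + user) and only builds a new one on a cache miss.
    type clientCache struct {
        mu      sync.Mutex
        clients map[string]client.Client
    }

    func newClientCache() *clientCache {
        return &clientCache{clients: map[string]client.Client{}}
    }

    func (c *clientCache) getOrCreate(key string, cfg *rest.Config) (client.Client, error) {
        c.mu.Lock()
        defer c.mu.Unlock()
        if cl, ok := c.clients[key]; ok {
            return cl, nil
        }
        cl, err := client.New(cfg, client.Options{})
        if err != nil {
            return nil, err
        }
        c.clients[key] = cl
        return cl, nil
    }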

enekofb commented 1 year ago

b) we are not hitting the scenario that generated the leak

We are creating a leaf cluster in staging to exercise the cluster manager with a similar setup:

https://wge-3189-fix.eng-sandbox.weave.works/clusters

Screenshot 2023-09-05 at 17 55 34

Looking to see how the picture looks tomorrow.

enekofb commented 1 year ago

After last night we have:

For https://wge-3189-fix.eng-sandbox.weave.works, memory usage is still trending upwards

Screenshot 2023-09-06 at 07 05 25

with a lower memory footprint for transport.(*tlsTransportCache).get compared to https://github.com/weaveworks/weave-gitops-enterprise/issues/3189#issuecomment-1703021599

Screenshot 2023-09-06 at 07 34 52

heap.zip

which seems to indicate that a) the change had an effect on memory consumption, but b) there are other places where the same leak occurs and/or c) there might be other leaks

following the trace for b)

enekofb commented 1 year ago

Applied and released a fix to cluster.WithFlowControl:

https://github.com/weaveworks/weave-gitops/pull/3983/files#diff-e673a2dd89915928a748a8f7927e123d6d163b51ebf697d07a860a21ee6dde96

enekofb commented 1 year ago

This is how it looks in terms of top memory usage for Type: inuse_space

Screenshot 2023-09-06 at 08 15 52

and this is how it looks for Type: alloc_space

Screenshot 2023-09-06 at 08 17 36

Given that the Type: alloc_space figure (945018) corresponds to the memory shown in Grafana, it seems we need to tackle the question of why the memory is not being freed by the GC.
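
One way to approach that question (a debugging sketch, not production code) is to compare what the Go runtime itself reports: heap actually in use versus memory the runtime holds on to without returning it to the OS.

    import (
        "log"
        "runtime"
        "time"
    )

    // logMemStats periodically prints how much heap is live versus how much
    // the runtime keeps reserved without returning it to the OS.
    func logMemStats() {
        for range time.Tick(time.Minute) {
            var m runtime.MemStats
            runtime.ReadMemStats(&m)
            log.Printf("heapInuse=%dMiB heapIdle=%dMiB heapReleased=%dMiB numGC=%d",
                m.HeapInuse>>20, m.HeapIdle>>20, m.HeapReleased>>20, m.NumGC)
        }
    }

If heapIdle minus heapReleased stays large while heapInuse is flat, the memory is reclaimable and runtime/debug.FreeOSMemory can confirm it; also note that alloc_space in pprof is cumulative (it counts freed allocations too), so it will always grow.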

jpellizzari commented 1 year ago

Might be worth trying: https://stackoverflow.com/a/31754889

    // Disable HTTP keep-alives so idle connections are closed instead of
    // being kept open and accumulating.
    var transport http.RoundTripper = &http.Transport{
        DisableKeepAlives: true,
    }
enekofb commented 1 year ago

Looking at

https://wge-3189.eng-sandbox.weave.works/clusters which runs Weave GitOps Enterprise 2023-03 0.19.0

We see memory usage going up here too, so Explorer contributes, but the leak definitely does not come only from Explorer.

Screenshot 2023-09-08 at 08 52 49
enekofb commented 1 year ago

Looking at https://wge-3189-fix.eng-sandbox.weave.works/ (with the Explorer and cluster manager fixes included), we can see this memory usage (including what seems to be a dip in memory - GC?)

Screenshot 2023-09-08 at 09 34 14

And the heap

heap.zip

Allocated objects show ~45% coming from indexer-related code

Screenshot 2023-09-08 at 09 40 25

and allocated space looks like this

Screenshot 2023-09-08 at 09 41 37

chasing this trace

enekofb commented 1 year ago

Looking beyond the in-use heap at other memory usage, we can see:

Screenshot 2023-09-08 at 11 19 48

which raises the hypothesis that it might not be the heap but goroutines that are holding the memory
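
A cheap way to test that hypothesis (a sketch; not WGE code) is to track the goroutine count over time alongside the memory graphs:

    import (
        "log"
        "runtime"
        "runtime/pprof"
        "time"
    )

    // watchGoroutines logs the goroutine count once a minute; a count that
    // only ever climbs in step with memory points at leaked goroutines
    // holding on to data.
    func watchGoroutines() {
        for range time.Tick(time.Minute) {
            log.Printf("goroutines=%d (profile count=%d)",
                runtime.NumGoroutine(), pprof.Lookup("goroutine").Count())
        }
    }

The go_goroutines gauge exported by client_golang's default Go collector gives the same number in Prometheus, and /debug/pprof/goroutine?debug=2 on the pprof endpoint shows the full stacks behind it.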

enekofb commented 1 year ago

we have disabled explorer and so far we can see:

Screenshot 2023-09-08 at 15 36 20

The hypothesis to observe: the ever-increasing memory usage is caused by an ever-increasing number of goroutines associated with Explorer.

enekofb commented 1 year ago

After the weekend:

Screenshot 2023-09-11 at 08 45 03 Screenshot 2023-09-11 at 08 51 29

heap 2.zip

Screenshot 2023-09-11 at 08 54 13 Screenshot 2023-09-11 at 08 51 29 Screenshot 2023-09-11 at 15 49 20

We could see the following:

Screenshot 2023-09-11 at 15 55 16 Screenshot 2023-09-11 at 15 56 15
enekofb commented 1 year ago

Used memory keeps increasing, so looking at the heap in more detail, what happens is:

This is not necessarily an issue per se; however, unbounded memory usage does not seem to be a pattern we want to keep, as it raises questions such as the following.

Potential actions

enekofb commented 1 year ago

Potential actions:

  1. Do nothing: set a minimum memory request of 1GB and a limit of 2GB for the cluster service and observe.

     Not acted on.

  2. Increase metrics retention beyond 24h (up to a week) to see how the GC behaves.

     Acted on: set to 72h.

  3. Determine the memory and object allocations per operation and see whether we could use memory more efficiently: reuse memory to avoid requesting more heap.

     Given that allocations happen in the context of polling, the polling period has been increased significantly to see the effect on allocations and therefore on memory.

  4. Understand why the GC / process does not release more memory, gathering information about it.

enekofb commented 1 year ago

After applying https://github.com/weaveworks/weave-gitops/pull/3983/commits/d18dc14ffa827ff1e860989fd88a8e366448f70c

we could see the following

Screenshot 2023-09-13 at 10 31 03

a clear impact on memory usage (it is increasing at a slower rate)

Screenshot 2023-09-13 at 10 37 47

and allocations have been reduced.

We can also see now that the allocations are coming from the same method, but from progressive delivery

Screenshot 2023-09-13 at 10 35 25

Note: the increase of the period has an impact on detecting changes, so we need to understand what a balanced value is.

enekofb commented 1 year ago

Applied the change to progressive delivery: https://github.com/weaveworks/weave-gitops-enterprise/commit/083de2122927da96303d4c1e853aa330adab73c1

we could see the same trend:

Screenshot 2023-09-14 at 09 03 07 Screenshot 2023-09-14 at 09 04 45

heap.zip

enekofb commented 1 year ago

Still unknown:

  1. Whether the latest changes help with the unbounded memory growth: we need more data points.

Suggested actions based on the latest understanding:

  1. In order to have a single, efficient, maintained layer for cluster management, ensure that all capabilities within Weave GitOps use the same cluster management logic: in this context, align progressive delivery to use the Weave GitOps cluster management (we likely need to extract it as a library).
  2. Review the fetcher to make it more memory efficient.
  3. If memory usage is still unbounded, create a follow-up issue to validate the hypothesis that this is not really an issue, but simply that the runtime does not free memory because no other process requires it and releasing it is an expensive operation. As part of WGE operations, set resource limits for WGE.
enekofb commented 1 year ago

After 72h of data we can see that heap memory usage still trends upwards, but at a slower rate than before

Screenshot 2023-09-18 at 10 25 19

with the following heap

heap.zip

enekofb commented 1 year ago

Hypotheses to potentially follow up, based on the previous and following graphs:

a) Reduce the number of connections coming from watchClusters (which is the major contributor to in-use objects -> live objects in the previous graph)

Screenshot 2023-09-18 at 18 28 00

b) See whether the number of connections is also increasing and, in that case, whether we could reuse connections rather than create new ones

Screenshot 2023-09-18 at 17 16 02

enekofb commented 1 year ago

a) Reduce the number of connections coming from watchClusters (which is the major contributor to in-use objects -> live objects in the previous graph)

Enabled cached clients: https://github.com/weaveworks/clusters-config/commit/088964122aed3a6eed8ec6ff266c98a9448d8068

with the following result:

Screenshot 2023-09-19 at 08 26 40

which indicates that the feature ends up in memory and goroutine leaks.

And heap.zip, which indicates that

50% of the memory comes from secrets

Screenshot 2023-09-19 at 08 51 51

with the following call chain

Screenshot 2023-09-19 at 08 53 10

As a result, it didn't help to address the original problem, so reverting.
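
If cached clients are revisited later, one mitigation worth noting (an assumption on my part, and it needs controller-runtime v0.15+, which may not match what WGE uses today) is to exclude Secrets from the client cache so every Secret in the leaf clusters is not mirrored in memory:

    import (
        corev1 "k8s.io/api/core/v1"
        "k8s.io/client-go/rest"
        "sigs.k8s.io/controller-runtime/pkg/cache"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // newClientWithoutSecretCache builds a cached client that still reads
    // Secrets live from the API server instead of keeping them in the cache.
    func newClientWithoutSecretCache(cfg *rest.Config, c cache.Cache) (client.Client, error) {
        return client.New(cfg, client.Options{
            Cache: &client.CacheOptions{
                Reader:     c,
                DisableFor: []client.Object{&corev1.Secret{}},
            },
        })
    }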

enekofb commented 1 year ago

a) Reduce the number of connections coming from watchClusters (which is the major contributor to in-use objects -> live objects in the previous graph)

Given the following path of in-use objects:

Screenshot 2023-09-19 at 09 54 18

we disabled flowControl to understand whether the leak comes from it.

We can see an initial time window of 15 minutes where the leaf cluster polling happens:

Screenshot 2023-09-19 at 10 01 39

where we see that it results in an increase in heap memory (to be confirmed whether this is a trend that persists)

After 18 hours we can see the following graph

Screenshot 2023-09-20 at 07 57 12

with the following heap.zip

that indicates that:

Screenshot 2023-09-20 at 10 16 08

this is our next target

enekofb commented 1 year ago

Summary of findings so far on the resource consumption increase, which is likely a leak and therefore contributes to the eviction:

Starting point:

As indicated by the ticket, we have WGE running where setting memory limits ends up in eviction.

Testing scenario

Tooling

Hypothesis and Findings

We walk through the set of hypotheses and findings that you can read in the comments of this ticket.

H1: Memory leak happens around Explorer due to event management

H2: Memory leak happens around the cluster manager due to a leak in the TLS cache

H3: Memory leak happens due to a leak related to cluster polling

H4: Memory leak happens due to a leak in the cluster manager's flow control feature

Issues found as a result of this investigation that contribute to leaks

enekofb commented 1 year ago

Based on the previous hypotheses, we worked it out further,

which indicates that:

we still have heap memory and live-object increases at a slower rate, with the max cumulative memory from our calls still coming from the cluster manager's cluster.getClientFromConfig; the hypothesis is that it still creates connections

Screenshot 2023-09-20 at 10 30 49

By looking into the cluster manager's monitoring, we found the following metric: create_server_client_total

Plotting it in our dashboard, we can see the increasing trend

Screenshot 2023-09-20 at 10 31 55

which indicates another leak

We can see that there is a cache intended to avoid creating clients every time, so we are debugging this part to see where it fails.

H5: Memory leak happens due to a leak in the cluster manager's server connection management

Actions: https://github.com/weaveworks/weave-gitops/commit/d97c5015819289787cfcd6120a24e10e9e96ff8c https://github.com/weaveworks/weave-gitops/commit/c6cc497d3c09bcadf019236ec2be8cb08b7e7d02

Results: TBD

enekofb commented 1 year ago

TTL wrongly set:

{"level":"info","ts":"2023-09-20T10:28:43.688Z","msg":"created server client","user":"weave-gitops-server","cluster":"flux-system/dev"}
{"level":"info","ts":"2023-09-20T10:28:43.688Z","msg":"set cached connection","user":"id=\"weave-gitops-server\" groups=[]","cluster":"flux-system/dev","cacheKey":13171024161262237980,"ttl":30}
{"level":"info","ts":"2023-09-20T10:28:43.688Z","msg":"found after set"}
{"level":"info","ts":"2023-09-20T10:28:43.688Z","msg":"set client","user":"weave-gitops-server","cluster":"flux-system/dev"}
W0920 10:29:04.889951       8 reflector.go:424] pkg/mod/k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: failed to list *v1beta2.HelmRepository: helmrepositories.source.toolkit.fluxcd.io is forbidden: User "system:serviceaccount:defa

TTL rightly set:

Defaulted container "clusters-service" out of: clusters-service, mount-ui-server (init)
I0920 11:51:10.791411       7 request.go:690] Waited for 1.035823111s due to client-side throttling, not priority and fairness, request: GET:https://10.100.0.1:443/apis/notification.toolkit.fluxcd.io/v1beta2?timeout=32s
{"level":"info","ts":"2023-09-20T11:51:12.125Z","msg":"set cached connection","user":"id=\"weave-gitops-server\" groups=[]","cluster":"management","cacheKey":14748168696684532311,"ttl":86400}
{"level":"info","ts":"2023-09-20T11:51:12.260Z","msg":"set cached connection","user":"id=\"weave-gitops-server\" groups=[]","cluster":"flux-system/dev","cacheKey":13171024161262237980,"ttl":86400}
{"level":"info","ts":"2023-09-20T11:51:12.260Z","msg":"set cached connection","user":"id=\"weave-gitops-server\" groups=[]","cluster":"flux-system/prod","cacheKey":2572091869195149177,"ttl":86400}

Now we can see connections being reused:

{"level":"info","ts":"2023-09-20T11:56:42.333Z","logger":"gitops-cluster-fetcher","msg":"Found clusters","clusters":["flux-system/dev","flux-system/prod"]}
{"level":"info","ts":"2023-09-20T11:56:42.388Z","msg":"getOrCreateClient","user":"weave-gitops-server","cluster":"flux-system/prod"}
{"level":"info","ts":"2023-09-20T11:56:42.388Z","msg":"getOrCreateClient","user":"weave-gitops-server","cluster":"management"}
{"level":"info","ts":"2023-09-20T11:56:42.388Z","msg":"get cached connection","user":"id=\"weave-gitops-server\" groups=[]","cluster":"flux-system/prod","cacheKey":2572091869195149177}
{"level":"info","ts":"2023-09-20T11:56:42.388Z","msg":"client found"}
{"level":"info","ts":"2023-09-20T11:56:42.388Z","msg":"found cached client"}
{"level":"info","ts":"2023-09-20T11:56:42.389Z","msg":"getOrCreateClient","user":"weave-gitops-server","cluster":"flux-system/dev"}
{"level":"info","ts":"2023-09-20T11:56:42.389Z","msg":"get cached connection","user":"id=\"weave-gitops-server\" groups=[]","cluster":"flux-system/dev","cacheKey":13171024161262237980}
{"level":"info","ts":"2023-09-20T11:56:42.389Z","msg":"client found"}
{"level":"info","ts":"2023-09-20T11:56:42.389Z","msg":"found cached client"}
{"level":"info","ts":"2023-09-20T11:56:42.389Z","msg":"get cached connection","user":"id=\"weave-gitops-server\" groups=[]","cluster":"management","cacheKey":14748168696684532311}
{"level":"info","ts":"2023-09-20T11:56:42.391Z","msg":"client found"}
{"level":"info","ts":"2023-09-20T11:56:42.391Z","msg":"found cached client"}

We can now see a flat connection count and memory usage decreasing

Screenshot 2023-09-20 at 13 59 57
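
For the record, a sketch of the TTL-keyed connection cache behaviour at play here (illustrative; the cache library and key format are stand-ins for whatever WGE actually uses): with a 30-second TTL every poll misses the cache and rebuilds the server client, while a 24h TTL lets the entry be reused.

    import (
        "time"

        gocache "github.com/patrickmn/go-cache"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // connections caches server clients per (user, cluster) key; the ttl is
    // what went wrong above: 30s instead of 86400s meant every poll rebuilt
    // the client and its transport.
    type connections struct {
        cache *gocache.Cache
        ttl   time.Duration
    }

    func newConnections(ttl time.Duration) *connections {
        return &connections{cache: gocache.New(ttl, 10*time.Minute), ttl: ttl}
    }

    func (c *connections) getOrCreate(key string, build func() (client.Client, error)) (client.Client, error) {
        if v, ok := c.cache.Get(key); ok {
            return v.(client.Client), nil // reuse the cached client
        }
        cl, err := build()
        if err != nil {
            return nil, err
        }
        c.cache.Set(key, cl, c.ttl)
        return cl, nil
    }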

enekofb commented 1 year ago

The last 17h show the following picture

Screenshot 2023-09-21 at 09 54 09

That pretty much shows it is stable:

enekofb commented 1 year ago

We will leave it running to get more hours of data, but if the trend remains we won't take further actions beyond recapping and opening follow-up tickets.

enekofb commented 1 year ago

After the weekend we can confirm a picture without an ever-growing heap or number of objects

Screenshot 2023-09-25 at 08 53 22

enekofb commented 1 year ago

wrapping up