We could see memory usage in staging
After disabling Explorer
Now waiting to see how the picture looks in a few hours
With Explorer disabled, we see the same memory usage trend but with a lower footprint. This means that even though Explorer may contribute to a faster memory leak, judging by the graph it does not seem to be the cause of it.
@enekofb Since it appears to be a linear correlation when explorer is turned on, could it be our metrics? Since explorer would add lots of metrics by being enabled, that would match a "contributing but not causal" pattern.
Maybe some high cardinality metrics are being added.
Go profiling tools: https://pkg.go.dev/runtime/pprof
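For reference, a minimal sketch of how a Go service like the clusters-service could expose those pprof endpoints over HTTP so heap profiles can be pulled from a running pod; the port and the port-forward workflow are assumptions for illustration, not what WGE currently ships:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// expose the profiling endpoints on a separate, non-public port
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ... rest of the service would run here
	select {} // placeholder: keep the process alive for this sketch
}

With that in place (and a kubectl port-forward to the pod), go tool pprof http://localhost:6060/debug/pprof/heap downloads a heap profile, and go tool pprof -http=:8080 on the saved profile opens the web UI with the flame graph view.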
I looked at staging today with this URL http://localhost:3000/d/a164a7f0339f99e89cea5cb47e9be617/kubernetes-compute-resources-workload?orgId=1&var-datasource=default&var-cluster=&var-namespace=flux-system&var-type=deployment&var-workload=weave-gitops-enterprise-mccp-cluster-service&from=1692186893267&to=1692279200178
That signals that WGE with Explorer (in red) not only has a higher memory footprint (which is expected) but is also a major factor in memory usage increasing over time (flagging a potential memory leak, or at least a reason to review memory management).
WGE without Explorer enabled (in green) shows the same trend of increasing memory usage, but at a lower rate.
Re the hypothesis that recording Event objects makes it leak faster, are you starting the measurement after the retention period has passed? Otherwise you may just be seeing the number of objects being stored increase.
(Alternatively: are there metrics for the number of objects under storage? controller-runtime might expose some for cache sizes, not sure. MBOI: https://github.com/linkerd/linkerd2/pull/6447)
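If no such metric exists yet, here is a minimal sketch of how one could be exposed with prometheus/client_golang; the metric name and the store interface are made up for illustration, not existing WGE code:

import "github.com/prometheus/client_golang/prometheus"

// registerStoreSizeMetric exposes the current number of objects held by the
// Explorer store/index as a gauge, so growth shows up on a dashboard.
func registerStoreSizeMetric(store interface{ Count() int }) {
	prometheus.MustRegister(prometheus.NewGaugeFunc(
		prometheus.GaugeOpts{
			Name: "explorer_store_objects",
			Help: "Number of objects currently held in the Explorer store.",
		},
		func() float64 { return float64(store.Count()) },
	))
}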
I had a thought that our index was just growing without ever cleaning up, so I attempted to add some brute force cleanup logic here:
https://github.com/weaveworks/weave-gitops-enterprise/compare/jp-bleve-mem
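For context, a rough sketch of what such brute-force cleanup could look like against a bleve index; the retention window, the way document timestamps are tracked, and the function name are all hypothetical and not taken from that branch:

import (
	"time"

	"github.com/blevesearch/bleve/v2"
)

// pruneIndex deletes documents older than the retention window from the index.
// indexedAt is a side map of document ID -> insertion time that the caller is
// assumed to maintain; bleve is only asked to delete the stale IDs in a batch.
func pruneIndex(idx bleve.Index, indexedAt map[string]time.Time, retention time.Duration) error {
	cutoff := time.Now().Add(-retention)
	batch := idx.NewBatch()
	for id, ts := range indexedAt {
		if ts.Before(cutoff) {
			batch.Delete(id)
			delete(indexedAt, id)
		}
	}
	return idx.Batch(batch)
}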
Starting with these suspects
Issues that might be related
Looking at the flame graph, we can see that about half of the memory is allocated in the following function
Tracing up the stack, it seems to originate around the clusters manager
Given the issues and the evidence here, I think the suggested fix might help out here too:
@enekofb Could you document the troubleshooting process and how you are getting these nice visualizations? Maybe an entry in the eng-handbook on the subject of profiling?
Created a couple of environments for dev purposes
in https://wge-3189-fix.eng-sandbox.weave.works
We could see this memory usage over the past few hours
in https://wge-3189.eng-sandbox.weave.works
We could see this memory usage over the past few hours
https://wge-3189-fix.eng-sandbox.weave.works/
past 24h seems stable
with heap.zip
Next is ensuring that there is at least one leaf cluster (to have parity with staging)
With the cluster added we can see an (expected) increase in memory
and memory allocated to the tlsCache
leaving it for some time to see memory usage trends
@enekofb What changed to stabilize the mem usage?
end of day memory usage snapshot
environment https://wge-3189-fix.eng-sandbox.weave.works/
with a leaf-cluster wge-3189
current memory usage: we can see memory still increasing, but less steeply than before
With the following heap.zip
where tlsCache is now ~5%, whereas before it was ~50%
and the top memory consumers are now not clearly related to cluster connections
which means that either a) it is under control, or b) we are not hitting the scenario that generated the leak
@enekofb What changed to stabilize the mem usage?
this is for the TLS-related part:
@jpellizzari https://github.com/weaveworks/weave-gitops/pull/3983/files
b) we are not hitting the scenario that generated the leak
we are creating leaf clusters in staging to exercise the cluster manager with a similar setup
https://wge-3189-fix.eng-sandbox.weave.works/clusters
we'll see how the picture looks tomorrow
after last night we have
For https://wge-3189-fix.eng-sandbox.weave.works Memory usage still trending upwards
with lower memory footprint for transport.(*tlsTransportCache).get
compared to https://github.com/weaveworks/weave-gitops-enterprise/issues/3189#issuecomment-1703021599
which seems to indicate that: a. the change had an effect in terms of memory consumption but b. there are other places where the same leak occurs and/or c. there might be other leaks
following the trace for b)
Applied and released fix to cluster.WithFlowControl
this is how it looks in terms of top memory usage for Type: inuse_space
and this is how it looks for Type: alloc_space
given that the Type: alloc_space value of 945018 corresponds to the memory shown in Grafana, it seems we need to tackle the question of why this memory is not freed by the GC
Might be worth trying: https://stackoverflow.com/a/31754889
var transport http.RoundTripper = &http.Transport{
DisableKeepAlives: true,
}
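Roughly, per that answer, such a transport would be plugged into whatever client talks to the leaf clusters; a minimal, self-contained sketch (the function name and the idea that a TLS config is passed in are assumptions, not the actual WGE code):

import (
	"crypto/tls"
	"net/http"
)

// newNonPoolingClient builds an HTTP client whose transport disables
// keep-alives, so finished requests do not leave idle TLS connections
// cached per leaf cluster.
func newNonPoolingClient(tlsConfig *tls.Config) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig:   tlsConfig,
			DisableKeepAlives: true,
		},
	}
}

The trade-off is a new TLS handshake per request, so this is more of a diagnostic lever than a fix.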
Looking at
https://wge-3189.eng-sandbox.weave.works/clusters
which runs Weave GitOps Enterprise 2023-03 0.19.0
We see memory usage going up here too, so Explorer contributes, but the increase definitely does not come only from Explorer
Looking at https://wge-3189-fix.eng-sandbox.weave.works/ (with the Explorer and cluster manager fixes included) we could see this memory usage (including what seems to be a dip in memory - GC?)
And the heap
With allocated objects showing ~45% coming from indexer-related code
and allocated space looking like
chasing this trace
Looking beyond heap in use at other memory usage, we could see:
which raises the hypothesis that it might not be the heap but goroutines that hold the memory
we have disabled explorer and so far we can see:
goroutines
the hypothesis to observe: the ever-increasing memory usage is caused by an ever-increasing number of goroutines associated with Explorer
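To observe that hypothesis without waiting for a full profile, a small sketch of periodic goroutine accounting (the interval and logging destination are arbitrary choices here):

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// watchGoroutines periodically logs the goroutine count and dumps a summary
// goroutine profile, so an ever-growing number of goroutines (and where they
// are parked) shows up directly in the pod logs.
func watchGoroutines(interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		log.Printf("goroutines: %d", runtime.NumGoroutine())
		if err := pprof.Lookup("goroutine").WriteTo(os.Stderr, 1); err != nil {
			log.Printf("writing goroutine profile: %v", err)
		}
	}
}

The same data is also available on demand from the /debug/pprof/goroutine endpoint if the pprof HTTP server sketched earlier is enabled.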
After the weekend:
We could see the following:
Used memory keeps increasing, so looking at the heap in more detail, what happens is:
This is not necessarily an issue per se; however, unbounded memory usage is not a pattern we want to keep, as it raises questions such as
Potential actions
Do nothing: set a minimum memory request of 1GB and a limit of 2GB for the cluster service and observe
not acted on
Increase metrics retention beyond 24h (up to a week) to see how GC behaves
acted on, set to 72h
Determine which operations allocate memory and objects, and see whether we could use memory more efficiently: reuse memory to avoid requesting more heap
given that allocations happen in the context of polling, we have increased the polling period to a high value to see the effect on allocations and therefore on memory
Understand why the GC / process does not release more memory, gathering info about it (see the sketch below):
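Two low-effort ways to gather that info: run the process with GODEBUG=gctrace=1 to get a GC log line per cycle, and/or periodically log runtime.MemStats to compare what the runtime considers live with what it has obtained from, and returned to, the OS. A minimal sketch of the latter (interval and log format are arbitrary):

import (
	"log"
	"runtime"
	"time"
)

// logMemStats periodically logs heap statistics: HeapAlloc is the live heap,
// HeapSys is what the runtime obtained from the OS for the heap, and
// HeapReleased is what it has already returned to the OS.
func logMemStats(interval time.Duration) {
	for range time.Tick(interval) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("heapAlloc=%dMiB heapInuse=%dMiB heapSys=%dMiB heapReleased=%dMiB numGC=%d",
			m.HeapAlloc>>20, m.HeapInuse>>20, m.HeapSys>>20, m.HeapReleased>>20, m.NumGC)
	}
}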
After applying https://github.com/weaveworks/weave-gitops/pull/3983/commits/d18dc14ffa827ff1e860989fd88a8e366448f70c
we could see the following
a clear impact on memory usage (which is now increasing at a slower rate)
and allocations that have been reduced.
We can also see now that the allocations come from the same method, but via progressive delivery
Note: increasing the period affects how quickly changes are detected, so we need to understand what a balanced value is.
Applied change to progressive delivery https://github.com/weaveworks/weave-gitops-enterprise/commit/083de2122927da96303d4c1e853aa330adab73c1
we could see the same trend:
Still unknown:
Suggested actions based on the latest understanding:
After 72h of data we can see that heap memory usage still trends upwards, at a slower rate than before
with the following heap
Hypotheses to potentially follow up, based on the previous and following graphs:
a) reduce the number of connections coming from watchClusters (the major contributor to in-use/live objects in the previous graph)
b) see whether the number of connections is also increasing and, if so, whether we could reuse connections instead of creating new ones
a) reduce the number of connections coming from watchClusters (the major contributor to in-use/live objects in the previous graph)
enabled cached clients https://github.com/weaveworks/clusters-config/commit/088964122aed3a6eed8ec6ff266c98a9448d8068
which looks as follows:
indicating that the feature ends up causing memory and goroutine leaks
And heap.zip that indicates that
50% of the memory comes from secrets
with the following call chain
as a result, it didn't help address the original problem, so we are reverting it
a) reduce the number of connections coming from watchClusters (the major contributor to in-use/live objects in the previous graph)
given the following path of in-use objects
we disabled flowControl to understand whether the leak comes from it
we can see an initial 15-minute time window where the leaf-cluster polling happens:
where it results in an increase in heap memory (to be confirmed whether this is a trend that persists)
After 18 hours we can see the following graph
with the following heap.zip
that indicates that:
cluster.getClientFromConfig
this is our next target
Summary of findings so far on the increasing resource consumption, which is likely a leak and therefore contributes to eviction:
Starting point:
As indicated by the ticket, we have WGE running where setting resource limits ends up in pod eviction.
Testing scenario
Tooling
Hypotheses and Findings
We walk through the set of hypotheses and findings that you can read in the comments of this ticket
H1: Memory leak happens around Explorer due to event management
H2: Memory leak happens around the cluster manager due to a leak in the TLS cache
H3: Memory leak happens due to a leak related to cluster polling
H4: Memory leak happens due to a leak in the cluster manager's flow control feature
Issues found during this investigation that contribute to leaks
Based on the previous hypotheses, we worked out the following
which indicates that:
we still have heap memory and live objects increasing at a slower rate, with the maximum cumulative memory from our calls still coming from the cluster manager's cluster.getClientFromConfig; the hypothesis is that it still creates new connections
By looking into the clusters manager monitoring we found the following metric, create_server_client_total
Plotting it in our dashboard, we can see the increasing trend
which indicates another leak
We can see there is a cache meant to avoid creating clients every time, so we are debugging this part to see where it fails.
H5: Memory leak happens due to a leak in the clusters manager's server connections management
Actions: https://github.com/weaveworks/weave-gitops/commit/d97c5015819289787cfcd6120a24e10e9e96ff8c https://github.com/weaveworks/weave-gitops/commit/c6cc497d3c09bcadf019236ec2be8cb08b7e7d02
Results: TBD
ttl wrongly set
{"level":"info","ts":"2023-09-20T10:28:43.688Z","msg":"created server client","user":"weave-gitops-server","cluster":"flux-system/dev"}
{"level":"info","ts":"2023-09-20T10:28:43.688Z","msg":"set cached connection","user":"id=\"weave-gitops-server\" groups=[]","cluster":"flux-system/dev","cacheKey":13171024161262237980,"ttl":30}
{"level":"info","ts":"2023-09-20T10:28:43.688Z","msg":"found after set"}
{"level":"info","ts":"2023-09-20T10:28:43.688Z","msg":"set client","user":"weave-gitops-server","cluster":"flux-system/dev"}
W0920 10:29:04.889951 8 reflector.go:424] pkg/mod/k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: failed to list *v1beta2.HelmRepository: helmrepositories.source.toolkit.fluxcd.io is forbidden: User "system:serviceaccount:defa
ttl correctly set
Defaulted container "clusters-service" out of: clusters-service, mount-ui-server (init)
I0920 11:51:10.791411 7 request.go:690] Waited for 1.035823111s due to client-side throttling, not priority and fairness, request: GET:https://10.100.0.1:443/apis/notification.toolkit.fluxcd.io/v1beta2?timeout=32s
{"level":"info","ts":"2023-09-20T11:51:12.125Z","msg":"set cached connection","user":"id=\"weave-gitops-server\" groups=[]","cluster":"management","cacheKey":14748168696684532311,"ttl":86400}
{"level":"info","ts":"2023-09-20T11:51:12.260Z","msg":"set cached connection","user":"id=\"weave-gitops-server\" groups=[]","cluster":"flux-system/dev","cacheKey":13171024161262237980,"ttl":86400}
{"level":"info","ts":"2023-09-20T11:51:12.260Z","msg":"set cached connection","user":"id=\"weave-gitops-server\" groups=[]","cluster":"flux-system/prod","cacheKey":2572091869195149177,"ttl":86400}
now we can see connections being reused
{"level":"info","ts":"2023-09-20T11:56:42.333Z","logger":"gitops-cluster-fetcher","msg":"Found clusters","clusters":["flux-system/dev","flux-system/prod"]}
{"level":"info","ts":"2023-09-20T11:56:42.388Z","msg":"getOrCreateClient","user":"weave-gitops-server","cluster":"flux-system/prod"}
{"level":"info","ts":"2023-09-20T11:56:42.388Z","msg":"getOrCreateClient","user":"weave-gitops-server","cluster":"management"}
{"level":"info","ts":"2023-09-20T11:56:42.388Z","msg":"get cached connection","user":"id=\"weave-gitops-server\" groups=[]","cluster":"flux-system/prod","cacheKey":2572091869195149177}
{"level":"info","ts":"2023-09-20T11:56:42.388Z","msg":"client found"}
{"level":"info","ts":"2023-09-20T11:56:42.388Z","msg":"found cached client"}
{"level":"info","ts":"2023-09-20T11:56:42.389Z","msg":"getOrCreateClient","user":"weave-gitops-server","cluster":"flux-system/dev"}
{"level":"info","ts":"2023-09-20T11:56:42.389Z","msg":"get cached connection","user":"id=\"weave-gitops-server\" groups=[]","cluster":"flux-system/dev","cacheKey":13171024161262237980}
{"level":"info","ts":"2023-09-20T11:56:42.389Z","msg":"client found"}
{"level":"info","ts":"2023-09-20T11:56:42.389Z","msg":"found cached client"}
{"level":"info","ts":"2023-09-20T11:56:42.389Z","msg":"get cached connection","user":"id=\"weave-gitops-server\" groups=[]","cluster":"management","cacheKey":14748168696684532311}
{"level":"info","ts":"2023-09-20T11:56:42.391Z","msg":"client found"}
{"level":"info","ts":"2023-09-20T11:56:42.391Z","msg":"found cached client"}
We can see a flat connection count now and memory usage decreasing
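To make the effect of the TTL explicit, here is a stripped-down sketch of the get-or-create pattern the logs above show; the type and field names are illustrative, not the actual clusters manager code. With the TTL effectively at 30 seconds, every polling cycle missed the cache and created a fresh client (and its connections); with 86400 seconds the cached entry is reused:

import (
	"sync"
	"time"
)

type cachedClient struct {
	client    any
	expiresAt time.Time
}

type clientCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]cachedClient
}

func newClientCache(ttl time.Duration) *clientCache {
	return &clientCache{ttl: ttl, entries: map[string]cachedClient{}}
}

// getOrCreate returns the cached client for key while its TTL has not expired,
// and only calls create (opening new connections) on a miss or expiry.
func (c *clientCache) getOrCreate(key string, create func() any) any {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[key]; ok && time.Now().Before(e.expiresAt) {
		return e.client // "found cached client"
	}
	client := create() // "created server client"
	c.entries[key] = cachedClient{client: client, expiresAt: time.Now().Add(c.ttl)}
	return client
}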
The last 17h shows the following picture
That shows things as pretty much stable:
We will leave it running to gather more hours of data, but if the trend holds we won't take further action beyond recapping and opening follow-up tickets.
After the weekend we can confirm a picture without an ever-growing heap or number of objects
wrapping up
https://weaveworks.slack.com/archives/CMDJFD3P0/p1691498803210609
I suspect this is a result of the recent PolicyAgent events retention logic added here: https://github.com/weaveworks/weave-gitops-enterprise/pull/2973