We could see memory usage in staging
After disabling Explorer
Now waiting to see how the picture looks in a few hours
With Explorer disabled, we see the same memory usage trend but with a lower footprint. This means that even though Explorer may contribute to a faster memory leak, judging by the graph it does not seem to be the cause of it.
@enekofb Since it appears to be a linear correlation when explorer is turned on, could it be our metrics? Since explorer would add lots of metrics by being enabled, that would match a "contributing but not causal" pattern.
Maybe some high cardinality metrics are being added.
Go profiling tools: https://pkg.go.dev/runtime/pprof
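For reference, a minimal sketch of how a Go service like the clusters-service could expose those pprof endpoints over HTTP so heap profiles can be pulled from a running pod; the port and the port-forward workflow are assumptions for illustration, not what WGE currently ships:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// expose the profiling endpoints on a separate, non-public port
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ... rest of the service would run here
	select {} // placeholder: keep the process alive for this sketch
}

With that in place (and a kubectl port-forward to the pod), go tool pprof http://localhost:6060/debug/pprof/heap downloads a heap profile, and go tool pprof -http=:8080 on the saved profile opens the web UI with the flame graph view.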
I looked at staging today with this URL http://localhost:3000/d/a164a7f0339f99e89cea5cb47e9be617/kubernetes-compute-resources-workload?orgId=1&var-datasource=default&var-cluster=&var-namespace=flux-system&var-type=deployment&var-workload=weave-gitops-enterprise-mccp-cluster-service&from=1692186893267&to=1692279200178
That signals that WGE with Explorer (in red) not only has a higher memory footprint (which is expected) but is also a major factor in memory usage increasing over time (flagging a potential memory leak, or at least a reason to review memory management).
WGE without Explorer enabled (in green) shows the same trend of increasing memory usage, but at a lower rate.
Re the hypothesis that recording Event objects makes it leak faster, are you starting the measurement after the retention period has passed? Otherwise you may just be seeing the number of objects being stored increase.
(Alternatively: are there metrics for the number of objects under storage? controller-runtime might expose some for cache sizes, not sure. MBOI: https://github.com/linkerd/linkerd2/pull/6447)
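If no such metric exists yet, here is a minimal sketch of how one could be exposed with prometheus/client_golang; the metric name and the store interface are made up for illustration, not existing WGE code:

import "github.com/prometheus/client_golang/prometheus"

// registerStoreSizeMetric exposes the current number of objects held by the
// Explorer store/index as a gauge, so growth shows up on a dashboard.
func registerStoreSizeMetric(store interface{ Count() int }) {
	prometheus.MustRegister(prometheus.NewGaugeFunc(
		prometheus.GaugeOpts{
			Name: "explorer_store_objects",
			Help: "Number of objects currently held in the Explorer store.",
		},
		func() float64 { return float64(store.Count()) },
	))
}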
I had a thought that our index was just growing without ever cleaning up, so I attempted to add some brute force cleanup logic here:
https://github.com/weaveworks/weave-gitops-enterprise/compare/jp-bleve-mem
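For context, a rough sketch of what such brute-force cleanup could look like against a bleve index; the retention window, the way document timestamps are tracked, and the function name are all hypothetical and not taken from that branch:

import (
	"time"

	"github.com/blevesearch/bleve/v2"
)

// pruneIndex deletes documents older than the retention window from the index.
// indexedAt is a side map of document ID -> insertion time that the caller is
// assumed to maintain; bleve is only asked to delete the stale IDs in a batch.
func pruneIndex(idx bleve.Index, indexedAt map[string]time.Time, retention time.Duration) error {
	cutoff := time.Now().Add(-retention)
	batch := idx.NewBatch()
	for id, ts := range indexedAt {
		if ts.Before(cutoff) {
			batch.Delete(id)
			delete(indexedAt, id)
		}
	}
	return idx.Batch(batch)
}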
Starting with these suspects
Issues that might be related
Looking at the flame graph, we can see that about half of the memory is allocated in the following function
Tracing up the stack, it seems to originate around the clusters manager
Given the issues and the evidence here, I think the suggested fix might help out here too:
@enekofb Could you document the troubleshooting process and how you are getting these nice visualizations? Maybe an entry in the eng-handbook on the subject of profiling?
Created a couple of environments for dev purposes
in https://wge-3189-fix.eng-sandbox.weave.works
We could see this memory usage over the past few hours
in https://wge-3189.eng-sandbox.weave.works
We could see this memory usage over the past few hours
https://wge-3189-fix.eng-sandbox.weave.works/
past 24h seems stable
with heap.zip
Next is ensuring that there is at least one leaf cluster (to have parity with staging)
With the cluster added we can see an (expected) increase in memory
and memory allocated to the tlsCache
leaving it for some time to see memory usage trends
@enekofb What changed to stabilize the mem usage?
end of day memory usage snapshot
environment https://wge-3189-fix.eng-sandbox.weave.works/
with a leaf-cluster wge-3189
current memory usage: we can see memory still increasing, but less steeply than before
With the following heap.zip
where tlsCache is now ~5%, whereas before it was ~50%
and the top memory consumers are now not clearly related to cluster connections
which means that either a) it is under control, or b) we are not hitting the scenario that generated the leak
@enekofb What changed to stabilize the mem usage?
this is for the TLS-related part:
@jpellizzari https://github.com/weaveworks/weave-gitops/pull/3983/files
b) we are not hitting the scenario that generated the leak
we are creating leaf clusters in staging to exercise the cluster manager with a similar setup
https://wge-3189-fix.eng-sandbox.weave.works/clusters
we'll see how the picture looks tomorrow
after last night we have
For https://wge-3189-fix.eng-sandbox.weave.works Memory usage still trending upwards
with lower memory footprint for transport.(*tlsTransportCache).get
compared to https://github.com/weaveworks/weave-gitops-enterprise/issues/3189#issuecomment-1703021599
which seems to indicate that: a. the change had an effect in terms of memory consumption but b. there are other places where the same leak occurs and/or c. there might be other leaks
following the trace for b)
Applied and released fix to cluster.WithFlowControl
this is how it looks in terms of top memory usage for Type: inuse_space
and this is how it looks for Type: alloc_space
given that the Type: alloc_space value of 945018 corresponds to the memory shown in Grafana, it seems we need to tackle the question of why this memory is not freed by the GC
Might be worth trying: https://stackoverflow.com/a/31754889
var transport http.RoundTripper = &http.Transport{
DisableKeepAlives: true,
}
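Roughly, per that answer, such a transport would be plugged into whatever client talks to the leaf clusters; a minimal, self-contained sketch (the function name and the idea that a TLS config is passed in are assumptions, not the actual WGE code):

import (
	"crypto/tls"
	"net/http"
)

// newNonPoolingClient builds an HTTP client whose transport disables
// keep-alives, so finished requests do not leave idle TLS connections
// cached per leaf cluster.
func newNonPoolingClient(tlsConfig *tls.Config) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig:   tlsConfig,
			DisableKeepAlives: true,
		},
	}
}

The trade-off is a new TLS handshake per request, so this is more of a diagnostic lever than a fix.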
Looking at
https://wge-3189.eng-sandbox.weave.works/clusters
which runs Weave GitOps Enterprise 2023-03 0.19.0
We see memory usage going up here too, so Explorer contributes, but the increase definitely does not come only from Explorer
Looking at https://wge-3189-fix.eng-sandbox.weave.works/ (with the Explorer and cluster manager fixes included) we could see this memory usage (including what seems to be a dip in memory - GC?)
And the heap
With allocated objects showing ~45% coming from indexer-related code
and allocated space looking like
chasing this trace
Looking beyond heap in use at other memory usage, we could see:
which raises the hypothesis that it might not be the heap but goroutines that hold the memory
we have disabled explorer and so far we can see:
goroutines
the hypothesis to observe: the ever-increasing memory usage is caused by an ever-increasing number of goroutines associated with Explorer
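To observe that hypothesis without waiting for a full profile, a small sketch of periodic goroutine accounting (the interval and logging destination are arbitrary choices here):

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// watchGoroutines periodically logs the goroutine count and dumps a summary
// goroutine profile, so an ever-growing number of goroutines (and where they
// are parked) shows up directly in the pod logs.
func watchGoroutines(interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		log.Printf("goroutines: %d", runtime.NumGoroutine())
		if err := pprof.Lookup("goroutine").WriteTo(os.Stderr, 1); err != nil {
			log.Printf("writing goroutine profile: %v", err)
		}
	}
}

The same data is also available on demand from the /debug/pprof/goroutine endpoint if the pprof HTTP server sketched earlier is enabled.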
After the weekend:
We could see the following:
Used memory keeps increasing, so looking at the heap in more detail, what happens is:
This is not necessarily an issue per se; however, unbounded memory usage is not a pattern we want to keep, as it raises questions such as
Potential actions
Do nothing: set a minimum memory request of 1GB and a limit of 2GB for the cluster service and observe
not acted on
Increase metrics retention beyond 24h (up to a week) to see how GC behaves
acted on, set to 72h
Determine which operations allocate memory and objects, and see whether we could use memory more efficiently: reuse memory to avoid requesting more heap
given that allocations happen in the context of polling, we have increased the polling period to a high value to see the effect on allocations and therefore on memory
Understand why the GC / process does not release more memory, gathering info about it (see the sketch below):
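Two low-effort ways to gather that info: run the process with GODEBUG=gctrace=1 to get a GC log line per cycle, and/or periodically log runtime.MemStats to compare what the runtime considers live with what it has obtained from, and returned to, the OS. A minimal sketch of the latter (interval and log format are arbitrary):

import (
	"log"
	"runtime"
	"time"
)

// logMemStats periodically logs heap statistics: HeapAlloc is the live heap,
// HeapSys is what the runtime obtained from the OS for the heap, and
// HeapReleased is what it has already returned to the OS.
func logMemStats(interval time.Duration) {
	for range time.Tick(interval) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("heapAlloc=%dMiB heapInuse=%dMiB heapSys=%dMiB heapReleased=%dMiB numGC=%d",
			m.HeapAlloc>>20, m.HeapInuse>>20, m.HeapSys>>20, m.HeapReleased>>20, m.NumGC)
	}
}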
After applying https://github.com/weaveworks/weave-gitops/pull/3983/commits/d18dc14ffa827ff1e860989fd88a8e366448f70c
we could see the following
a clear impact on memory usage (which is now increasing at a slower rate)
and allocations that have been reduced.
We can also see now that the allocations come from the same method, but via progressive delivery
Note: increasing the period affects how quickly changes are detected, so we need to understand what a balanced value is.
Applied change to progressive delivery https://github.com/weaveworks/weave-gitops-enterprise/commit/083de2122927da96303d4c1e853aa330adab73c1
we could see the same trend:
Still unknown:
Suggested actions based on the latest understanding:
After 72h of data we can see that heap memory usage still trends upwards, at a slower rate than before
with the following heap
Hypotheses to potentially follow up, based on the previous and following graphs:
a) reduce the number of connections coming from watchClusters (the major contributor to in-use/live objects in the previous graph)
b) see whether the number of connections is also increasing and, if so, whether we could reuse connections instead of creating new ones
a) reduce the number of connections coming from watchClusters (the major contributor to in-use/live objects in the previous graph)
enabled cached clients https://github.com/weaveworks/clusters-config/commit/088964122aed3a6eed8ec6ff266c98a9448d8068
which looks as follows:
indicating that the feature ends up causing memory and goroutine leaks
And heap.zip that indicates that
50% of the memory comes from secrets
with the following call chain
as a result, it didn't help address the original problem, so we are reverting it
a) reduce the number of connections coming from watchClusters (the major contributor to in-use/live objects in the previous graph)
given the following path of in-use objects
we disabled flowControl to understand whether the leak comes from it
we can see an initial 15-minute time window where the leaf-cluster polling happens:
where it results in an increase in heap memory (to be confirmed whether this is a trend that persists)
After 18 hours we can see the following graph
with the following heap.zip
that indicates that:
cluster.getClientFromConfig
this is our next target
Summary of findings so far on the increasing resource consumption, which is likely a leak and therefore contributes to eviction:
Starting point:
As indicated by the ticket, we have WGE running where setting resource limits ends up in pod eviction.
Testing scenario
Tooling
Hypotheses and Findings
We walk through the set of hypotheses and findings that you can read in the comments of this ticket
H1: Memory leak happens around Explorer due to event management
H2: Memory leak happens around the cluster manager due to a leak in the TLS cache
H3: Memory leak happens due to a leak related to cluster polling
H4: Memory leak happens due to a leak in the cluster manager's flow control feature
Issues found during this investigation that contribute to leaks
Based on the previous hypotheses, we worked out the following
which indicates that:
we still have heap memory and live objects increasing at a slower rate, with the maximum cumulative memory from our calls still coming from the cluster manager's cluster.getClientFromConfig; the hypothesis is that it still creates new connections
By looking into the clusters manager monitoring we found the following metric, create_server_client_total
Plotting it in our dashboard, we can see the increasing trend
which indicates another leak
We can see there is a cache meant to avoid creating clients every time, so we are debugging this part to see where it fails.
H5: Memory leak happens due to a leak in the clusters manager's server connections management
Actions: https://github.com/weaveworks/weave-gitops/commit/d97c5015819289787cfcd6120a24e10e9e96ff8c https://github.com/weaveworks/weave-gitops/commit/c6cc497d3c09bcadf019236ec2be8cb08b7e7d02
Results: TBD
ttl wrongly set
{"level":"info","ts":"2023-09-20T10:28:43.688Z","msg":"created server client","user":"weave-gitops-server","cluster":"flux-system/dev"}
{"level":"info","ts":"2023-09-20T10:28:43.688Z","msg":"set cached connection","user":"id=\"weave-gitops-server\" groups=[]","cluster":"flux-system/dev","cacheKey":13171024161262237980,"ttl":30}
{"level":"info","ts":"2023-09-20T10:28:43.688Z","msg":"found after set"}
{"level":"info","ts":"2023-09-20T10:28:43.688Z","msg":"set client","user":"weave-gitops-server","cluster":"flux-system/dev"}
W0920 10:29:04.889951 8 reflector.go:424] pkg/mod/k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: failed to list *v1beta2.HelmRepository: helmrepositories.source.toolkit.fluxcd.io is forbidden: User "system:serviceaccount:defa
ttl correctly set
Defaulted container "clusters-service" out of: clusters-service, mount-ui-server (init)
I0920 11:51:10.791411 7 request.go:690] Waited for 1.035823111s due to client-side throttling, not priority and fairness, request: GET:https://10.100.0.1:443/apis/notification.toolkit.fluxcd.io/v1beta2?timeout=32s
{"level":"info","ts":"2023-09-20T11:51:12.125Z","msg":"set cached connection","user":"id=\"weave-gitops-server\" groups=[]","cluster":"management","cacheKey":14748168696684532311,"ttl":86400}
{"level":"info","ts":"2023-09-20T11:51:12.260Z","msg":"set cached connection","user":"id=\"weave-gitops-server\" groups=[]","cluster":"flux-system/dev","cacheKey":13171024161262237980,"ttl":86400}
{"level":"info","ts":"2023-09-20T11:51:12.260Z","msg":"set cached connection","user":"id=\"weave-gitops-server\" groups=[]","cluster":"flux-system/prod","cacheKey":2572091869195149177,"ttl":86400}
now we can see connections being reused
{"level":"info","ts":"2023-09-20T11:56:42.333Z","logger":"gitops-cluster-fetcher","msg":"Found clusters","clusters":["flux-system/dev","flux-system/prod"]}
{"level":"info","ts":"2023-09-20T11:56:42.388Z","msg":"getOrCreateClient","user":"weave-gitops-server","cluster":"flux-system/prod"}
{"level":"info","ts":"2023-09-20T11:56:42.388Z","msg":"getOrCreateClient","user":"weave-gitops-server","cluster":"management"}
{"level":"info","ts":"2023-09-20T11:56:42.388Z","msg":"get cached connection","user":"id=\"weave-gitops-server\" groups=[]","cluster":"flux-system/prod","cacheKey":2572091869195149177}
{"level":"info","ts":"2023-09-20T11:56:42.388Z","msg":"client found"}
{"level":"info","ts":"2023-09-20T11:56:42.388Z","msg":"found cached client"}
{"level":"info","ts":"2023-09-20T11:56:42.389Z","msg":"getOrCreateClient","user":"weave-gitops-server","cluster":"flux-system/dev"}
{"level":"info","ts":"2023-09-20T11:56:42.389Z","msg":"get cached connection","user":"id=\"weave-gitops-server\" groups=[]","cluster":"flux-system/dev","cacheKey":13171024161262237980}
{"level":"info","ts":"2023-09-20T11:56:42.389Z","msg":"client found"}
{"level":"info","ts":"2023-09-20T11:56:42.389Z","msg":"found cached client"}
{"level":"info","ts":"2023-09-20T11:56:42.389Z","msg":"get cached connection","user":"id=\"weave-gitops-server\" groups=[]","cluster":"management","cacheKey":14748168696684532311}
{"level":"info","ts":"2023-09-20T11:56:42.391Z","msg":"client found"}
{"level":"info","ts":"2023-09-20T11:56:42.391Z","msg":"found cached client"}
We can see a flat connection count now and memory usage decreasing
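To make the effect of the TTL explicit, here is a stripped-down sketch of the get-or-create pattern the logs above show; the type and field names are illustrative, not the actual clusters manager code. With the TTL effectively at 30 seconds, every polling cycle missed the cache and created a fresh client (and its connections); with 86400 seconds the cached entry is reused:

import (
	"sync"
	"time"
)

type cachedClient struct {
	client    any
	expiresAt time.Time
}

type clientCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]cachedClient
}

func newClientCache(ttl time.Duration) *clientCache {
	return &clientCache{ttl: ttl, entries: map[string]cachedClient{}}
}

// getOrCreate returns the cached client for key while its TTL has not expired,
// and only calls create (opening new connections) on a miss or expiry.
func (c *clientCache) getOrCreate(key string, create func() any) any {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[key]; ok && time.Now().Before(e.expiresAt) {
		return e.client // "found cached client"
	}
	client := create() // "created server client"
	c.entries[key] = cachedClient{client: client, expiresAt: time.Now().Add(c.ttl)}
	return client
}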
The last 17h shows the following picture
That shows things as pretty much stable:
We will leave it running to gather more hours of data, but if the trend holds we won't take further action beyond recapping and opening follow-up tickets.
After the weekend we can confirm a picture without an ever-growing heap or number of objects
wrapping up
https://weaveworks.slack.com/archives/CMDJFD3P0/p1691498803210609
I suspect this is a result of the recent PolicyAgent events retention logic added here: https://github.com/weaveworks/weave-gitops-enterprise/pull/2973