lara-clink opened this issue 1 month ago
Hey @zc-devs, we are currently working on our migration project (an automated migration from Drone CI to Woodpecker), so I could not collect all of the answers for you yet. By the end of this week I should be able to get back to you on that.
Has anyone ever faced an issue like this?

Not me. But I have no such load (10 agents) :)

- When did it start / what is the behavior on previous versions? Have you tested on 2.7.1 and next?
- How do you gather these pprof statistics? Is there some guide? I didn't find anything in the WP docs.
- Nice pprof info, but these screens are from an agent which allocated 44.36 MB of memory, if I understand correctly. However, Grafana shows memory usage around 1 GB, and that is the issue (I suppose). It would be nice if you had pprof stats from the mentioned agent.
- What is the load? I mean WOODPECKER_MAX_WORKFLOWS, and how many do you run simultaneously? Could you explain the right half of the Grafana chart? Something like:
  - at this point we run 1 pipeline with 10 workflows
  - at this point they all finished
  - at this point we run another 10 pipelines with 1 workflow
  - at this point they finished and there was no load at all for the next hour
- What is the config of the Server? How many instances? What about the database? What is the load on the Server and the database?
- Where do you store the pipeline (steps) logs?
We started using Woodpecker at 2.3.0 and have been facing memory leak issues ever since, so we cannot tell in which version the problem first appeared. We have not tested any version later than 2.7.0;
We ran a forked version of 2.7.0 with profiling added; I used this tutorial to do it (see the sketch after these answers): https://hackernoon.com/go-the-complete-guide-to-profiling-your-code-h51r3waz;
There you go:
WOODPECKER_MAX_WORKFLOWS is 10 and we have 15 pods, so that is up to 150 workflows simultaneously. The Grafana chart simply shows that memory usage keeps increasing for as long as we keep using Woodpecker; the low points only mean that we did a deployment and the pods restarted;
limits:
  memory: 4Gi
requests:
  cpu: '2'
  memory: 4Gi
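Regarding the pprof setup in the forked 2.7.0 mentioned above: the tutorial's approach boils down to exposing the standard net/http/pprof handlers from inside the process. A minimal sketch of that kind of change (the listen address and the placement in main are assumptions, not the actual fork):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the profiling endpoints on a side port so they do not interfere
	// with the agent's normal work.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the agent's startup would follow here ...
	select {}
}
```

With that in place, a heap profile can be inspected with, for example, go tool pprof http://localhost:6060/debug/pprof/heap (the top and web commands mentioned later in this thread are run inside that tool).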
Something like WOODPECKER_PPROF_ENABLED: true|false would be helpful in the future for all users.
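A sketch of how such a flag could be wired up; the function name, the true/false convention and the localhost:6060 address are assumptions for illustration, not existing Woodpecker code:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on the default mux
	"os"
)

// maybeStartPprof starts the profiling endpoint only when the proposed
// WOODPECKER_PPROF_ENABLED variable is set to "true".
func maybeStartPprof() {
	if os.Getenv("WOODPECKER_PPROF_ENABLED") != "true" {
		return
	}
	go func() {
		// Bound to localhost; reach it via `kubectl port-forward` when running in a pod.
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
}

func main() {
	maybeStartPprof()
	select {} // placeholder for the agent's real work
}
```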
Which versions of k8s.io/api, k8s.io/apimachinery and k8s.io/client-go are in your fork? Have you tried updating them?
An entertaining discussion. Even the shared informer has been mentioned.
Those are k8s.io/api v0.30.2, k8s.io/apimachinery v0.30.2 and k8s.io/client-go v0.30.2, and we have not tried updating them yet.
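For reference, updating those three dependencies comes down to bumping their require lines in the fork's go.mod and rebuilding; a sketch, with a newer release line (v0.31.2) chosen purely as an example target:

```
require (
	k8s.io/api v0.31.2
	k8s.io/apimachinery v0.31.2
	k8s.io/client-go v0.31.2
)
```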
Hey @zc-devs, we did some tests updating the libs, first on our woodpecker-beta environment and now on production, and we got similar results in both tests.

This is the memory usage graph before updating the three libraries to v0.31.2:

And this is after the update:

As you can see, we are still having memory leak behavior since the pod allocates memory and never releases it all. But we had some changes in our profiling results: now only k8s.io/apimachinery shows up when we run the "top" command, and here is the graph we get when we run the "web" command:

In conclusion, we think that the issue is in the k8s.io/apimachinery library.
we are still having memory leak behavior since the pod allocates memory and never releases it
Shared informers use a cache. So, 10 MB could be the Woodpecker Agent itself, and 30 MB could be the caches filled at the first pipeline run (10 + 30 = 40).
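For context, the client-go shared informer pattern being referred to looks roughly like the sketch below; the informer keeps a full copy of every watched object in an in-memory store, which is why a few tens of MB of steady-state cache after the first pipeline run would be expected rather than leaked. The resource type, namespace and resync period here are illustrative, not taken from the Woodpecker agent:

```go
package main

import (
	"context"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Assumes the program runs inside the cluster, like the agent does.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The factory's informers cache every watched object in memory; that cache
	// is retained for the lifetime of the process.
	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 30*time.Second, informers.WithNamespace("woodpecker")) // namespace is an assumption
	podInformer := factory.Core().V1().Pods().Informer()

	ctx := context.Background()
	factory.Start(ctx.Done())
	cache.WaitForCacheSync(ctx.Done(), podInformer.HasSynced)
}
```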
Showing nodes accounting for 25 MB, 90% of 28 MB total
Am I missing something, or are you trying to measure the leak at the start of the Pod? Perhaps I do not understand how pprof works; if so, please correct me.
Nice pprof info, but these screens are from an agent which allocated 44.36 MB of memory, if I understand correctly. However, Grafana shows memory usage around 1 GB and that is the issue (I suppose). It would be nice if you had pprof stats from the mentioned agent.
^ is still valid. Could you get pprof info when the Agent takes GIGAbytes of memory:
The graph does not have any relation to pprof. The point I am making with those graphs is that the memory is never fully released: as you can see in the first graph in your comment, the second release does not reach the same point as the first; it is always a little bit higher.
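For completeness, one way to act on the request above is to snapshot the heap profile at the exact moment Grafana shows the agent using gigabytes, and analyse it offline. A minimal one-shot fetcher, assuming the pprof endpoint sketched earlier is reachable on localhost:6060 (for example via kubectl port-forward); the output file name is arbitrary:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	// Grab the current heap profile from the agent's pprof endpoint.
	resp, err := http.Get("http://localhost:6060/debug/pprof/heap")
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch heap profile:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	// Save it with a timestamp so profiles taken at low and high RSS can be compared.
	name := fmt.Sprintf("heap-%d.pprof", time.Now().Unix())
	out, err := os.Create(name)
	if err != nil {
		fmt.Fprintln(os.Stderr, "create file:", err)
		os.Exit(1)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, "write profile:", err)
		os.Exit(1)
	}
	fmt.Println("saved", name)
}
```

Two such snapshots (one right after start, one when usage is high) can then be compared with go tool pprof -base heap-<early>.pprof heap-<late>.pprof to see what actually grew.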
Component
agent
Describe the bug
I’ve been encountering what appears to be a memory leak issue when running Woodpecker CI on a Kubernetes cluster. After running pipelines over time, I noticed that the memory usage of the Woodpecker agents and server steadily increases, eventually leading to performance degradation and, in some cases, the need for manual intervention to prevent the system from becoming unresponsive.
Steps to reproduce
1. Deploy Woodpecker CI in a Kubernetes environment.
2. Run multiple pipelines continuously over an extended period.
3. Monitor memory usage of the Woodpecker agents and server (I will attach my Grafana memory usage graph).
4. Notice that memory consumption increases over time without being released after pipeline execution.
Expected behavior
Memory usage should stabilize after pipeline executions are completed, and unused memory should be reclaimed properly.
System Info
Additional context
I am using Go profiling to try to find something about it; this is what I could find so far:
Has anyone ever faced an issue like this?
Validations
- Checked that the bug isn't fixed in the next version already [https://woodpecker-ci.org/faq#which-version-of-woodpecker-should-i-use]