woodpecker-ci / woodpecker

Woodpecker is a simple, yet powerful CI/CD engine with great extensibility.
https://woodpecker-ci.org
Apache License 2.0

Memory Leak using woodpecker with kubernetes #4228

Open lara-clink opened 1 month ago

lara-clink commented 1 month ago

Component

agent

Describe the bug

I’ve been encountering what appears to be a memory leak issue when running Woodpecker CI on a Kubernetes cluster. After running pipelines over time, I noticed that the memory usage of the Woodpecker agents and server steadily increases, eventually leading to performance degradation and, in some cases, the need for manual intervention to prevent the system from becoming unresponsive.

Steps to reproduce

1. Deploy Woodpecker CI in a Kubernetes environment.
2. Run multiple pipelines continuously over an extended period.
3. Monitor memory usage of the Woodpecker agents and server; my Grafana memory usage graph is attached below.
4. Notice that memory consumption increases over time without being released after pipeline execution.

(attached: Grafana memory usage graph)

Expected behavior

Memory usage should stabilize after pipeline executions are completed, and unused memory should be reclaimed properly.

System Info

Woodpecker Version: 2.7.0
Kubernetes Version: v1.29.4
Environment: Running Woodpecker on a Kubernetes cluster
Number of agents: 10

Additional context

I am using Go profiling (pprof) to investigate; this is what I have found so far:

(attached: pprof screenshot and allocation graph)

Has anyone ever faced an issue like this?


zc-devs commented 1 month ago

Has anyone ever faced an issue like this?

Not me. But I don't have such a load (10 agents) :)

  1. When did it start / what was the behavior on previous versions? Have you tested 2.7.1 or next?
  2. How do you gather these pprof statistics? Is there a guide? I didn't find anything in the WP docs.
  3. Nice pprof info, but these screenshots are from an agent that allocated 44.36 MB of memory, if I understand correctly. However, Grafana shows memory usage around 1 GB, and that is the issue (I suppose). It would be nice to have pprof stats from that agent.
  4. What is the load? I mean WOODPECKER_MAX_WORKFLOWS, and how many do you run simultaneously? Could you explain the right half of the Grafana chart? Something like:
    • at this point we ran 1 pipeline with 10 workflows
    • at this point they all finished
    • at this point we ran another 10 pipelines with 1 workflow
    • at this point they finished and there was no load at all for the next hour
  5. What is the config of the Server? How many instances? What about the database? What is the load on the Server and database?
  6. Where do you store the pipeline (steps) logs?
lara-clink commented 1 month ago

Hey @zc-devs, we are currently working on our migration project (automated migration from Drone CI to Woodpecker) and I have not been able to collect all of the answers for you yet. By the end of this week I should be able to get back to this.

lara-clink commented 1 month ago

  1. We started using Woodpecker at 2.3.0 and have been facing memory leak issues ever since, so we cannot tell in which version the problem first appeared. We have not tested any version later than 2.7.0;

  2. We ran a version forked from 2.7.0 with profiling patched in; I used this tutorial to do it: https://hackernoon.com/go-the-complete-guide-to-profiling-your-code-h51r3waz (see the sketch after this list);

  3. There you go:

    (attached: two pprof screenshots from the agent)
  4. WOODPECKER_MAX_WORKFLOWS is 10 and we have 15 pods, so up to 150 workflows run simultaneously. The Grafana chart just shows that memory usage keeps increasing as we continue using Woodpecker. The low points only mean that we had a deployment and the pods restarted;

  5. The Server resources are: memory: 4Gi; requests: cpu: '2', memory: 4Gi.
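
For context, here is a minimal sketch of the kind of patch the tutorial above walks through, assuming the fork simply starts Go's built-in net/http/pprof handler next to the agent's normal work (the port and placement are illustrative, not Woodpecker's actual code):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the pprof endpoints on a side port; 6060 is just the conventional choice.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the agent's normal work would continue here ...
	select {}
}
```

A heap snapshot can then be pulled with `go tool pprof http://localhost:6060/debug/pprof/heap` and inspected with the `top` and `web` commands referenced later in this thread.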

zc-devs commented 1 month ago
  1. Thank you for the guide. Sadly, it's not very convenient to patch and build your own version. Could you make a PR with pprof functionality? It should be optional, behind a flag like WOODPECKER_PPROF_ENABLED: true|false (see the sketch after this list). It would be helpful for all users in the future.
  2. What are the versions of
    k8s.io/api
    k8s.io/apimachinery
    k8s.io/client-go

    in your fork? Have you tried updating them?
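
To make the suggestion concrete, a rough sketch of how such an opt-in flag might look; WOODPECKER_PPROF_ENABLED is only the name proposed above, not an existing Woodpecker setting:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // handlers are registered, but only served when the flag is set
	"os"
)

// maybeStartPprof starts the profiling endpoint only when the operator opts in
// via the (proposed, hypothetical) WOODPECKER_PPROF_ENABLED environment variable.
func maybeStartPprof() {
	if os.Getenv("WOODPECKER_PPROF_ENABLED") != "true" {
		return
	}
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
}

func main() {
	maybeStartPprof()
	// ... rest of the agent/server ...
	select {}
}
```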


Entertaining discussion. Even shared informers have been mentioned.

lara-clink commented 1 month ago

Those are:
k8s.io/api v0.30.2
k8s.io/apimachinery v0.30.2
k8s.io/client-go v0.30.2

and we have not tried updating them yet.
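
For reference, these three staging modules track Kubernetes minor releases together and are normally bumped in lockstep, so an update in a fork's go.mod would look roughly like this (illustrative excerpt; v0.31.2 matches the update described in the next comment):

```
require (
	k8s.io/api v0.31.2
	k8s.io/apimachinery v0.31.2
	k8s.io/client-go v0.31.2
)
```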

lara-clink commented 1 week ago

Hey @zc-devs, we did some tests updating the libs, first on our woodpecker-beta environment and now on production, and we got similar results in both tests.

This is the memory usage graph before updating the three libraries to v0.31.2:

(attached: Grafana memory usage graph before the update)

And this is after the update:

(attached: Grafana memory usage graph after the update)

As you can see, we still see memory-leak behavior: the pod allocates memory and never releases all of it. But there were some changes in our profiling results:

(attached: pprof output after the update)

Now only k8s.io/apimachinery shows up when we run the "top" command. Here is the graph we get when we run the "web" command:

(attached: pprof call graph)

In conclusion, we think the issue is in the k8s.io/apimachinery library.

zc-devs commented 1 week ago

we are still having memory leak behavior since the pod allocates memory and never releases it

(attached: memory usage graph)

Shared informers use a cache. So, 10 MB could be the Woodpecker Agent itself, and 30 MB could be the caches filled during the first pipeline run (10 + 30 = 40).
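
For illustration only, a minimal client-go sketch of the mechanism being referred to here: a shared informer keeps a local copy of every watched object in memory for as long as it runs, so its footprint grows with the size of the initial cache rather than with the number of pipelines. This is not Woodpecker's actual backend code:

```go
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// In-cluster config, as a pod running inside Kubernetes would use.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The factory LISTs and WATCHes the API server and mirrors every watched
	// object into an in-memory store that stays resident for the lifetime of
	// the informer.
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	stop := make(chan struct{})
	factory.Start(stop)

	// After the initial sync the cache is fully populated; from then on its
	// memory is held until the informer is stopped.
	cache.WaitForCacheSync(stop, podInformer.HasSynced)

	// ... a real program would use the cached listers here ...
	close(stop)
}
```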

Showing nodes accounting for 25 MB, 90% of 28 MB total

Am I missing something, or are you trying to measure the leak right at the start of the Pod? Perhaps I do not understand how pprof works; if so, please correct me.

Nice pprof info, but these screenshots are from an agent that allocated 44.36 MB of memory, if I understand correctly. However, Grafana shows memory usage around 1 GB, and that is the issue (I suppose). It would be nice to have pprof stats from that agent.

^ is still valid. Could you get pprof info when the Agent takes GIGAbytes of memory?

(attached: memory usage graph showing gigabytes of agent memory)

lara-clink commented 1 week ago

The graph has no relation to pprof. The point I am making with those graphs is that the memory is never fully released: as you can see in the first graph in your comment, the second release does not come back down to the same point as the first; it is always a little bit higher.