woodpecker-ci / woodpecker

Woodpecker is a simple, yet powerful CI/CD engine with great extensibility.
https://woodpecker-ci.org
Apache License 2.0

Agents cleaning #3023

Closed · zc-devs closed this issue 6 months ago

zc-devs commented 8 months ago

Clear and concise description of the problem

Main problem described in #2027.

After rolling out "Unregister stateless agents from server on termination", abandoned agents are still present (some agents do not unregister themselves, because of network issues, for example):

(screenshot of the agents list with abandoned agents omitted)

Suggested solution

So, this is a request for a scheduled cleanup task, something like Gitea's cron jobs for maintenance activity.

Alternative

No response

Additional context

No response


qwerty287 commented 8 months ago

I don't really think we should do this. You can't know if the agents will be used again later, even if their last contact was months ago - we don't know about the use cases.

Implementing this as a standalone command should be pretty easy using the APIs, though, so I'd rather see something like that as the solution for this issue.

zc-devs commented 8 months ago

You can't know if the agents will be used again later, even if their last contact was months ago - we don't know about the use cases

In the case of #2027 they surely won't ever connect again. Of course, this should be disabled by default, and whoever wants it can set up a "retention policy".

using the APIs

I've thought about this too, but I considered it as a temporary solution.

Edit 1: If this is desirable only on Kubernetes, then we should probably leave the server as is and create a CronJob in K8s.

qwerty287 commented 8 months ago

I've thought about this too, but I considered it as a temporary solution.

I wouldn't say it's a temporary solution - why? That's one of the reasons APIs are so cool: you can extend the software to your needs even if other users probably won't use your approach. We can add the tool to https://woodpecker-ci.org/docs/next/awesome afterwards so others can find it too (if you don't want to write the tool - just ask, I can do it, it will probably only take a few minutes).

Edit 1: If this is desirable only on Kubernetes, then we should probably leave the server as is and create a CronJob in K8s.

I think this applies to all backends.

pat-s commented 8 months ago

I'm with @zc-devs, this should be done on the app-side and not via a user-required API action. If we start with this, we end up with a dozen actions which must be manually configured by users after deploying WP to have a "clean" deployment.

(I also just cleaned up hundreds of agents a few days ago)

I view this as an internal "garbage collection" action. Otherwise, the DB and UI will just grow indefinitely with unused agents (making at least the UI view unhelpful).

A "smart" solution could be to check for agents older than a specific age (which ideally can be configured) and unregister these. A default of "1 day" might be a good default.

qwerty287 commented 8 months ago

If we start with this, we end up with a dozen actions which must be manually configured by users after deploying WP to have a "clean" deployment.

I see your point, but I don't think most users have this issue.

Maybe this was also explained here already, but why can't you reuse your agents using the agent tokens? If you use the autoscaler or similar, the old agents should be removed anyways (otherwise it's a bug there).

pat-s commented 8 months ago

I see your point, but I don't think most users have this issue.

Any k8s user will face this and while I don't know the numbers, their relative share will likely increase.

Maybe this was also explained here already, but why can't you reuse your agents using the agent tokens?

You can (and likely should), but right now each new container creates a new agent, likely because it creates a new agent token during startup (?). And containers get replaced/recreated frequently in k8s. I checked the agent Helm chart, but nothing is done there with regard to token creation, so it must be happening inside the default agent container. Maybe there is an easy way/fix to store agent tokens in the DB and reuse them during container startup (if all other config options stay the same)?

qwerty287 commented 8 months ago

Maybe there is an easy way/fix to store agent tokens in the DB and reuse them during container startup (if all other config options stay the same)?

There's the WOODPECKER_AGENT_CONFIG_FILE config, which should be used to persist the token. If you mount that into your container, the same token and thus the same agent should be used every time. (If you already know this but it doesn't solve the issue, I probably just don't know enough about the issue, as I can't say much about k8s.)
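
For illustration, here is a minimal sketch (not taken from the Helm chart) of how that file could be persisted on Kubernetes with a StatefulSet, so each replica keeps its own identity across restarts; the resource names, the mount path and the storage size are assumptions:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: woodpecker-agent
spec:
  serviceName: woodpecker-agent            # assumes a matching headless Service exists
  replicas: 3
  selector:
    matchLabels:
      app: woodpecker-agent
  template:
    metadata:
      labels:
        app: woodpecker-agent
    spec:
      containers:
        - name: woodpecker-agent
          image: quay.io/woodpeckerci/woodpecker-agent:v2.1.1
          env:
            - name: WOODPECKER_SERVER
              value: woodpecker-server:9000          # assumed gRPC address of the server
            - name: WOODPECKER_AGENT_SECRET
              valueFrom:
                secretKeyRef:
                  name: woodpecker-agent-secret      # assumed Secret holding the shared token
                  key: WOODPECKER_AGENT_SECRET
            - name: WOODPECKER_AGENT_CONFIG_FILE
              value: /etc/woodpecker/agent-config.conf   # assumed path on the mounted volume
          volumeMounts:
            - name: agent-config
              mountPath: /etc/woodpecker
  volumeClaimTemplates:
    # one small volume per replica, so each agent keeps its own ID across restarts
    - metadata:
        name: agent-config
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Mi

A plain Deployment with one shared volume would not help here, since all replicas would write to the same config file; the per-replica volumeClaimTemplates is the point of this sketch.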

zc-devs commented 8 months ago

why can't you reuse your agents using the agent tokens?

needs deployment per agent

And we don't make a deployment per agent, we just set the replica count:

kind: Deployment
spec:
  replicas: 100500 # number of agents
  template:
    spec:
      containers:
        - name: woodpecker-agent
          image: quay.io/woodpeckerci/woodpecker-agent:v2.1.1
          env:
            - name: WOODPECKER_AGENT_SECRET
              value: "which token should be here? (system)"

because it creates a new agent token during startup?

We use the System token. And as we do not use a StatefulSet / do not persist the ID, a new ID is generated at the next registration; this repeats every time you restart the pod (always before #2606, sometimes after).

config which should be used to persist the token

Not the token, but the ID, as far as I know. It makes sense for the system token; the agent token is a kind of ID by itself (it works without a config file).

qwerty287 commented 8 months ago

And as we do not use a StatefulSet / do not persist the ID

But why? Is this not possible with kubernetes?

Not the token, but the ID, as far as I know. It makes sense for the system token; the agent token is a kind of ID by itself (it works without a config file).

Yes, sorry, I was wrong, but it's the same effect.

zc-devs commented 8 months ago

  1. I don't feel that it is a stateful app; rather, the statefulness was added artificially (a config file with the ID).
  2. There are cases where it breaks anyway. For example, I run an additional agent on my notebook from time to time to offload my main "server". Let's say it runs on Docker. I clean all containers, volumes and cached images from time to time. Here we are - the same problem. Some people even want a single-use agent.
  3. Warning: this may sound rude, but an app should take care of its... garbage, let's say, despite having a cool API (or whatever else). They could have said "we have an API, do it yourself", but they didn't.
  4. I think WP will need Cron Jobs at some point in the future anyway, whether it will be Agents cleaning or something else.
pat-s commented 8 months ago

Agree with @zc-devs. Also, other workflow-like apps like AWX or self-hosted Terraform apps don't leave persistent traces of their runners behind and work in a stateless fashion.

So where are we now in this discussion? @anbraten thumbed up qwerty's comment; if the positions are still the same, it will be hard to implement a change here or to continue arguing.

While this might be an issue that mainly affects k8s environments, it is a proxy discussion for general garbage collection within WP and hence quite an important one direction-wise. Also FYI @6543 because of this.

zc-devs commented 8 months ago

In the meantime, you can use this CronJob.
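
For reference, a sketch of what such a CronJob could look like (not necessarily the one linked above; the /api/agents endpoints, the perPage parameter, the last_contact field and the image name are assumptions to verify against your Woodpecker version, and WOODPECKER_TOKEN must be an admin user's API token):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: woodpecker-agent-cleanup
spec:
  schedule: "0 3 * * *"                  # once a day
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cleanup
              image: ghcr.io/example/curl-jq:latest   # placeholder: any image with curl and jq
              env:
                - name: WOODPECKER_SERVER
                  value: https://woodpecker.example.com
                - name: WOODPECKER_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: woodpecker-admin-token    # assumed Secret holding an admin API token
                      key: token
              command:
                - /bin/sh
                - -c
                - |
                  # Delete agents whose last contact is older than one day.
                  cutoff=$(( $(date +%s) - 86400 ))
                  curl -fsS -H "Authorization: Bearer ${WOODPECKER_TOKEN}" \
                    "${WOODPECKER_SERVER}/api/agents?perPage=100" \
                  | jq -r ".[] | select(.last_contact < ${cutoff}) | .id" \
                  | while read -r id; do
                      curl -fsS -X DELETE \
                        -H "Authorization: Bearer ${WOODPECKER_TOKEN}" \
                        "${WOODPECKER_SERVER}/api/agents/${id}"
                    done

Agents that never connected would also fall below the cutoff and be deleted by this sketch, which is usually what you want for the abandoned entries discussed here.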

qwerty287 commented 7 months ago

Can I ask again what should happen with this? Maybe some more @woodpecker-ci/maintainers could state what they think? Especially @anbraten, as you thumbed up my comment.

pat-s commented 7 months ago

There are two approaches for solving this:

  1. Clean up stale agents on the server side (the scheduled "garbage collection" discussed above).
  2. Persist the agent identity (config file / token) so the same agents are reused across restarts.

Both could be implemented asynchronously and would target partly different issue areas, but would solve the problem at hand.

anbraten commented 7 months ago

I am mainly unsure whether we could find, or allow the user to set, good values for the condition under which an agent should be deleted by the proposed task. These conditions could be quite complex depending on the setup, and as the API already allows implementing your own rules, I would stick to using it for now.

pat-s commented 7 months ago

NB: The same issue is currently also being discussed for the act-runner implementation in the Gitea helm chart, see https://gitea.com/gitea/helm-chart/pulls/596#issuecomment-761996.

One option would of course be to add persistence (as mentioned already). But this doesn't necessarily imply not cleaning up internally, nor does it necessarily require a StatefulSet.

I guess using persistence in the agent chart would be the quickest and simplest way to find a solution here. The issue is somewhat pressing, as for many k8s environments the list of agents grows quickly and there is no simple way to batch-clean them up other than a looped API call.

anbraten commented 7 months ago

Just an idea for k8s: could we update the config-map using the kube api to contain the new agent-id / updated secret?
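
If someone wants to try that, the agent pod's service account would additionally need RBAC permission to patch the ConfigMap; a minimal sketch with placeholder names:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: woodpecker-agent-config-writer
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["woodpecker-agent-config"]   # placeholder ConfigMap name
    verbs: ["get", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: woodpecker-agent-config-writer
subjects:
  - kind: ServiceAccount
    name: woodpecker-agent                       # placeholder: the agent's service account
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: woodpecker-agent-config-writer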

dessalines commented 7 months ago

For some reason (probably from when my server was more open), I have thousands of agents with a capacity of -1 and "never" for online. There are so many that I can't even scroll to the bottom. Deleting them individually would be incredibly tedious, and I'm guessing my actual agents are at the very bottom of this list.

Isn't there any way to clean them?

zc-devs commented 7 months ago

Is there any way to clean them?

postgres=# \c woodpecker
woodpecker=# delete from agents;
dessalines commented 7 months ago

Will my currently functioning agents be able to reconnect and be re-populated in that table if I do that tho?

zc-devs commented 7 months ago

Make a backup first. Yes, if you use the system token and restart the agents. I have not tested it with a dedicated agent token, but it should work as well.

zc-devs commented 7 months ago

could we update the config-map using the kube api

Why not? I was able to create Secrets, so it should not be a problem with a ConfigMap.

config-map contains the new agent-id

I don't get how it is related... OK, we created a ConfigMap with the ID. If we set replicas: 100, then 100 agents will have the same ID...

zc-devs commented 6 months ago

I think WP will need Cron Jobs at some point in the future anyway, whether it will be Agents cleaning or something else.

You won't believe 🤣

#1068

pat-s commented 6 months ago

Should be fixed with https://github.com/woodpecker-ci/helm/releases/tag/1.2.0.