woodpecker-ci / woodpecker

Woodpecker is a simple, yet powerful CI/CD engine with great extensibility.
https://woodpecker-ci.org
Apache License 2.0

Delete old pipeline logs after X days or Y new runs #1068

Open qwerty287 opened 2 years ago

qwerty287 commented 2 years ago

Clear and concise description of the problem

From https://codeberg.org/Codeberg-CI/feedback/issues/63:

It would be nice to have an option to delete old logs of pipeline entries. Either automatically (delete logs of successful runs after X days or Y new runs?) or with some sort of checkbox selection and action?

Suggested solution

An option that automatically deletes old pipeline logs

Alternative

No response

Additional context

No response

maxkratz commented 1 year ago

Indeed, an option to manually delete pipeline runs (for example, after debugging the CI config on a testing branch) would be nice to have.

anbraten commented 7 months ago

I would like to add support for automatic deletion of old pipelines. I would therefore suggest the "keep max last X pipelines" approach, as it covers the issue (I want to keep the database size to a minimum by deleting unused content) and seems to be the easiest solution to implement while keeping things simple. This would make the special API endpoints requested in #3506 obsolete for now.

My plan would be the following:

xoxys commented 7 months ago

I don't get it. Why not keep the API-first approach? Why would you hardcode this feature instead of providing a reusable API?

anbraten commented 7 months ago

I don't get it. Why not keep the API-first approach? Why would you hardcode this feature instead of providing a reusable API?

How would any other users benefit from this feature? Will you open-source it as a service? If the plan is to introduce it as an internal feature later on, wouldn't those new API endpoints be unnecessary? I guess we don't plan to call the API endpoints from the code?

xoxys commented 7 months ago

Still don't understand your point...

How would any other users benefit from this feature? Will you open-source it as a service?

Of course not. Why would I create an external service? I just said: let's keep the PRs small, implement the API first, and create a second PR that implements the UI components based on the API. I would also have added a CLI command, e.g. to run woodpecker pipeline prune --older-than 30d based on the API. This CLI command could be executed manually whenever needed, or as a scheduled job.
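For illustration, a minimal standalone sketch of what such a prune command could look like. This is not the actual Woodpecker CLI; pruneBefore stands in for the prune API being proposed, and Go's standard duration flag accepts e.g. 720h rather than the 30d shorthand:

```go
// Hypothetical sketch of the proposed "pipeline prune" command.
// Nothing here exists in Woodpecker; pruneBefore stands in for the
// prune API endpoint being discussed.
package main

import (
	"flag"
	"fmt"
	"log"
	"time"
)

func main() {
	// Go's flag.Duration accepts e.g. "720h" for 30 days; the "30d"
	// shorthand from the proposal would need custom parsing.
	olderThan := flag.Duration("older-than", 720*time.Hour, "prune pipelines older than this duration")
	repoID := flag.Int64("repo-id", 0, "repository ID to prune")
	flag.Parse()

	cutoff := time.Now().Add(-*olderThan)
	log.Printf("pruning pipelines of repo %d created before %s", *repoID, cutoff.Format(time.RFC3339))

	if err := pruneBefore(*repoID, cutoff); err != nil {
		log.Fatal(err)
	}
	fmt.Println("done")
}

// pruneBefore is a placeholder for the API call the thread is discussing;
// it would delete all pipelines of the repo created before the cutoff.
func pruneBefore(repoID int64, cutoff time.Time) error {
	return nil // no-op in this sketch
}
```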

Nobody has said that we don't want proper UI integration. As I wrote in the PR, the API can be called from the UI (that's how the UI works in general, right?). The API can also be called from the CLI (the CLI client uses the Go SDK, which only makes simple API calls).

If this is not true, why do we provide APIs at all? An internal-only function prevents manual cleanups whenever a user wants to run one, with whatever scope they want to clean up. If you don't plan to create a reusable API first, what should the UI implementation even look like? Besides, I thought Woodpecker follows an API-first approach. This means that users are able to write their own CLI clients or even their own web front-ends if they want to. That's why I don't understand why you don't want a public API (especially since one already exists that exposes a lot of pipeline/repo-related operations).

pat-s commented 7 months ago

Reading this thread, @xoxys' reasoning makes a lot of sense to me (having API endpoints that can be called from the UI and CLI), and it would also cover the intentions of @anbraten AFAICS.

@anbraten Can you elaborate on what you see as the downsides of @xoxys' approach?

anbraten commented 7 months ago

I thought about adding an API in the beginning as well. However, to get automatic deletion / retention of pipelines, this has to be integrated into some kind of routine. At first the idea was to have it executed by a cron scheduler, but then someone in chat suggested simply removing old pipelines each time a new pipeline is created, which seems pretty smart in terms of not having to check inactive repos on large instances. This would make an API for this somewhat obsolete. The main remaining question, from what I read, is whether a "keep X pipelines" setting or a "keep pipelines not older than X days" setting is more useful for the majority of users.

Keep X pipelines seems nice, as projects often have different amounts of activity over time. Judging from our UI, most users are probably looking at something like the last 10 pipelines. So keeping the 100 latest pipelines could do the job. If there is no activity on a project for a year or so, you would still have some pipelines left once you come back to it.

Keeping pipelines for X days could make more sense for legal requirements, but I haven't heard from anyone that this is a thing for CI pipelines yet.
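A minimal sketch of that purge-on-creation idea, with all names (Store, PipelinesBeyondLast, the keep setting) assumed for illustration rather than taken from Woodpecker's code:

```go
// Sketch: purge old pipelines of a repo right after a new one is created,
// so a scheduler never has to scan inactive repos. All names are illustrative.
package retention

// Pipeline is a minimal stand-in for Woodpecker's pipeline model.
type Pipeline struct {
	ID     int64
	RepoID int64
}

// Store is a hypothetical persistence interface.
type Store interface {
	// PipelinesBeyondLast returns the pipelines of a repo that are older
	// than its newest `keep` pipelines.
	PipelinesBeyondLast(repoID int64, keep int) ([]*Pipeline, error)
	DeletePipeline(p *Pipeline) error
}

// PurgeOnCreate is meant to be called right after a new pipeline was persisted.
// A keep value of 0 disables retention entirely.
func PurgeOnCreate(s Store, repoID int64, keep int) error {
	if keep <= 0 {
		return nil
	}
	old, err := s.PipelinesBeyondLast(repoID, keep)
	if err != nil {
		return err
	}
	for _, p := range old {
		if err := s.DeletePipeline(p); err != nil {
			return err
		}
	}
	return nil
}
```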

xoxys commented 7 months ago

You still just ignore my main points.

If you don't plan to create a reusable API first, what should the UI implementation even look like?

Is not answered.

This would make an API for this somewhat obsolete.

No, because users should still have the ability to delete pipelines whenever they want, and in whatever amount they want. I can't understand why you are trying to avoid an API at all costs. However, I have outlined and explained my points pretty clearly multiple times now.

xoxys commented 7 months ago

Keep X pipelines seems nice, as projects often have different amounts of activity over time. Judging from our UI, most users are probably looking at something like the last 10 pipelines. So keeping the 100 latest pipelines could do the job. If there is no activity on a project for a year or so, you would still have some pipelines left once you come back to it.

I have chosen the time-based approach for the API because it can handle "keep X pipelines" as well. If we prefer that for e.g. the Woodpecker UI, just do something like the following:
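(A minimal sketch of the idea, assuming a hypothetical client that only exposes time-based pruning; neither the interface nor its methods reflect the actual Woodpecker API or Go SDK.)

```go
// Sketch: emulating a "keep the latest N pipelines" policy on top of a
// purely time-based prune API. Everything here is hypothetical.
package keepn

import "time"

// Pipeline is a minimal stand-in for Woodpecker's pipeline model.
type Pipeline struct {
	Number  int64
	Created time.Time
}

// Client is a hypothetical API client that only offers time-based pruning.
type Client interface {
	// ListPipelines returns up to `limit` pipelines of a repo, newest first.
	ListPipelines(repoID int64, limit int) ([]Pipeline, error)
	// PrunePipelinesBefore deletes all pipelines of a repo created before t.
	PrunePipelinesBefore(repoID int64, t time.Time) error
}

// KeepLatest translates "keep N pipelines" into a time threshold: it looks up
// the creation time of the N-th newest pipeline and prunes everything older.
func KeepLatest(c Client, repoID int64, n int) error {
	pipelines, err := c.ListPipelines(repoID, n)
	if err != nil {
		return err
	}
	if len(pipelines) < n {
		return nil // fewer than N pipelines exist, nothing to prune
	}
	threshold := pipelines[len(pipelines)-1].Created
	return c.PrunePipelinesBefore(repoID, threshold)
}
```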

If some people want a time-based approach, they can use the same API, whereas doing it the other way around (implementing an API only for "keep X pipelines") would prevent any time-based deletion (e.g. older than 30 days).

In a real-world scenario, IMO time-based is preferable in general, because "keep X pipelines" has a drawback. Let's say someone is spamming your repo with PRs (by accident or as an "attack") while "keep 100 pipelines" is configured: you will lose your build history in seconds (yes, one could enable required approvals).

anbraten commented 6 months ago

You still just ignore my main points.

If you don't plan to create a reusable API first, what should the UI implementation even look like?

To achieve the goal of "clean up the pipeline list / database so it does not fill up the disk", most users probably don't want to create an external service / cron job. Therefore a setting like the following, which enables an automatic cleanup, would be sufficient for the majority:

[screenshot: proposed repo setting for automatic pipeline cleanup]

If a user still has specific needs, the /repos/{repo_id}/pipelines endpoint could be used to retrieve a list of pipelines, and we could consider adding a generic endpoint for deleting just a single pipeline, as suggested before. For even more specific needs there is always the option of interacting with the database directly.
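A rough sketch of that path: list pipelines via the existing /repos/{repo_id}/pipelines endpoint and delete selected ones via a single-pipeline delete endpoint, which at this point is only being considered. The JSON field names, the delete route, and the auth header are assumptions:

```go
// Sketch of an ad-hoc cleanup against the HTTP API. The list endpoint exists;
// the per-pipeline DELETE route and the field names are assumptions.
package cleanup

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type pipeline struct {
	Number  int64  `json:"number"`
	Status  string `json:"status"`
	Created int64  `json:"created_at"` // assumed unix-timestamp field name
}

// PruneSuccessful deletes successful pipelines of a repo older than olderThan.
func PruneSuccessful(server, token string, repoID int64, olderThan time.Duration) error {
	client := &http.Client{Timeout: 30 * time.Second}

	listURL := fmt.Sprintf("%s/api/repos/%d/pipelines", server, repoID)
	req, err := http.NewRequest(http.MethodGet, listURL, nil)
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	var pipelines []pipeline
	if err := json.NewDecoder(resp.Body).Decode(&pipelines); err != nil {
		return err
	}

	cutoff := time.Now().Add(-olderThan).Unix()
	for _, p := range pipelines {
		// Only remove finished, successful pipelines older than the cutoff.
		if p.Status != "success" || p.Created >= cutoff {
			continue
		}
		// Hypothetical single-pipeline delete endpoint.
		delURL := fmt.Sprintf("%s/api/repos/%d/pipelines/%d", server, repoID, p.Number)
		delReq, err := http.NewRequest(http.MethodDelete, delURL, nil)
		if err != nil {
			return err
		}
		delReq.Header.Set("Authorization", "Bearer "+token)
		delResp, err := client.Do(delReq)
		if err != nil {
			return err
		}
		delResp.Body.Close()
	}
	return nil
}
```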

xoxys commented 6 months ago

To achieve the goal of "clean up the pipeline list / database so it does not fill up the disk", most users probably don't want to create an external service / cron job.

You are making assumptions about what users want instead of letting them decide for themselves. As I am also a user, I can tell you that I do not want such a setting, and it would not help me at all.

Therefore a setting like the following, which enables an automatic cleanup, would be sufficient for the majority:

I tried to find a middle ground and clearly showed how this can also be achieved with the proposed API. I was also willing to do exactly what you want for the web UI (just using APIs instead)... That way everyone would get something out of it: you get the setting in the UI, and we get an API. However, I'm out at this point 👋.

zc-devs commented 6 months ago

What a hot discussion! It's funny and sad at the same time...

TL;DR: go ahead and implement it your way. Whoever wants an API/whatever can implement it on top of your service.


why you are trying to avoid an API at all costs

I doubt that is the main reason. The main one is avoiding a scheduling service, I think. It is and it was. I don't understand why, though... If one has concerns regarding HA, then:

  1. It doesn't work now anyway;
  2. We could develop a separate microservice, as Harbor did with its Job Service, for example.

The second point is: why should one implement functionality that is orthogonal to what they want/need (the requirements)? Why orthogonal? Because the current feature requires just one integer, the retention policy setting, not a low-level API.

would like to add support for automatic deletion of old pipelines

Note automatic.

I have chosen the time-based approach ... just do ...

Who? Me? Manually? ^ Automatic.

users are able to write their own CLI clients or even their own web front-ends

What a beautiful definition of a user!

To achieve the goal of "clean up the pipeline list / database so it does not fill up the disk", most users probably don't want to create an external service / cron job

Exactly! I'm in that bucket of users. We do not consume cool Application Programming Interfaces, nor do we write front-ends. We use the GUI (or the CLI at worst).

I can tell you that I do not want such a setting, and it would not help me at all

Don't want it? Don't use it. It would not help me either, but it might help somebody.


Anbraten is going to write an internal service (file/class) to clean up logs. That service will be triggered by pipeline creation (I prefer cron, as you know :). There will be a retention policy setting in the GUI. Whoever wants to expose that service via REST API, gRPC, CLI / whatever is welcome to do so, I guess.

If a user still has specific needs, the /repos/{repo_id}/pipelines endpoint could be used to retrieve a list of pipelines, and we could consider adding a generic endpoint for deleting just a single pipeline, as suggested before. For even more specific needs there is always the option of interacting with the database directly.

But it's a different task.

xoxys commented 6 months ago

You missed the entire point...

zc-devs commented 6 months ago
  1. That is sad, but not related to the implementation under discussion.
  2. --
  3. Your API is completely different from what the current implementation requires.
  4. I doubt it. Are you suggesting running this algorithm in the frontend? When, considering the automatic aspect?
  5. These are completely different features.
xoxys commented 6 months ago

Having an API to delete pipelines is a different feature than having a UI setting to do it? Why do we have an API to delete a repo if we have a button in the UI to do it? And what does the red "Delete Repo" button do? It calls the DeleteRepo API.

And that's exactly what I'm talking about.

xoxys commented 6 months ago

I doubt it. Are you suggesting running this algorithm in the frontend? When, considering the automatic aspect?

I agree the UI might be the wrong place to automate it.

zc-devs commented 6 months ago

Having an API to delete pipelines is a different feature than having a UI setting to do it?

Having an API to delete pipelines is a different feature than having a UI to set a retention policy / set up a trigger.

Also, I've just read through #3506:

Note: these APIs are about deleting the entire pipeline, not only its logs.

And this issue title:

Delete old pipeline logs


deleting the entire pipeline

I don't want to search for it, but there was a concern about auditing.

Why do we have an API to delete a repo if we have a button in the UI to do it? And what does the red "Delete Repo" button do? It calls the DeleteRepo API. And that's exactly what I'm talking about.

I understood it. If Anbraten wanted just a button to delete pipeline(s) (or logs? ;)), then your points would be absolutely valid.

xoxys commented 6 months ago

Having an API to delete pipelines is a different feature than having a UI to set a retention policy / set up a trigger.

You are right.

I don't want to search for it, but there was a concern about auditing.

If that had been the main problem, I could understand it as well, but nobody said anything like that, neither in the PR nor in this discussion.

Anyway, I agree it might be better to separate these topics.

zc-devs commented 6 months ago

I've been thinking for a while...

Hooking into the pipeline lifecycle is fine if we can decouple it. Instead of calling LogsCleaningService directly from PipelineService, we could emit a PipelineCreatedEvent and handle it in the cleaning service, as sketched below.

  1. LogsCleaningService is a singleton; it loads its configuration (the retention policy) at startup.
  2. A pipeline runs. At some point it is saved to the DB by PipelineService (something like this should already exist; I didn't look up the name in the code).
  3. Right after that, a PipelineCreatedEvent is created/emitted.
  4. LogsCleaningService catches the event and gets the repo ID.
  5. It loads (probably via PipelineService) the IDs of pipeline runs matching the retention policy condition: older than X days / beyond X count.
  6. It calls a method of PipelineService (or maybe a dedicated log service, if one exists), supplying the IDs from the previous step, to remove the log entries, e.g. removeLogsOf(int[]{1,2,3}).

Looking forward to the implementation of this feature, because I could apply the same approach to agent cleanup as well.
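A minimal in-process sketch of that event-driven flow, with hypothetical names throughout (none of these types or methods exist in Woodpecker):

```go
// Sketch: a cleaning service that reacts to pipeline-created events instead of
// being called directly by the pipeline code. All names are illustrative.
package logsretention

import "time"

// PipelineCreatedEvent is emitted right after a pipeline is persisted.
type PipelineCreatedEvent struct {
	RepoID     int64
	PipelineID int64
}

// LogStore is a hypothetical interface over the pipeline/log tables.
type LogStore interface {
	// PipelineIDsOlderThan returns IDs of the repo's pipelines created before t.
	PipelineIDsOlderThan(repoID int64, t time.Time) ([]int64, error)
	// RemoveLogsOf deletes the log entries of the given pipelines.
	RemoveLogsOf(pipelineIDs []int64) error
}

// LogsCleaningService loads its retention policy once at startup and then only
// reacts to events.
type LogsCleaningService struct {
	store     LogStore
	retention time.Duration // e.g. 30 * 24 * time.Hour
}

func NewLogsCleaningService(store LogStore, retention time.Duration) *LogsCleaningService {
	return &LogsCleaningService{store: store, retention: retention}
}

// Run consumes events from a channel acting as a minimal in-process event bus.
func (s *LogsCleaningService) Run(events <-chan PipelineCreatedEvent) {
	for ev := range events {
		cutoff := time.Now().Add(-s.retention)
		ids, err := s.store.PipelineIDsOlderThan(ev.RepoID, cutoff)
		if err != nil || len(ids) == 0 {
			continue
		}
		_ = s.store.RemoveLogsOf(ids) // a real service would log errors here
	}
}
```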