woodpecker-ci / woodpecker

Woodpecker is a simple, yet powerful CI/CD engine with great extensibility.
https://woodpecker-ci.org
Apache License 2.0

Autoscaler #999

Closed kdumontnu closed 6 months ago

kdumontnu commented 2 years ago

Clear and concise description of the problem

As a potential user of Woodpecker I would really like to be able to provision my agent servers on an "as-needed" basis, without having to support Kubernetes.

Suggested solution

Support a Woodpecker-Autoscaler image, which will accept user credentials to spin up and shut down agent instances as necessary.

Roadmap:

Support major cloud providers:

Alternative

I believe Kubernetes is currently proposed as the alternative to an autoscaler for Woodpecker, but it requires all of the infrastructure (and cost) associated with that.

Additional context

I know this has been discussed at many points before on discord, but I'm not sure if it's been determined that it's out of scope or not. It would be helpful to track that discussion here.


anbraten commented 2 years ago

Some questions we have to answer in general:

mr337 commented 2 years ago

I'm not sure how relevant this is, but BuildKite [1] has something similar for AWS, though it leverages external tooling to do so. The BuildKite service has an API that an AWS Lambda polls every minute, asking for pending jobs. If there are pending jobs, it increases the count on an autoscale group. A new instance is fired up, the agent connects and picks up the pending job. That handles the increment side of the autoscaling.

The other part is that if the agent is idle for X minutes it shuts itself down, and as part of shutting down it decrements the autoscale group.

I think this design is pretty nice as it has the benefits of:

[1] - https://github.com/buildkite/buildkite-agent-scaler
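A minimal sketch of that polling loop, assuming a hypothetical `ciClient` that reports pending jobs and a hypothetical `asgClient` that adjusts an autoscale group's desired capacity (neither is a real BuildKite or AWS API, they only stand in for them):

```go
package scaling

import (
	"log"
	"time"
)

// ciClient and asgClient are stand-ins for the real APIs (e.g. the BuildKite
// metrics endpoint and an AWS autoscaling group); both are assumptions made
// purely for illustration.
type ciClient interface {
	PendingJobs() (int, error)
}

type asgClient interface {
	DesiredCapacity() (int, error)
	SetDesiredCapacity(n int) error
}

// pollAndScale mimics the Lambda described above: every interval it checks for
// pending jobs and, if there are any, bumps the desired agent count by one.
// Scaling down is left to the agents themselves (idle shutdown + decrement).
func pollAndScale(ci ciClient, asg asgClient, interval time.Duration) {
	for range time.Tick(interval) {
		pending, err := ci.PendingJobs()
		if err != nil {
			log.Printf("polling pending jobs failed: %v", err)
			continue
		}
		if pending == 0 {
			continue
		}
		current, err := asg.DesiredCapacity()
		if err != nil {
			log.Printf("reading desired capacity failed: %v", err)
			continue
		}
		if err := asg.SetDesiredCapacity(current + 1); err != nil {
			log.Printf("scaling up failed: %v", err)
		}
	}
}
```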

kdumontnu commented 2 years ago

Some questions we have to answer in general:

  • Is there maybe a go package Woodpecker could use, or would it require writing some extendable system to support all kinds of hosters?

I'm not sure about a unified package, but I don't think wrapping the various go packages from different providers should be terribly difficult (e.g. google.golang.org/api/googleapi, github.com/aws/aws-sdk-go/aws, ...). I think if we build an autoscaler for one or two services, the community could very easily add support for others as needed.

edit: maybe using terraform could help https://github.com/hashicorp/go-tfe
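One way such an extendable system could look, sketched here with a made-up `Provider` interface; the method names and the idea of per-hoster wrappers are assumptions for illustration, not an existing Woodpecker API:

```go
package provider

import "context"

// Agent describes one cloud instance running a Woodpecker agent.
type Agent struct {
	ID   string
	Name string
}

// Provider is a minimal abstraction every hoster-specific wrapper (hcloud-go,
// aws-sdk-go, google.golang.org/api, ...) would implement; the autoscaling
// logic itself would only talk to this interface.
type Provider interface {
	DeployAgent(ctx context.Context, name string) (*Agent, error)
	RemoveAgent(ctx context.Context, agent *Agent) error
	ListDeployedAgents(ctx context.Context) ([]*Agent, error)
}
```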

  • Which metrics does it need to decide about rescaling?

From the Woodpecker host it would need:

For setup configs it would need:

From the provider, it will need to poll/sync:

Hopefully this helps (I'm probably missing a lot of interfaces, but maybe we can update this as we learn more)

lafriks commented 2 years ago

This should probably be created as a separate service/repo to make it easier to maintain and to avoid bringing too many dependencies into the core.

6543 commented 1 year ago

how to do it externally :) -> https://github.com/windsource/picus

6543 commented 1 year ago

I would say we should wait for woodpecker-ci/woodpecker#1189 ... once we have it, we can calculate if and when we need to start or stop agent instances ...

windsource commented 1 year ago

In order to have an external autoscaler service, two things are currently missing in Woodpecker from my point of view:

Detect those agents that are idle

If the external autoscaler has spun up several agents and there are fewer pending jobs later, the autoscaler should stop one or more agents. But in this case it needs to detect which agents currently do not have a job running. As the agents do not have an API that an external service could use (I think), the Woodpecker server is the only point of contact. I have already checked the API /api/queue/info, but it does not provide any information about the agent host where a pipeline is running. Could that information be retrieved externally in any other way?
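For reference, `/api/queue/info` can be polled from the outside with a server token; a minimal sketch is below. The response struct is only my rough assumption of its shape, and the point of this comment is exactly that it exposes aggregate counts without mapping tasks to agents:

```go
package queue

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// QueueStats is an assumed, simplified shape of the /api/queue/info response:
// aggregate counts only, with no information about which agent runs what.
type QueueStats struct {
	Stats struct {
		PendingCount int `json:"pending_count"`
		RunningCount int `json:"running_count"`
		WorkerCount  int `json:"worker_count"`
	} `json:"stats"`
}

func fetchQueueInfo(server, token string) (*QueueStats, error) {
	req, err := http.NewRequest(http.MethodGet, server+"/api/queue/info", nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}

	var stats QueueStats
	if err := json.NewDecoder(resp.Body).Decode(&stats); err != nil {
		return nil, err
	}
	return &stats, nil
}
```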

Mark agent to not schedule jobs anymore

Before an agent is stopped, the server should be told not to schedule any more jobs on it. If this is not done there could be a race condition: there are no jobs running on the agent, the external autoscaler stops it, but the server has already scheduled a job on it in the meantime. That would cause the job to be interrupted until it runs into a timeout (?). Could a corresponding API be added to the Woodpecker server (or agent)?

Note: I am the author of Picus and am currently thinking about extending Picus into a real autoscaler for Woodpecker.

anbraten commented 1 year ago

@windsource Great suggestions. I started to add an agents list in woodpecker-ci/woodpecker#1189, which would be the first step in that direction as it allows the server to identify an agent. (I mainly need another maintainer to review this.)

Once that PR is merged, adding a link between agents and queue entries should be easily doable. The do-not-schedule flag is a great idea as well; it would also be possible to add after woodpecker-ci/woodpecker#1189.

windsource commented 1 year ago

Another feature might be useful here as well:

Assign priorities to agents

I wonder how the Woodpecker server schedules jobs on agents when there is more than one agent available. Let's assume the following situation: there are 3 static agents (maybe on premise) already connected, and the external autoscaler is able to dynamically start more agents in the cloud in high-load situations. When load decreases, the remaining jobs should be scheduled preferentially on the static agents so that the cloud agents can be stopped again. If the user could provide a 'priority' parameter for each agent, that would be possible.

@anbraten what do you think about that?

anbraten commented 1 year ago

@windsource Not sure if that is really needed. At least for now I would put it on the long ideas list 😉

gmuellerinform commented 1 year ago

Hi! Since Woodpecker is already using containers for everything, why not simply use containers for autoscaling too? Call an autoscaling container for every pending and finished job and provide some info, e.g. the total number of pending jobs, as env variables. The container can then decide by itself what to do. I would like to call my Terraform scripts, for example; others might just call some web API. This way there could be autoscaling plugins. Just an idea.
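If such a plugin contract existed, the container side could be as small as the sketch below. The `WOODPECKER_PENDING_JOBS` / `WOODPECKER_RUNNING_AGENTS` variable names and the whole contract are invented here purely to illustrate the proposal:

```go
package main

import (
	"log"
	"os"
	"os/exec"
	"strconv"
)

// A hypothetical autoscaling plugin: the server would run this container on
// every pending/finished job and pass queue metrics as environment variables.
func main() {
	// Errors are ignored for brevity; missing variables simply read as 0.
	pending, _ := strconv.Atoi(os.Getenv("WOODPECKER_PENDING_JOBS"))
	agents, _ := strconv.Atoi(os.Getenv("WOODPECKER_RUNNING_AGENTS"))

	if pending > 0 && agents == 0 {
		// Hand off to the user's own tooling (terraform, a web API, ...).
		cmd := exec.Command("terraform", "apply", "-auto-approve")
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Run(); err != nil {
			log.Fatalf("scale up failed: %v", err)
		}
	}
}
```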

anbraten commented 1 year ago

FYI with woodpecker-ci/woodpecker#1631 we should soon be able to properly know which agent has nothing to do and can be removed.

anbraten commented 1 year ago

I will start creating an external autoscaler service written in golang (to be able to share some code) over at https://github.com/woodpecker-ci/autoscaler. If anyone is interested in helping, please reach out to me.

@windsource I really like what you started with picus ❤️ , maybe you are interested in working on the go implementation as well.

6543 commented 1 year ago

well, I was going to implement it as a project for my bachelor thesis this semester ...

windsource commented 1 year ago

Hi @anbraten, that's good news. Currently Picus is only able to scale a single agent up and down, but I have already thought about how to handle more agents; I just did not have the time to implement that yet. Maybe we can exchange some ideas about the autoscaler in golang.

The tricky part is starting the agent. In Picus I used different methods depending on the cloud provider. In AWS you do not pay for stopped instances (except for the block storage), so I created one instance at the beginning which is then started and stopped by the autoscaler. The advantage of that solution is that build images are already present in the block storage and do not need to be pulled again; the agent also starts very quickly. The disadvantage is that it does not scale to more than one agent (unless you have prepared multiple agents like that). I am not sure how important it is to have all images from the last build already present on the newly started agent, but I think it would be useful.

For Hetzner Cloud the autoscaler starts a new instance, and I use cloud-init to set up the instance with docker compose and the Woodpecker agent image. In that case all images (including build images) need to be pulled when the agent is started. That also works quite well but puts some load on the container registries.
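A minimal sketch of generating such cloud-init user data from Go; `WOODPECKER_SERVER` and `WOODPECKER_AGENT_SECRET` are the agent's usual settings, but the install step and image tag are assumptions, not what Picus actually generates:

```go
package userdata

import "fmt"

// buildUserData returns a minimal cloud-init script that installs Docker and
// starts a Woodpecker agent container on a freshly created instance.
func buildUserData(server, agentSecret string) string {
	return fmt.Sprintf(`#cloud-config
runcmd:
  - curl -fsSL https://get.docker.com | sh
  - >
    docker run -d --restart unless-stopped
    -v /var/run/docker.sock:/var/run/docker.sock
    -e WOODPECKER_SERVER=%s
    -e WOODPECKER_AGENT_SECRET=%s
    woodpeckerci/woodpecker-agent:latest
`, server, agentSecret)
}
```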

For AWS that method could be used as well to scale to more than one agent. One could pre-generate an AMI and use an autoscaling group which is then configured by the Woodpecker autoscaler. An alternative for AWS would be instance templates; I am not sure if this would be better.

Regarding build images already being available on the agent: if there is more than one agent, we probably cannot guarantee that a single project is always assigned to the same agent, or can we, and how? Of course there are labels, but what if the agents all have the same config?

anbraten commented 1 year ago

I've already started on the basic scaling logic. Most interesting is probably the code where I calculate the diff for new / fewer agents at the moment (not sure if it is the best approach): https://github.com/woodpecker-ci/autoscaler/blob/457c0d0545157c03a91829ef844e9d6b322685d2/main.go#L146-L174 It basically takes the amount of free workers (agents * WOODPECKER_MAX_WORKFLOWS) and the list of pending tasks, limited by the min and max agent counts, and then tries to add or remove agents to get closer to that amount. Removing agents is currently implemented by setting the do-not-schedule-new-tasks flag. At the end, some kind of garbage collection removes all agents which are disabled and do not execute any workflows.
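A standalone paraphrase of that calculation, only as I read the linked code; the function name, the exact rounding, and the clamping order are my assumptions, not a verbatim copy:

```go
package scaler

import "math"

// calcAgentDiff: free capacity is activeAgents * maxWorkflowsPerAgent minus
// the workflows already running, the shortfall is converted into whole agents,
// and the resulting pool size is clamped to the configured min/max. A positive
// result means "add that many agents", a negative one "remove them" (via the
// do-not-schedule flag plus garbage collection).
func calcAgentDiff(activeAgents, runningWorkflows, pendingTasks, maxWorkflowsPerAgent, minAgents, maxAgents int) int {
	freeSlots := activeAgents*maxWorkflowsPerAgent - runningWorkflows
	needed := pendingTasks - freeSlots

	// How many whole agents that demand (or surplus) translates into.
	diff := int(math.Ceil(float64(needed) / float64(maxWorkflowsPerAgent)))

	target := activeAgents + diff
	if target < minAgents {
		target = minAgents
	}
	if target > maxAgents {
		target = maxAgents
	}
	return target - activeAgents
}
```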

I am not sure how important it is to have all images from the last build already present on the newly started agent, but I think it would be useful.

It is "only" some kind of cache then, so it should definitely work without at least in the beginning. But especially in view of a conscious and sustainable use of computing resources we should optimize this flow later on.

For Hetzner I saw that this kubernetes autoscaler is creating an initial snapshot and creates new nodes using that snapshot: https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner#autoscaling-node-pools

I guess in general the kubernetes autoscaler project could give us some nice insights into how to do scaling and how the cloud providers could be used: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider

CC @windsource

kolaente commented 1 year ago

Having only skimmed this thread: what about forking https://github.com/drone/autoscaler?

anbraten commented 1 year ago

Biggest problem is its license 🙈

mvdkleijn commented 1 year ago

Can we get Linode on the list of cloud providers? :yum:

anbraten commented 1 year ago

The first container image for the autoscaler (for now with Hetzner added as the first cloud provider) was just released: https://github.com/woodpecker-ci/autoscaler/pull/1

https://hub.docker.com/r/woodpeckerci/autoscaler

Would be nice if some of you could test it and provide feedback, or maybe even start adding new cloud providers 😉

guisea commented 1 year ago

@anbraten Love the autoscaler. The core and calculations look sound and do what is needed. I'm now running a self-built image including the Linode driver I created. It is doing its thing without issue.

anbraten commented 1 year ago

@guisea Awesome. Thanks for the feedback.

anbraten commented 6 months ago

Closing this one as we will track further development in the autoscaler repo: https://github.com/woodpecker-ci/autoscaler