Create self-hosted runner for integration(-ish) CI tests

safoinme commented 1 year ago

Introduction

This pull request (PR) addresses a long-standing challenge we've encountered with the K3d stack recipe. Specifically, our previous testing process on GitHub Actions fell short due to the resource-intensive nature of provisioning a K3d cluster and installing various applications.

To overcome this hurdle, we've introduced a solution built on GitHub's self-hosted runners. These runners let us execute GitHub Actions workloads in our own custom environments, giving us more control over the resources available to each job.

However, we are mindful of cost considerations and the environmental impact of maintaining VMs that run continuously. To address this, we've integrated Terraform into our workflow. With Terraform, we can dynamically provision VMs only when needed for testing purposes and efficiently de-provision them once testing is complete.

This PR represents a significant improvement in our testing infrastructure, allowing us to ensure the reliability and performance of the K3d stack recipe without incurring unnecessary costs or resource wastage. We look forward to your feedback and collaboration to further enhance our development process.

A full detailed document about this can be found here

safoinme commented 1 year ago

@strickvl Regarding the questions:

Which tests exactly? If we are talking about the calls that provision and destroy the resources, they were not called because I didn't know exactly which tests we would want to run on the environment.
We can have them all in one workflow. However, the job that runs the tests must be changed to runs-on: self-hosted.
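
A minimal sketch of what that single-workflow layout might look like. The workflow name, job names, the `terraform` working directory, and the test command are all hypothetical placeholders, not taken from this PR; cloud credentials are omitted.

```yaml
# Hypothetical workflow: names, paths, and commands are illustrative assumptions.
name: integration-tests
on: pull_request

jobs:
  provision:
    runs-on: ubuntu-latest            # GitHub-hosted runner provisions the VM
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init && terraform apply -auto-approve
        working-directory: terraform  # assumed location of the Terraform config

  test:
    needs: provision                  # wait until the VM exists
    runs-on: self-hosted              # the only job that targets the self-hosted runner
    steps:
      - uses: actions/checkout@v4
      - run: echo "run the K3d integration tests here"  # placeholder command
```

Only the `test` job needs `runs-on: self-hosted`; provisioning can stay on a GitHub-hosted runner, since the self-hosted VM does not exist until that job finishes.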

strickvl commented 1 year ago

@strickvl Regarding the questions:

Which tests exactly? If we are talking about the calls that provision and destroy the resources, they were not called because I didn't know exactly which tests we would want to run on the environment.

I'd suggest you add one way to indicate how you think this should be used.

We can have them all in one workflow. However, the job that runs the tests must be changed to runs-on: self-hosted.

Yeah it just felt a bit weird to have them running in separate workflows.

Also, some follow-up questions:

What's the failure fallback here? What happens when something gets partially provisioned? What happens when a test fails and/or the destruction doesn't take place?
How does this work when two PRs are running these tests at the same time and the resource group already exists, but both are trying to create the same resources, potentially with the same names?
In general, I'm more interested in how you envision this working when things go wrong, either with GitHub Actions itself (like we have with QEMU at the moment) or when tests fail and we potentially have resources partially provisioned.

safoinme commented 1 year ago

@strickvl To address the questions:

If there is a problem while provisioning the VM, triggering a new run should fix it, unless the failure is caused by changes in the PR itself. If the tests we want to run within the VM fail, the destroy will still be called and the Azure resources will be deleted.
That's a very good question and scenario (two PRs running at the same time) that we may want to test, as I don't have a clear answer as to how it would behave. I think we can add a check for whether provisioning of the resources has already been done and react based on it, but the problem with this is that whenever the main run that triggered provisioning is done, it will trigger the destroy.
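
One way both scenarios could be handled, sketched under assumptions rather than taken from this PR: a workflow-level `concurrency` group keeps two runs from provisioning at the same time, and `if: always()` on the destroy job makes it run even when earlier jobs fail. The group name, job names, and `terraform` directory are hypothetical, and the sketch assumes Terraform state is stored in a remote backend so that the destroy job can see what the provision job created.

```yaml
# Hypothetical fragment: group name, job names, and paths are assumptions.
concurrency:
  group: self-hosted-runner-vm   # shared by every PR, so only one run is active at a time
  cancel-in-progress: false      # a newer run waits instead of cancelling the active one

jobs:
  # provision and test jobs omitted for brevity
  destroy:
    needs: [provision, test]
    if: always()                 # still runs when provision or test fails or is cancelled
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init && terraform destroy -auto-approve
        working-directory: terraform  # assumed location of the Terraform config
```

Serializing runs trades speed for safety: a second PR waits for the first to finish and tear down before it provisions its own VM. GitHub keeps at most one pending run per concurrency group, so an older queued run may be cancelled when a newer one arrives.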

strickvl commented 1 year ago

@safoinme The runner doesn't seem to run, however. Is something missing? I'm not sure what's going on.

safoinme commented 1 year ago

@strickvl Yes, I was looking into the reason this morning. It turns out that our token was invalidated because it hadn't been used for so long, so now we need to generate a new one. This is a big problem that I don't think we have a solution for, unfortunately, because there is no API for token generation. If this happens, we need to generate a token manually and set it in the VM config.

strickvl commented 8 months ago

Now that we know how to do the self-hosted runners, should we close this branch? We have a ticket to implement integration tests which we can separately do. @safoinme WDYT?

safoinme commented 8 months ago

I agree, let's close this.

safoinme commented 8 months ago

We now have self-hosted runners implemented with ARC (Actions Runner Controller) at the organization level.

zenml-io / mlstacks

Create self-hosted runner for integration(-ish) CI tests #75
