antross commented 6 years ago

Travis has been really slow and we've had trouble getting Windows to work. Need some investigation to determine if we can improve the speed and reliability of our existing pipeline or if we should consider migrating to an alternative CI solution.

Current status

The webhint project is using Travis CI as its CI system. It's used in all the projects that require a build step: hint, webhint.io, and the online service.

The project that has the most complex configuration is hint. We are testing for Linux and Mac and for 3 different node versions:

Current
LTS
LTS - 1
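
The targets above are effectively a matrix of operating systems and node versions. A minimal sketch of how that matrix expands, assuming placeholder labels (the OS and version names below are hypothetical, not copied from our Travis configuration):

```typescript
// Hypothetical labels: these are placeholders, not values taken from the
// project's CI configuration.
type BuildTarget = { os: string; node: string };

// Build the cartesian product of operating systems and node versions.
function buildMatrix(oses: string[], nodeVersions: string[]): BuildTarget[] {
    const targets: BuildTarget[] = [];

    for (const os of oses) {
        for (const node of nodeVersions) {
            targets.push({ os, node });
        }
    }

    return targets;
}

// Two operating systems and three node versions give six targets.
const hintMatrix = buildMatrix(['linux', 'macos'], ['current', 'lts', 'lts-1']);
```

Every extra OS or node version multiplies the number of targets, which is why dropping a single node version has a noticeable effect on the queue.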

Travis offers 5 workers that are shared across the whole organization. To speed up the CI process, we split hint's build into multiple chunks that only run the tests for the things that have been modified. While this gives faster build times for a given PR (running all the tests on Travis was getting close to 1h for a given OS and node target), the downside is that we can run fewer branch builds simultaneously. In fact, right now a build in hint spawns 12 jobs. A temporary workaround would be to disable builds for a specific target (probably node 10). This would improve build times a bit, but it's more of a band-aid than a long-term solution.
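
To illustrate the idea of only testing what was modified, here is a minimal sketch. It assumes a `packages/<name>/...` monorepo layout and made-up file paths; `packagesToTest` is a hypothetical helper, not the script we actually run on Travis:

```typescript
// Assumption: sources live under `packages/<name>/...`. Files outside that
// folder do not select any package in this sketch.
function packagesToTest(changedFiles: string[]): string[] {
    const found: string[] = [];

    for (const file of changedFiles) {
        const parts = file.split('/');
        const isPackageFile = parts.length > 2 && parts[0] === 'packages';

        // Record each package once, no matter how many of its files changed.
        if (isPackageFile && found.indexOf(parts[1]) === -1) {
            found.push(parts[1]);
        }
    }

    return found.sort();
}

// Hypothetical change set: two files in one package, one in another, and a
// root file that does not belong to any package.
const selectedPackages = packagesToTest([
    'packages/hint-a/src/index.ts',
    'packages/hint-a/README.md',
    'packages/utils/package.json',
    'README.md'
]);
```

The selected packages can then be distributed across jobs, which is what keeps a single PR fast at the cost of occupying more workers.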

While there aren't that many changes on any given day, sometimes greenkeeper generates a few branches and PRs at once, making the queue extremely long and slowing progress. It would be nice to de-prioritize commits from specific authors or with specific keywords (à la [skip ci]).
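
The ordering we are after could look like the sketch below. This is only the desired behavior, not something Travis exposes; the author name and the `[low priority]` keyword are hypothetical examples:

```typescript
type QueuedBuild = { id: number; author: string; message: string };

// Hypothetical rules: which authors and which commit keyword get deferred.
const lowPriorityAuthors = ['greenkeeper[bot]'];
const lowPriorityKeyword = '[low priority]';

function isLowPriority(build: QueuedBuild): boolean {
    return lowPriorityAuthors.indexOf(build.author) !== -1 ||
        build.message.indexOf(lowPriorityKeyword) !== -1;
}

// Regular builds go first and deferred builds last; the original order is
// preserved inside each group.
function prioritize(queue: QueuedBuild[]): QueuedBuild[] {
    const regular = queue.filter((build) => !isLowPriority(build));
    const deferred = queue.filter((build) => isLowPriority(build));

    return regular.concat(deferred);
}
```

With this ordering a burst of greenkeeper PRs would still get tested, just after the commits from maintainers.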

In the past we were also using AppVeyor, but we stopped because the results were unreliable. Travis recently added support for Windows builds, but we couldn't get it working without making changes.

Another issue is the reliability of the test runs. Sometimes a worker fails and running the job again is enough to pass the build. It would be nice to have an auto-retry option.
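
A minimal sketch of what auto-retry means here, assuming a job can be modeled as a function that throws when the worker fails. `runWithRetry` is a hypothetical helper and is synchronous to keep the example short; a real CI job would be asynchronous:

```typescript
// Runs `job` until it succeeds or `maxAttempts` is reached, then rethrows
// the last failure.
function runWithRetry<T>(job: () => T, maxAttempts: number): T {
    if (maxAttempts < 1) {
        throw new Error('maxAttempts must be at least 1');
    }

    let lastError: unknown;

    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return job();
        } catch (error) {
            // Remember the failure and try again if attempts remain.
            lastError = error;
        }
    }

    throw lastError;
}
```

A low limit (2 or 3 attempts) is probably enough: it covers a flaky worker without hiding a test that fails consistently.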

Also, we keep adding targets: npm, VS Code, browsers, etc. It would be nice to auto-publish when possible, and even to have a developer version (@next) that automatically gets promoted periodically.
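
For npm, one possible way to promote a developer version is to move a dist-tag to a version that is already published, using `npm dist-tag add`. A minimal sketch of building that command; the version string is a placeholder and `promoteCommand` is a hypothetical helper:

```typescript
// Builds the npm command that points `tag` at an already published version,
// so promoting does not require publishing the package again.
function promoteCommand(packageName: string, version: string, tag: string): string {
    return 'npm dist-tag add ' + packageName + '@' + version + ' ' + tag;
}

// Placeholder version: a scheduled release would look up the real one.
const promoteHint = promoteCommand('hint', '0.0.0-example', 'latest');
```

Other targets such as VS Code or the browser stores would need their own publishing scripts.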

In summary, this is what we want to solve:

Faster build times
Windows, macOS and Linux builds
Process some PRs/commits faster than others (e.g. greenkeeper always gets the lowest priority)
Auto-retry failing jobs
Auto-publish packages to beta/stable channels

Possible solutions

Continue with Travis

The main advantage of Travis is that it's already working. There are a few things we should do:

Remove one of the build targets. This should improve build times by 1/3, but the trade-off is reduced coverage.
Reduce the number of commits we look back at when testing on master (currently 5).
Create all the scripts necessary to publish in VS Code, stores, etc. (this is needed for all the other solutions)
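
A rough way to sanity-check the 1/3 estimate, assuming all jobs take about the same time, the 12 jobs are split evenly across the 3 node versions, and no other build in the organization is competing for the 5 workers:

```typescript
// Number of rounds needed when `jobs` equally long jobs share `workers`.
function rounds(jobs: number, workers: number): number {
    return Math.ceil(jobs / workers);
}

const travisWorkers = 5;
const jobsPerBuild = 12;
const nodeVersionCount = 3;

// Assumption: each node version accounts for the same number of jobs.
const jobsWithoutOneTarget = jobsPerBuild - jobsPerBuild / nodeVersionCount;

const roundsToday = rounds(jobsPerBuild, travisWorkers);
const roundsWithoutOneTarget = rounds(jobsWithoutOneTarget, travisWorkers);
```

Under these assumptions a build goes from 3 rounds of jobs to 2, which is in line with the estimate above.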

The main cons are:

Builds are still really slow (we run all the tests sequentially).
There doesn't seem to be a way to change the priority of queued builds (some related issues: https://github.com/travis-ci/travis-ci/issues/2801, https://github.com/travis-ci/travis-ci/issues/1155, https://github.com/travis-ci/travis-ci/issues/8828).
Debugging Travis is painful (not that other systems are a lot better, but not having priorities forces you to set up Travis for your fork, duplicate all the work, and cross your fingers that everything works).
Windows builds right now just fail after installing node. See the latest output of the results.
We are still limited to 5 workers and there doesn't seem to be a way to pay for more.

AppVeyor

AppVeyor supports Linux builds but not macOS. We also disabled it because of reliability problems when running the tests. On top of that, they take quite some time to update their VMs to support newer versions of node.

I don't think it's a viable option.

Azure Pipelines

Azure Pipelines is a new CI/CD system by Microsoft. It supports Linux, macOS and Windows, and it's free for OSS projects with 10 free parallel jobs. We can also pay for more parallel jobs ($40/mo).

Pros:

More parallel jobs than any other free alternative.
Windows, macOS, and Linux on the same system.
- Builds seem to be faster than Travis. This is a run passing on all OSes, although only on node 8 (and Linux seems to execute less work for whatever reason 🤔)
- We might be able to reduce the number of split jobs if the machines are powerful enough.
Separation between builds and releases
- Releases integrate with plenty of services, including npm, and we should be capable of creating new things for our needs (VS Code marketplace and browser stores if needed)
- Releases can be scheduled. We can probably use them for cron jobs or to promote builds from dev to stable if we add some gates.
Dashboard and build output are open to anyone and integrate nicely with GitHub (same as other solutions like Travis and AppVeyor).
No need to manage the servers.
Can probably get support from the team.
We can ignore topic branches, which could be useful to avoid testing twice (branch + PR).
Easy to publish to Kubernetes (so possibility to auto-deploy the online service more easily)
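
The "avoid testing twice" point boils down to a small rule. In Azure Pipelines this would be expressed in the pipeline configuration rather than in code, so the sketch below only illustrates the logic; the branch names and the `BuildTrigger` shape are hypothetical:

```typescript
type BuildTrigger = { kind: 'push' | 'pull_request'; branch: string };

// Always build PRs. Build pushes only for the listed main branches, so a
// topic branch with an open PR is tested once instead of twice.
function shouldBuild(trigger: BuildTrigger, mainBranches: string[]): boolean {
    if (trigger.kind === 'pull_request') {
        return true;
    }

    return mainBranches.indexOf(trigger.branch) !== -1;
}
```

For example, a push to a topic branch would be skipped while the PR opened from that same branch would still be built.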

Cons:

There doesn't seem to be a way to change the build priority yet.
Need to move everything to Azure DevOps (including the website's code, etc.)
You need to use yml to configure it (although that's more of a personal issue...), but you can use templates to split them.
Maintainers with access to the dashboard might need an account different than a GitHub one. Need to double check on this.

Jenkins

Jenkins is a self-contained, open source automation server which can be used to automate all sorts of tasks related to building, testing, and delivering or deploying software.

Jenkins can be installed through native system packages, Docker, or even run standalone by any machine with a Java Runtime Environment (JRE) installed.

This means it can run on Windows, macOS, and Linux. We can also scale by adding more workers but the biggest con is that we have to maintain the whole infrastructure, update versions, deal with security issues, authentication, etc.

You can use pipelines, written using Groovy syntax, to define the different build and release steps. There is a new UI called Blue Ocean which is a lot better than what was available a few years back.

Pros:

Support for all platforms
Lots of plugins for most of our needs
Based on my previous experience, it is easier to define pipelines with Groovy than with yaml (or at least it is easier to validate that the files are correct)
We can scale by adding more workers

Cons:

We will have to maintain the whole infrastructure: OS, Jenkins, plugins, node, workers, etc. I have to do that for another project and it's not fun (and we only have one instance). This is the biggest con.
We need a cloud service and have to pay for it. Cloudbees has some free tiers for some of their products, but I'm not sure how they work and if they can compete in terms of parallelization with other free solutions.

Summary

The following table summarizes the above. The option that looks most promising for now is Azure Pipelines, so I'll continue investigating it a bit more unless someone else has another idea.

| Feature | Travis | AppVeyor | Jenkins | Azure Pipelines |
| --- | --- | --- | --- | --- |
| multiOS | ✅ (beta) | ❌ (mac) | ✅ | ✅ |
| Free | ✅ | ✅ | ❌ (infra) | ✅ |
| # workers | 5/org | 1 | ♾ (VMs) | 10/project(?) + $40/mo per extra |
| Plugins | ❌ | ❌ | ✅ | ✅ |
| Language | yaml | yaml | Groovy | yaml + templates |
| Managed | ✅ | ✅ | ❌ | ✅ |
| Queue priority | ❌ | ❌ | ✅ (plugin) | ❌ |
| Services integration | Manual | Manual | Via plugins | Via plugins |

molant commented 5 years ago

@webhintio/core I'll be updating the first comment with the research I do, feel free to let me know if I miss any requirements.

molant commented 5 years ago

Things to test in Azure Pipelines:

[x] Run multiple versions of node per platform
[x] Check how long it takes to test everything
- A full Windows build takes about 40min. Other OSes should take less (but they failed, so I can't give exact numbers)
[x] Run side by side with Travis to see which one goes faster (will re-enable Travis on my fork)
- Azure finished the builds for all OSes in 40min (some of them failed), Travis is still going after 53min (and it's not running Windows builds)
[x] Fix differences between builds (different yarn version in the agents)
[x] Use templates for the build
[ ] Make sure tests with a browser run correctly
- [x] Chrome is not installed on the macOS agent
- [ ] On Linux there are a lot of issues connecting to localhost on all node versions message: 'Problem loading the website http://localhost:9941/'
[ ] Find a way to restart an agent in case it fails
[ ] Find a way to retry builds/jobs automatically (maybe through conditions?)
[ ] Find how to add secrets (we can use Azure Keyvault but there's probably something else that requires less infrastructure and is still secure)
[ ] Create a release for the website to see how the integration with Azure WebApps is (probably use a new site before fully migrating)

@webhintio/core something else I should

kaylangan commented 5 years ago

I'm a Program Manager on Azure Pipelines. To answer a few of your questions:

> You need to use yml to configure it (although that's more of a personal issue...), but you can use templates to split them.

We do have a visual designer, but we do strongly encourage using yml.

> Maintainers with access to the dashboard might need an account different than a GitHub one. Need to double check on this.

At the moment, yes, you will need a different account. We are currently working on integrating with GitHub identities and that's coming soon.

> Find how to add secrets

Check out the docs here.

molant commented 5 years ago

Hi @kaylangan, thanks for your answers!

I have a couple issues so far with Pipelines that I hope you can help me with:

Is there a way to restart an agent that has failed? Similar to what Travis allows you? I asked this on Twitter yesterday
Is there an easy way to find the name of the built-in and marketplace tasks? I started with a yml file but found out I needed to install a yarn task to make sure all the agents were using the same version, and the only way I found was to start creating a new pipeline using the editor and then switch to yml to copy and paste 😞

kaylangan commented 5 years ago

> Is there a way to restart an agent that has failed? Similar to what Travis allows you? I asked this on Twitter yesterday

On the Checks tab, you can click "Re-run".

> Is there an easy way to find the name of the built-in and marketplace tasks? I started with a yml file but found out I needed to install a yarn task to make sure all the agents were using the same version, and the only way I found was to start creating a new pipeline using the editor and then switch to yml to copy and paste 😞

For the tasks, you can find the tasks repo here. In the task.json file, you can find the name of the task. For example, for the UsePythonVersion task: https://github.com/Microsoft/azure-pipelines-tasks/blob/master/Tasks/UsePythonVersionV0/task.json#L3.
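
As a rough illustration of reading the name out of a task.json file, here is a minimal sketch. It assumes the manifest has a `name` field and a `version` object with a `Major` field, and that a task is referenced in yml as `name@major`. The sample manifest is abbreviated and its version numbers are made up:

```typescript
// Assumed (partial) shape of a task.json manifest.
type TaskManifest = {
    name: string;
    version: { Major: number; Minor: number; Patch: number };
};

// Returns the reference to use in a yml file, e.g. `SomeTask@0`.
function taskReference(taskJson: string): string {
    const manifest = JSON.parse(taskJson) as TaskManifest;

    return manifest.name + '@' + manifest.version.Major;
}

// Abbreviated, hypothetical manifest for illustration only.
const sampleManifest = '{"name":"UsePythonVersion","version":{"Major":0,"Minor":1,"Patch":2}}';
const sampleReference = taskReference(sampleManifest);
```

Marketplace tasks may follow a different naming scheme, so this only covers the tasks found in that repo.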

molant commented 5 years ago

I think I have everything I need for this research. Azure Pipelines looks pretty solid and a good option for us, and now it's more a question of tweaking and polishing. I'll keep working on my fork until I have something ready to get merged into the main repo, and I'll open an individual task for each thing.

webhintio / hint

Research CI Improvements [3] #1621