web-platform-tests / wpt

Test suites for Web platform specs — including WHATWG, W3C, and others
https://web-platform-tests.org/

Infrastructure for Azure pipelines underdimensioned #33980

Open frivoal opened 2 years ago

frivoal commented 2 years ago

I don't know precisely how this is provisioned, but it seems like the infrastructure that runs the "Azure Pipelines" continuous integration jobs is under-dimensioned. Once a job starts running, it finishes pretty quickly, but it can stay queued for extended periods of time.

For instance, https://github.com/web-platform-tests/wpt/pull/33940 was blocked for about 2 hours waiting for the Azure Pipelines jobs to run. In the grand scheme of things, 2 hours may not be that much, but it's very different from 10 minutes: it turns a task you could finish in one sitting into something you have to handle across multiple work sessions, which is unwelcome overhead. If possible, it'd be nice to reduce that delay.

Thanks!

foolip commented 2 years ago

We have a maximum of 20 parallel jobs on Azure Pipelines, but after https://github.com/web-platform-tests/wpt/pull/33755 + https://github.com/web-platform-tests/wpt/pull/33861 we can trigger up to 40 jobs at the same time, each of which is expected to take ~2 hours.
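(Back of the envelope, from those numbers: 40 jobs × ~2 h each ÷ 20 parallel slots ≈ 4 hours of wall-clock time just to drain one full trigger, so anything queued behind it, including PR jobs, can easily wait a couple of hours.)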

I've sent https://github.com/web-platform-tests/wpt/pull/34015 so that only 16 jobs get triggered every 3 hours, but we will still have a backlog every day, and the same 40 jobs as currently once a week, leading to delays.
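(Roughly: 16 jobs × ~2 h ÷ 20 slots ≈ 1.6 h per trigger, which fits within a 3-hour window on its own; the backlog comes from the daily and weekly batches landing on top of that.)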

@mustjab do you think there's anything we could do about the quota? Or other ways to solve this?

mustjab commented 2 years ago

I think we can stop the Edge Dev runs and just do Canary runs for now. Also, could we schedule the weekly run for the weekend, when we have fewer runs?

@foolip Do you remember who you worked with to increase the parallel job limit before? I can also try reaching out to them to see if we can increase it a bit more.

foolip commented 2 years ago

@mustjab I don't know for certain, but I think it was @thejohnjansen who asked someone on the Azure Pipelines team to increase the limit internally. The mechanism for doing that wasn't visible to me; I could only see the increased parallelism take effect.

Regarding Edge Canary, note that because of https://github.com/web-platform-tests/wpt.fyi/issues/1635 we don't show those runs on wpt.fyi. However, with that issue fixed we could start using the Edge Canary runs instead. In any event, I think we should run either Edge Canary or Edge Dev, not both.

mustjab commented 2 years ago

Let's stop Edge Canary runs and keep only Edge Dev channel runs. We can switch these runs to daily instead of every 3 hours to help reduce the load. Does that work?

foolip commented 2 years ago

@mustjab https://github.com/web-platform-tests/wpt/pull/34015 was merged, which will run Chrome Canary only once a day. We could remove it entirely if you like. However, running Edge Dev less frequently wouldn't be great, because the wpt.fyi front page would get new aligned runs less often, and it would take longer to recover from any infra issue on any browser.

Also note that peak usage does not decrease at all unless we find a mechanism to spread out runs over time, since currently epochs/three_hourly and epochs/daily are both updated at the same time once a day. Getting backlogged once a day is better than it happening every 3 hours of course, but it would still affect wpt contributors.
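One conceivable way to spread them out (just a sketch, not something we have today; it assumes we keep the existing epochs/* branches, and the cron offsets below are made up) would be to drive these runs from staggered scheduled triggers in azure-pipelines.yml rather than branch-push triggers, so the daily batch doesn't start at the same instant as a three-hourly batch:

```yaml
# Hypothetical sketch: stagger the scheduled batches so they don't all
# start at the same moment. Cron values and display names are illustrative.
schedules:
- cron: "0 */3 * * *"          # three-hourly batch on the hour
  displayName: Three-hourly runs
  branches:
    include:
    - epochs/three_hourly
  always: true                 # run even if the branch hasn't changed
- cron: "30 12 * * *"          # daily batch offset by 30 minutes
  displayName: Daily runs (offset)
  branches:
    include:
    - epochs/daily
  always: true
```

That wouldn't reduce the total number of job-hours, it would only smooth out the peaks.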

mustjab commented 2 years ago

Thanks for merging that. Let's see if that helps with the load and if we still see issues with that, then we can stop these runs until we figure out a way to increase the limit.

For the Edge Dev channel, is there a different cadence we could use other than every 3 hours? Maybe every 6 hours, to reduce the load? That should still keep the wpt.fyi front page results fresh enough.

TalbotG commented 1 year ago

> I don't know precisely how this is provisioned, but it seems like the infrastructure that runs the "Azure Pipelines" continuous integration jobs is under-dimensioned. Once a job starts running, it finishes pretty quickly, but it can stay queued for extended periods of time.

I fully agree with you. This happened to me on May 10th, 2023 (see #39947), and presumably also in #34926.