web-platform-tests / results-collection


Synchronize browsers used by WPT CI and results collector #535

Open jugglinmike opened 6 years ago

jugglinmike commented 6 years ago

In our OKR document for Q1 2018, the following text was listed as a KR of the priority-2 objective titled, "web-platform-tests continuous integration is reliable and useful":

wpt CI runs the same experimental browsers as the dashboard

The intent behind this isn't clear, and the acceptance criteria are also somewhat vague. Here's where we stand and where we're headed on the results collection front:

| Browser | WPT CI  | wpt.fyi (today) | wpt.fyi (soon)   |
| ------- | ------- | --------------- | ---------------- |
| Chrome  | Dev     | 64              | Stable & Dev     |
| Edge    | n/a     | 15              | 15               |
| Firefox | Nightly | 59              | Stable & Nightly |
| Safari  | n/a     | 11              | 11               |

Since no one is running experimental builds of Safari or Edge, I think it's safe to say those browsers are not relevant for this goal.

The WPT CI process has long run Chrome's "dev" channel, but it's only thanks to @hexcles's recent efforts that failures are actually visible.

Once this project is collecting results from the experimental builds of Chrome and Firefox (see gh-388 and gh-521), it seems like we will have reached this goal.

@foolip can you weigh in on this?

foolip commented 6 years ago

I think it would be valuable to ensure that the configurations for Chrome and Firefox are as similar as possible, or even identical, on the wpt.fyi running infra and on Travis; that will help towards https://github.com/w3c/web-platform-tests/issues/7475.

So, what are our options for making this true, short of having Travis use the same infra behind the scenes and just waiting, or replacing Travis outright?

Agree to punt on Edge and Safari until we have a CI solution for them.

jugglinmike commented 6 years ago

Our options depend on how widely you define the term "configuration". That might include:

The more similar, the better, but there's a trade-off between parity and the complexity required to achieve it.

If we limit ourselves to "process-level", we could address this today by manually maintaining duplication in the scripts for the two projects. We could maybe go a step further and define an "automation mode" for the WPT CLI which enables all the configuration we're interested in (perhaps also disallowing any additional flags). I'd need to run that by @jgraham before claiming that to be a workable solution, though.
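As a sketch of how such an "automation mode" might behave (the flag names and defaults here are invented for illustration; the real WPT CLI's options may differ), the mode could force a shared set of defaults so the two projects can't drift apart:

```python
import argparse

# Hypothetical automation-mode defaults; the real WPT CLI's flags differ.
AUTOMATION_DEFAULTS = {
    "channel": "nightly",     # always test the experimental channel
    "headless": True,         # no display server in CI
    "install_browser": True,  # fetch the browser rather than use a local one
}

def parse_args(argv):
    parser = argparse.ArgumentParser(prog="wpt")
    parser.add_argument("--automation", action="store_true",
                        help="apply the shared CI configuration")
    parser.add_argument("--channel", default="stable")
    parser.add_argument("--headless", action="store_true")
    parser.add_argument("--install-browser", action="store_true")
    args = parser.parse_args(argv)
    if args.automation:
        # Force the shared defaults so Travis and the results collector
        # cannot diverge; conflicting flags could be rejected here instead.
        for key, value in AUTOMATION_DEFAULTS.items():
            setattr(args, key, value)
    return args
```

The point of centralizing the defaults in one table is that "what does CI run?" has a single answer, rather than being scattered across two projects' scripts.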

If we want more parity, then manually maintaining synchrony starts to seem onerous. (As requested, I'll ignore the possibility of relying on the same results collection service from both projects.)

We could define a separate repository with system-level configuration scripts. A tool like Puppet or Ansible could be used during setup for CI jobs and for the results collector. We have a proof-of-concept for this in the system we're currently using to collect results.

However, configuring a system "from scratch" may have an unacceptable impact on time-to-results for pull request validation. It's also not a viable option for managing closed-source browsers. For that, we may need to get into the business of maintaining images.

We might be able to do this with Docker. WPT could consume Docker images through its existing TravisCI integration. I'm fuzzy on Docker support for Windows virtualization, though. Furthermore, given that we can't necessarily take TravisCI for granted, we shouldn't over-value solutions that integrate conveniently with that service.

Full machine images may be necessary. With recent trends in "immutable infrastructure," the tooling around this kind of operations management is becoming quite sophisticated.

There may be other alternatives, too. Does this give you any ideas?

foolip commented 6 years ago

Given that we want wpt run to work on test writers' machines, and that browsers themselves strive for consistency even across different operating systems, I think that process-level synchronization is a reasonable first step that might be enough for a long time. If you squint, you could even claim it's good that we're exercising two system configurations, although that's just incidental and not really a goal in itself.

What could a solution for this in the wpt CLI look like? Would it be such that upgrading the nightly browser version is a PR on wpt?

As requested, I'll ignore the possibility of relying on the same results collection service from both projects

Still interested in what you think. I'm under the assumption that it's kind of an inevitable end point, but that it'll make more sense once we have solutions for Edge and Safari that we're happy with and are fast enough.

jugglinmike commented 6 years ago

What could a solution for this in the wpt CLI look like? Would it be such that upgrading the nightly browser version is a PR on wpt?

I think we'd want to avoid maintaining our own definition of "nightly" within WPT CLI. Generally, it seems most hygienic to rely on the vendors as the source of truth there.
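For example, Mozilla publishes current version numbers through its product-details service, so "nightly" could be resolved from the vendor rather than hard-coded (a sketch; other vendors would need their own lookups, and any caching or pinning policy is left open):

```python
import json
from urllib.request import urlopen

# Mozilla's product-details service publishes current version numbers;
# other vendors would need their own lookups.
FIREFOX_VERSIONS_URL = "https://product-details.mozilla.org/1.0/firefox_versions.json"

def nightly_version(payload):
    """Extract the current Nightly version from a product-details payload."""
    return payload["FIREFOX_NIGHTLY"]

def fetch_nightly_version(url=FIREFOX_VERSIONS_URL):
    # Live network fetch; a CI job would likely cache or pin the result.
    with urlopen(url) as response:
        return nightly_version(json.load(response))
```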

I wrote out some concrete ideas for how the feature would behave, but it probably makes sense to discuss that in web-platform-tests. I've posted my thoughts as two new issues in that project:

Still interested in what you think. I'm under the assumption that it's kind of an inevitable end point, but that it'll make more sense once we have solutions for Edge and Safari that we're happy with and are fast enough.

Beyond the challenges of managing those browsers, the responsibility of pull request validation would also make our workload more irregular. That would be cause to extend (or even re-architect) the current infrastructure. Tools like Consul or Kubernetes would help us scale horizontally, and this may even be cause to re-think the use of Buildbot.

With all the talk about revision announcers and vendor-supplied test results, I actually wasn't sure how you're thinking about testing Edge and Safari on our own. Reading your interest along these lines, I'm happy to start learning more about the problem space and talking to some of the stakeholders.

foolip commented 6 years ago

Thanks for filing those issues, will comment there!

With all the talk about revision announcers and vendor-supplied test results, I actually wasn't sure how you're thinking about testing Edge and Safari on our own. Reading your interest along these lines, I'm happy to start learning more about the problem space and talking to some of the stakeholders.

Using Sauce, or BrowserStack, or anything that puts the browser far away from wptserve on the network seems unworkable in the long run: it would be a source of flakiness that's forever unfixable, and it could never run very fast.

So step 1 is to figure out how we're going to get Edge and Safari results into wpt.fyi on a more sound setup. There are a few options we're exploring there, as you know. Before that it doesn't make too much sense, I think, to contemplate running these browsers on every PR.

But... actually, we don't need to gate a PR solution for Chrome and Firefox on figuring out the wpt.fyi waterfall builds for Edge and Safari.

This is what I'd like:

For the Travis jobs that effectively run wpt run (currently the stability jobs), don't use Travis at all. Instead, we'd have custom GitHub status checks, similar to the "Participation" check on https://github.com/whatwg/xhr/pull/200. Possibly https://developer.github.com/v3/guides/building-a-ci-server/ is the relevant documentation for this.

Those checks don't need to abide by any timeout, and we can just have one per browser configuration. If we ever want to have checks that depend on the results of more than one run ("fails in all? extra bad!") then that could just be a separate check that waits for the rest.
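That "waits for the rest" meta-check could be as simple as folding the per-browser states (using GitHub's commit-status vocabulary) into one; a hypothetical sketch:

```python
def combined_status(per_browser):
    """Fold per-browser commit-status states into one meta-check state.

    per_browser maps a browser name to a GitHub-style state:
    "pending", "success", "failure", or "error".
    """
    states = set(per_browser.values())
    if not states or "pending" in states:
        return "pending"  # wait until every browser has reported
    if states <= {"success"}:
        return "success"
    # Any failure or error fails the meta-check; failing in *every*
    # browser ("extra bad") could additionally tweak the description.
    return "failure"
```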

Internally, those checks would offload to something very much like the infra to run all of wpt, but with the --verify or --stability argument added, whichever it is. As an added complication, the whole branch being tested must be treated as even more untrusted code than what's on master, so more serious sandboxing and discarding VMs between each use may be needed.
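Posting those per-browser statuses would go through GitHub's commit-status API; a sketch of building such a request (the context string is illustrative, and actually sending the request is left to the caller):

```python
import json
from urllib.request import Request

def build_status_request(owner, repo, sha, state, context, token,
                         description=None, target_url=None):
    """Build a GitHub "create a commit status" request.

    The endpoint and payload shape follow GitHub's status API; sending
    the request (urllib, requests, ...) is left to the caller.
    """
    url = "https://api.github.com/repos/%s/%s/statuses/%s" % (owner, repo, sha)
    payload = {"state": state, "context": context}
    if description is not None:
        payload["description"] = description
    if target_url is not None:
        payload["target_url"] = target_url
    return Request(url,
                   data=json.dumps(payload).encode("utf-8"),
                   headers={"Authorization": "token %s" % token,
                            "Content-Type": "application/json"},
                   method="POST")
```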

When would I like it? Dunno, when it seems like the worst problem on our hands, which isn't for a bit longer I think. @lukebjerring?

@jugglinmike, if there isn't an existing issue which covers this and you think it's worth tracking, please go ahead and file an issue :)

foolip commented 6 years ago

@domenic, you built the WHATWG status checks, is https://developer.github.com/v3/guides/building-a-ci-server/ the right documentation to start from?

domenic commented 6 years ago

I think that's what I used, yeah. Indeed there are two main points of interaction with the GitHub API: a webhook to detect new PRs, and the status API endpoint to post new statuses to that PR's commits.

You can browse the code in https://github.com/whatwg/participate.whatwg.org/tree/master/lib; pr-webhook.js is the main file. server-infra/validate-github-webhook.js is also somewhat important.
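The webhook-validation step mentioned above boils down to checking GitHub's X-Hub-Signature header, which is an HMAC-SHA1 of the raw request body keyed with the webhook secret; a minimal sketch:

```python
import hashlib
import hmac

def valid_signature(secret, body, signature_header):
    """Check a GitHub webhook's X-Hub-Signature header.

    GitHub signs the raw request body with HMAC-SHA1 keyed by the
    webhook secret and sends it as "sha1=<hexdigest>".
    """
    expected = "sha1=" + hmac.new(secret, body, hashlib.sha1).hexdigest()
    # compare_digest avoids leaking the mismatch position via timing.
    return hmac.compare_digest(expected, signature_header)
```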

jgraham commented 6 years ago

If the proposal is to run the stability jobs on custom infrastructure, I'll reiterate that we can use TaskCluster, which already has built-in GitHub integration, allows access to substantial resources, can have long time limits, and has a significantly better architecture than Travis (or Buildbot).

Generally I haven't pushed too hard on this because I understand that people are wary of "nonstandard" solutions, and there are several requirements for wpt.fyi that might not integrate easily into TaskCluster, particularly around custom Windows versions and macOS. But if the choice for running stability jobs is between a DIY setup and something that is already built and proven at a scale well beyond our needs, then I think we should take advantage of the existing solution.

foolip commented 6 years ago

Can we also use TaskCluster for wpt.fyi, at least for Linux? Having two tech stacks is itself part of the problem/nuisance.

If there's no path for running fresh Windows Insider Preview or Safari Technology Preview builds on TaskCluster, that takes it out of the running for, well, both waterfall and PRs. But we're already moving towards having a diversity of runner infras for wpt.fyi, so we should be open to partial solutions where all-encompassing ones seem not to exist.

@jugglinmike, have you done any reading on TaskCluster, WDYT?

jgraham commented 6 years ago

Yes, we could use it for Linux. I have a PR [1] open to have it run every push on master in Chrome and Firefox Nightly; if we merge that, it could be running right away (the GitHub integration would have to be set up, of course). Some integration with wpt.fyi would be needed to get the wptreport.json files from each run.
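Tallying results out of such a report could be straightforward; a sketch assuming the usual wptreport.json shape of a top-level "results" list with per-test statuses and subtests:

```python
from collections import Counter

def summarize_report(report):
    """Tally test and subtest statuses from a wptreport-style payload.

    Assumes a top-level "results" list of objects with a "status" and an
    optional "subtests" list, the general shape of wptreport.json output.
    """
    tests, subtests = Counter(), Counter()
    for result in report.get("results", []):
        tests[result["status"]] += 1
        for sub in result.get("subtests", []):
            subtests[sub["status"]] += 1
    return tests, subtests
```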

I understand the desire to avoid a plurality of infrastructure. But I have several countervailing concerns:

To the extent that custom hardware or unusual operating system configuration is required to get results, I think that outweighs my concerns for wpt.fyi. However, I haven't heard those requirements expressed for the stability-checking part.

I also understand that there might be similar concerns about Taskcluster being tied to a specific organisation. However there are some differences:

[1] https://github.com/w3c/web-platform-tests/pull/9226

foolip commented 6 years ago

On the governance, I share those exact concerns/biases. My ideal is really that each browser vendor takes care of running its own browser and submitting results to wpt.fyi, but in a way that is in principle reproducible, i.e., if people run wpt run locally, they can confirm the results even if they can't make it run fast.

To get to that world, we think we need to bootstrap wpt.fyi into good shape, with frequent runs for almost everything we care about, to make it useful and indispensable. That gets us the lock-in we need to make the ideal world sustainable or somehow self-reinforcing, where people want to join the party.

If Mozilla wants to run Firefox right away, on TaskCluster or anything, we could set that up in a matter of weeks with what @Hexcles is working on.

I think, though, that there's benefit in running Chrome and Firefox on similar infrastructure, because it must be possible, and should reduce total engineering cost.

At this point, I think that @jgraham and @jugglinmike should talk to each other :)