Don't queue ICE candidates in tests (remove coupleIceCandidates)

jan-ivar commented 4 years ago

coupleIceCandidates is a wpt webrtc helper with this comment:

// Alternate function to exchange ICE candidates between two
// PeerConnections. Unlike exchangeIceCandidates, it will function
// correctly if candidates are added before descriptions are set.

I see no spec support for needing to queue ice candidates to avoid intermittents. AFAIK this is only a problem in Chrome. If I'm wrong, and there's a spec flaw, I'm happy to be shown it, but no-one has challenged my assessment in https://github.com/web-platform-tests/wpt/issues/12028#issuecomment-448015553 in almost a year.

We should remove this function and use the existing exchangeIceCandidates throughout. WPT tests should not work around Chrome bugs, because doing so hampers progress, which seems evident here: Chrome has had a year to look into and fix this bug, but haven't. If people in the field need to do similar tricks to make Chrome work reliably, then this can clearly hurt perceptions of the API and thus adoption.

cc @alvestrand @henbos

henbos commented 4 years ago

So what is the time when addIceCandidate should be called? I don't think I understand the issue well enough.

I'm working on implementing an operations chain at the moment. I had SLD/SRD/createOffer/createAnswer in mind for this, but I should make use of the chain in addIceCandidate as well.

henbos commented 4 years ago

"That is: addIceCandidate gets queued behind any pending setRemoteDescription."

That would only be true if addIceCandidate is invoked before setRemoteDescription. Are we talking about the offer or answer? Different PCs have different operations chains, right?

henbos commented 4 years ago

If I understand correctly, one must not invoke addIceCandidate when we are in 'stable' state, is that correct?

await pc1.setLocalDescription(await pc1.createOffer());
// 1: Can pc1.onicecandidate fire here, between SLD and SRD? pc2.signalingState is 'stable'.
await pc2.setRemoteDescription(pc1.localDescription);
await pc2.setLocalDescription(pc2.createAnswer());
// 2: Can pc1.onicecandidate fire here, between SLD and SRD? pc2.signalingState is 'stable'.
await pc1.setRemoteDescription(pc2.localDescription);

I might imagine the answer to 1) is "no" due to promise magic, and pc2.SRD happening immediately on pc1.SLD resolving, before onicecandidate has had a chance to fire?

But what about 2)?

Anyway, I see this comment:

...because the ICE agent mustn't start producing candidates until after the SLD success callback, and by then the other end already has SRD underway.

There may possibly be a Chrome bug here, I'm not sure. The WebRTC implementation of SLD is "completely finish executing SLD on the webrtc signaling thread (including spinning up things on the network thread) and then, post a message to the signaling thread to invoke the callback to inform Chrome that SLD completed - which happens in a post to the javascript thread".

I don't know how the order of messages are processed, but assuming the ICE agent acts on the webrtc network thread and not the webrtc signaling thread, it might be possible for it to generate ICE candidates and to post back to the webrtc signaling thread that ICE candidates are available, before SLD has had a chance to post to itself saying it is complete.

So perhaps the following order is possible: "SLD, onicecandidate, SLD complete callback"

The fact that SLD's callback is delayed is something I wanted to fix before, but I ended up breaking some dependent project, so I left it that way for backwards-compatiblity reasons. If this is actually causing ICE candidates to be signalled before SLD complete callback, we should definitely fix Chrome.

henbos commented 4 years ago

@jan-ivar, @alvestrand We should verify, but this would be a possible explanation.

jan-ivar commented 4 years ago

I'm working on implementing an operations chain at the moment.

If Chrome is missing the operations chain here, that would explain it!

I had SLD/SRD/createOffer/createAnswer in mind for this, but I should make use of the chain in addIceCandidate as well.

Yes, it's required by the spec.

If I understand correctly, one must not invoke addIceCandidate when we are in 'stable' state, is that correct?

No, thanks to the chain, the state at time of invocation isn't important, order is. What matters is: one mustn't invoke addIceCandidate(candidate) before SRD(description) of the description the candidate belongs to.

E.g this is fine:

pc.setRemoteDescription(offer);
pc.addIceCandidate(candidate); // Invoked in "stable"; processed in "have-remote-offer"

This is not fine:

pc.addIceCandidate(candidate); // rejects with InvalidStateError
pc.setRemoteDescription(offer);

Putting the above into this fiddle shows the problem.

First, here's Firefox's output, which is to spec:

---wrong order: ----------
addIce InvalidStateError
SRD success
---right order: ----------
SRD success
addIce success

Compare to Chrome:

---wrong order: ----------
addIce OperationError
SRD success
---right order: ----------
addIce success
SRD success

Here addIce succeeds before SRD, violating the spec. It also rejects with the wrong error.

jan-ivar commented 4 years ago

Speaking of operations queue, something else we should test, but currently only works in Firefox.

alvestrand commented 4 years ago

"What matters is: one mustn't invoke addIceCandidate(candidate) before SRD(description) of the description the candidate belongs to."

How do we guarantee that? Queueing candidates until SRD is complete accomplishes it, but is overkill and not required by spec; queueing candidates until SRD starts would be according to spec; ensuring that candidates aren't created until SRD starts would also accomplish it, but I don't see how any piece of the spec guarantees that; the two PCs aren't linked in any way.

alvestrand commented 4 years ago

the following code is legal, but will (I think) always fail:

pc1.oncandidate = (c) => pc2.setIceCandidate(c.candidate) offer = await pc1.createOffer() pc1.setLocalDescription() setTimeout(pc2.setRemoteDescription(offer), 1)

there are multiple ways for someone to get the effect of the timeout without being very clear in his mind that this is going to happen. Using coupleIceCandidates() helps them not shoot themselves in the foot.

Simpler-to-use helpers are better, when we're not testing the specific feature targeted by the test.

henbos commented 4 years ago

My take:

There MAY be a bug in Chrome where onicecandidate fires too early due to our threading model, this needs to be confirmed. There should definitely be a test that catches this out explicitly if this is the case, failing Chrome.
(Assuming correct implementation) exchangeIceCandidates is only ever not flaky under assumptions that pc2.SRD follows IMMEDIATELY after pc1.SLD. This seems like a bad assumption to bake in to a generic helper function used in almost any test, and by extension, to almost every test in WPT, even though most tests are written this way.
Generally speaking (behavioral driven test design principles), we don't want a test that tests behavior X to fail because of not passing behavior Y. While less code and less helper logic is good (we don't want to end up testing an intermediate API layer; we want to test the PC directly), it's also good that our test suite covers as many behaviors as possible, and doesn't fail for unrelated reasons. While more tests failing on Chrome would put more pressure on Chrome to fix bugs, it would do so at the cost of making WPTs less reliable indicators as to what is or is not supported.

Remember addTrack? Chrome implemented it supporting the no stream argument case, but that ended up not running well on other browsers. So we updated most of the tests to take a stream as an argument, so that all relevant behaviors would be exercised on all browsers by not failing due to missing stream argument, even though the tests were testing things that didn't need a stream. We put the "no stream" as a separate test.

Proposal:

Add an explicit test that catches Chrome's bug, but don't remove coupleIceCandidates for now if it causes flakes.
We may want to remove coupleIceCandidates in the future, but per Harald's comment, there may be a need for it yet.

henbos commented 4 years ago

If Chrome is missing the operations chain here, that would explain it!

It doesn't have one, but it almost has one. It has a weird threading model that most of the time does the trick anyway. Basically, if you call SRD followed by addIceCandidate, this will ALWAYS execute SRD before addIceCandidate. Why? Both SRD and addIceCandidate are executed, one after the other, on the same thread. Sometimes we block JS, and sometimes we don't, but we always get the same order. Only when we have something cached in the JS layer do we truly avoid jumping threads.

I think the problem here being that JavaScript becomes aware of SLD completing with a small delay, so when you do this:

await pc1.setLocalDescription(offer);
pc2.setRemoteDescription(offer);

You're kind of implicitly doing this really:

await pc1.setLocalDescription(offer);
await new Promise(p => setTimeout(p, 0));
pc2.setRemoteDescription(offer);

That is, if the unconfirmed bug is what I think it is.

jan-ivar commented 4 years ago

This is already guaranteed with naive signaling (no later than the SLD success callback):

pc.onicecandidate = e => signaling.send(e.candidate);
pc.setLocalDescription(offerOrAnswer).then(() => signaling.send(pc.localDescription));

This guarantees the following order on the signaling channel (e.g. with 3 candidates):

offer candidate candidate candidate

...in both directions:

answer candidate candidate candidate

This in turn guarantees correct processing order on the receiving end, using obvious signaling:

signaling.onmessage = async ({data: {description, candidate}}) => {
  if (description) {
    await pc.setRemoteDescription(description);
    ...
  } else if (candidate) {
    await pc.addIceCandidate(candidate);
  }
}

I.e. guaranteeing the following invocation order:

pc.setRemoteDescription(description);
pc.addIceCandidate(candidate);
pc.addIceCandidate(candidate);
pc.addIceCandidate(candidate);

I.e. addIceCandidate(candidate) never happens before the related SRD(description).

jan-ivar commented 4 years ago

The above was in response to @alvestrand's "How do we guarantee that?"

jan-ivar commented 4 years ago

I'm happy to have identified the Chrome bug. Let's get that fixed, and then remove the helper that wallpapered over it for a year.

As to the benefits of helpers, that's subjective. I'm not against all helpers. I'm sympathetic to not having every test fail on behavior Y—the above is my compromise here—but my experience is that asynchronous helpers are problematic when done wrong, basically when they obfuscate asynchronous spec APIs. See https://github.com/web-platform-tests/wpt/issues/19890.

jan-ivar commented 4 years ago

Simpler-to-use helpers are better, when we're not testing the specific feature targeted by the test.

For those, why not use the spec's abstraction for negotiation: perfect negotiation? Option 3 in https://github.com/web-platform-tests/wpt/issues/19890#issuecomment-546444576.

henbos commented 4 years ago

I've documented the problem here: https://bugs.chromium.org/p/chromium/issues/detail?id=1019232

fippo commented 9 months ago

The bug is marked as fixed so can we close this issue (and the other ice candidate ones, #13906 and #12028)? I think we haven't seen many issues recently.

web-platform-tests / wpt

Don't queue ICE candidates in tests (remove coupleIceCandidates) #19866