wpilibsuite / allwpilib

Official Repository of WPILibJ and WPILibC
https://wpilib.org/
Other
1.08k stars 611 forks source link

Glass cyclically disconnects in NT4 mode #5263

Closed chauser closed 9 months ago

chauser commented 1 year ago

Transferring this from Discussion #5262 We observed several times over the season that glass would enter a mode where it would connect, disconnect within a second, reconnect, etc. When it occurred it would happen on multiple glass clients (2) connected to the robot at the same time. Switching to NT3 mode ended the cycle. This continued even with the latest WPILib release.

At the time there was no opportunity to gather more data about what was happening and I do not know how to reproduce the problem.

I'm putting it here as a discussion rather than an issue to find out if others have noticed the behavior. I'll dig deeper if it happens again, but we are going to be interacting with the robot a lot less so it may not come up again for us this year.

@PeterJohnson replied:

This has been reported a few times. In general it is more likely to happen when there are a large number of topics (~1000), or a large amount of data per topic, and a constrained network environment (poor wireless). It was more common in earlier releases, as we've made a number of changes to try to address this throughout the season, but it's been too high risk to make the more significant changes required to completely address the underlying issue mid-season.

Fundamentally the cause of this is the amount of data that needs to be sent in response to the initial subscription causes a backlog in the network connection, which results in the server terminating the connection if the backlog doesn't clear (and there's more data to be sent) within ~1 second. Glass creates a bigger challenge for this than other dashboards, because it subscribes not only to the topics, but also to all the "meta" topics (these are topics that describe the clients, publishers, and subscribers for each of the "real" topics), which roughly translates into 3x the number of topics and initial data being sent.

I have a few strategies in mind for addressing the issue, the first one is the real solution.

PeterJohnson commented 1 year ago

Fixed in #5659.

AngleSideAngle commented 9 months ago

Similarly to #5817, my team, and several others (@Bankst @shueja) continues to experience this issue in Glass and OutlineViewer 2024.1.1 on Windows 10.

PeterJohnson commented 9 months ago

Yes, this came back due to a late change. Reopening until that is fixed.

chauser commented 9 months ago

We are also seeing this. The behavior is a little different from last year in that the connect/disconnect oscillation occurs without ever displaying any date (except the connection status) whereas last year it would show a lot of data and then disconnect. Further when I left it running for awhile it eventually succeeded in connecting -- and both the glass and outline viewer succeeded within a second or two of each other.

chauser commented 9 months ago

I have a wireshark log that I can upload in an hour or so if that would be helpful.

On Tue, Jan 16, 2024 at 8:32 PM Peter Johnson @.***> wrote:

Yes, this came back due to a late change. Reopening until that is fixed.

— Reply to this email directly, view it on GitHub https://github.com/wpilibsuite/allwpilib/issues/5263#issuecomment-1894921813, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGLUU757HOPFJAQAVWMYPDYO5H6VAVCNFSM6AAAAAAW37JZYSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOJUHEZDCOBRGM . You are receiving this because you authored the thread.Message ID: @.***>

chauser commented 9 months ago

As promised - wireshark packet capture log. I haven't tried to analyze it since it appears the cause of the problem may already be known. If that's not true and you'd like me look at it in more detail let me know. GlassOscillatingConnection.pcapng.gz

chauser commented 9 months ago

A thought about the websocket keep-alives implemented by WS PING and PONG messages. If I'm understanding the comments on PR #5659 correctly, if a PONG response is not received within a timeout period after a PING is sent, the connection will be closed. Couldn't anything received on the websocket after the PING was sent be considered as adequate evidence that the connection is still alive? It seems to me that this would solve the problem of PINGs and PONGs getting queued behind a great deal of other data leading to the timeouts that we are observing (at least the wireshark capture seems to support that that is what is going on.) For actually-broken connections this should not cause significantly greater delay in detecting the broken connection but for still-working connections it should avoid detecting them as being broken.

PeterJohnson commented 9 months ago

Good idea; we would need to make it actually look for received bytes not frames (since individual frames may be quite long).

The current disconnect issues have something more fundamentally wrong happening (data corruption or incorrect frames being sent). I have a good way to reproduce it now, and am working on isolating the root cause.

chauser commented 9 months ago

What I'm seeing in the Wireshark log is a single Websocket frame of 127892 bytes made of 88 TCP segments. This is on a bandwidth-limited FRC radio (4mbit/s?). Early on, there is a dropped TCP packet that eventually gets resent by TCP Fast retransmission; receipt of that packet completes the Websocket frame. It pretty consistently takes 285ms to receive all those segments. This is all pretty much agrees with what I'd expect to see when sending that much data on a 4mbit/s link. It's not clear that just fragmenting at the websocket level would help. The kernel tcp stack would happily accept all of those fragments into its own buffers. And all that data would still be "ahead" of any ping or pong messages. But maybe the intention is to pace the sending of those segments?

On Thu, Jan 18, 2024 at 6:48 AM Peter Johnson @.***> wrote:

Good idea; we would need to make it actually look for received bytes not frames (since individual frames may be quite long).

The current disconnect issues have something more fundamentally wrong happening (data corruption or incorrect frames being sent). I have a good way to reproduce it now, and am working on isolating the root cause.

— Reply to this email directly, view it on GitHub https://github.com/wpilibsuite/allwpilib/issues/5263#issuecomment-1898621776, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGLUU7GKYW5OB2EIN2CLRDYPEY23AVCNFSM6AAAAAAW37JZYSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOJYGYZDCNZXGY . You are receiving this because you authored the thread.Message ID: @.***>

chauser commented 9 months ago

I just did another experiment, this time connected by USB. I still got the cycling behavior but it was different in two ways: first, the glass viewer would actually populate with NT data before the disconnection happened; and second, the wireshark logs do not look like a timeout is occurring. Now I wonder if there are 2 different causes of this behavior -- the corruption/incorrect frames that you are looking at Peter, and timeouts on low-bandwidth connections.

sciencewhiz commented 9 months ago

Is this fixed in 2024.2.1?

chauser commented 9 months ago

I have not seen it with 2024.2.1 either with our actual robot code or with Peter's stress test code, both of which would always trigger it.

ghost commented 9 months ago

We occasionally see this happen on all NT4 clients; shuffleboard, a custom dashboard, PhotonVision, etc.

chauser commented 9 months ago

Are you using clients built against WPILib 2024.2.1? Full benefit of the changes is only achieved if both the robot code and the clients use the new implementations.