wpilibsuite / allwpilib

Official Repository of WPILibJ and WPILibC
https://wpilib.org/

Network Tables client should survive network disruption? #5146

Closed truher closed 1 year ago

truher commented 1 year ago

Describe the bug: My (Python, Raspberry Pi) network tables clients enter a zombie state upon network interruption. They think they're still connected (isConnected yields true), but they're not (the NT server sees nothing from them). Is this expected?

To Reproduce: Minimal test case below; uncomment the two commented lines to "fix" it.

import time
from ntcore import NetworkTableInstance

inst = NetworkTableInstance.getDefault()
inst.startClient4("myclient")
inst.setServer("10.1.0.2")
pub = inst.getTable("mytable").getIntegerTopic("mytopic").publish()
counter = 0
while True:
    pub.set(counter)
    counter += 1
    if counter > 10:
        print("reconnecting...")
        counter = 0
        # inst.stopClient()
        # inst.startClient4("myclient")
    time.sleep(0.5)

Expected behavior: Clients should be aware of disconnection (isConnected should yield false), reconnect silently, or both.

Desktop: WPILibPi 2023.2.1 for the client; WPILib 2023.4.1 for the server, running in simulation on my laptop (Windows 11, Java 19).
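
One way to observe the symptom described above is to poll the connection state and log each transition. The following is a minimal sketch, not part of the original report: `ConnectionMonitor` is a hypothetical helper, and `is_connected` is assumed to be any zero-argument callable, for example `inst.isConnected` from the test case.

```python
from typing import Callable, Optional


class ConnectionMonitor:
    """Hypothetical helper: reports changes in a polled connection state."""

    def __init__(self, is_connected: Callable[[], bool]) -> None:
        self._is_connected = is_connected
        self._last: Optional[bool] = None  # unknown until the first poll

    def poll(self) -> Optional[str]:
        """Return 'connected' or 'disconnected' on a change, else None."""
        now = bool(self._is_connected())
        if now == self._last:
            return None
        self._last = now
        return "connected" if now else "disconnected"
```

In the zombie state reported here, `poll()` would never return `disconnected` after the cable is pulled, because isConnected keeps yielding true.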

PeterJohnson commented 1 year ago

Odd, a network interruption should indeed cause the client to automatically start reconnecting. How are you interrupting the network?

truher commented 1 year ago

just pulling the ethernet cable, or depowering the switch briefly.

PeterJohnson commented 1 year ago

Ok. There's likely to be some delay before it notices that kind of disconnect, but since you're writing data to it, it should only be a few seconds. If you're reconnecting the clients before their connections completely terminate, it should shortly resume from where it left off without ever disconnecting (the TCP layer should be trying to retransmit data).

truher commented 1 year ago

yeah, I tried various durations of disconnect between a second and maybe ten seconds. the way I noticed the issue was that the switch power would come loose, and it would take a while for me to notice. in the zombie state the python keeps doing everything else (video analysis in the real case), so it's not hanging: the publisher.set() call returns immediately, and the video stream itself also works fine.

PeterJohnson commented 1 year ago

Are value updates visible on the server side again after the reconnect? They should be.

truher commented 1 year ago

yes.

PeterJohnson commented 1 year ago

Ok. So what you're seeing is normal TCP behavior. It's designed to "operate through" this type of short-term disconnect. As we may want our connection to fail at least somewhat faster (<10 seconds), I'll look into options for detecting this and forcing a disconnect more quickly. E.g. on the client side, we do send a ping about once a second, so we could force-disconnect if we don't get a ping response from the server within a couple of seconds.
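
The ping-timeout idea can be sketched as follows. This is an illustration under stated assumptions, not the ntcore implementation: `PingWatchdog`, `on_pong`, and `expired` are hypothetical names, and the 2-second default is only one reading of "a couple of seconds".

```python
import time
from typing import Callable


class PingWatchdog:
    """Hypothetical sketch: flags a connection as dead when pongs stop arriving."""

    def __init__(self, timeout_s: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout_s = timeout_s
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_pong = clock()  # treat creation time as the first pong

    def on_pong(self) -> None:
        """Record that a ping response arrived from the server."""
        self._last_pong = self._clock()

    def expired(self) -> bool:
        """True if no ping response has arrived within the timeout."""
        return self._clock() - self._last_pong > self._timeout_s
```

A client loop would call `on_pong()` for each ping response and force a disconnect once `expired()` returns true. In the Python test case from the report, that force-disconnect would correspond to `stopClient()` followed by `startClient4("myclient")`.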

truher commented 1 year ago

? I think I didn't communicate well. without my "fix" it never recovers.

PeterJohnson commented 1 year ago

That doesn’t sound right at all. Either the TCP connection should recover and it keeps going, or it dies and it triggers a reconnect. I’ll try to reproduce.

truher commented 1 year ago

woo!