wpilibsuite / allwpilib

Official Repository of WPILibJ and WPILibC
https://wpilib.org/

NetworkTables server crashes when publishers are rapidly created and closed #6734

Open brettle opened 3 weeks ago

brettle commented 3 weeks ago

I'm seeing fairly repeatable crashes with ServerImpl.cpp:1859 in the stack trace when running code that rapidly creates and removes (i.e. closes) publishers. Here is the code at and immediately before that line:

https://github.com/wpilibsuite/allwpilib/blob/3d6b710293d1e78f2d43834068cf610f53440f0f/ntcore/src/main/native/cpp/net/ServerImpl.cpp#L1852-L1860

My theory is that another thread is removing one or more publishers after count is computed and as a result the mpack array is undersized. Is there something in place that is supposed to prevent that from occurring?

PeterJohnson commented 3 weeks ago

The server is single threaded (it’s an event loop), so that can’t be the cause. Something else must be going on. Probably something isn’t being tracked right in the data structure maintenance, resulting in a null pointer deref or use after free at this code, and the actual cause is elsewhere. How rapidly are we talking about? Is this during unit testing? We of course want to fix the crash, but in general, creating and closing publishers should be done rarely.

brettle commented 3 weeks ago

Thanks for the rapid response. I'll continue to investigate and see if I can create a small reproducible test case.

For context, we have an NT client that reports kinematics info for objects in a simulation that is controlled by a JUnit5 test acting as the NT server. So this is a system test, not a unit test. The server/test expresses interest in an object by publishing an empty (not null) double[] to a topic named for the object. The simulation/client listens for such a publication and responds by publishing the relevant info to the same topic. At the moment, both publications are created and closed each timestep, and the timestep is currently effectively limited to 5 ms by the hardcoded NT limits. I realize that isn't ideal and intend to improve it, but I was expecting performance issues, not crashes, to be the result. Also, the topic is published as noncached, and for all pubs and subs I'm using PubSubOption.periodic(Double.MIN_VALUE), PubSubOption.keepDuplicates(true), and PubSubOption.sendAll(true).
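
To make the pattern concrete, here is a rough Java sketch of the server/test side of that handshake. This is not the actual test code: the class name, the topic name ("/objects/crate"), and the inline setup are made up for illustration, and the noncached topic property is omitted.

```java
import edu.wpi.first.networktables.DoubleArrayPublisher;
import edu.wpi.first.networktables.DoubleArrayTopic;
import edu.wpi.first.networktables.NetworkTableInstance;
import edu.wpi.first.networktables.PubSubOption;

public class InterestHandshakeSketch {
  public static void main(String[] args) {
    // Server/test side: runs the NT server and expresses interest in an
    // object by publishing an empty double[] to a topic named for it.
    NetworkTableInstance server = NetworkTableInstance.getDefault();
    server.startServer();

    DoubleArrayTopic topic = server.getDoubleArrayTopic("/objects/crate");
    // (the real test also marks the topic noncached; omitted here)
    try (DoubleArrayPublisher pub =
        topic.publish(
            PubSubOption.periodic(Double.MIN_VALUE),
            PubSubOption.keepDuplicates(true),
            PubSubOption.sendAll(true))) {
      pub.set(new double[0]); // empty (not null) array signals interest
    } // publisher is closed again at the end of the timestep

    // The simulation/client mirrors this: it subscribes to the same topic and,
    // when it sees the empty array, publishes the kinematics data back using
    // the same options, also creating and closing its publisher each timestep.
  }
}
```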

brettle commented 3 weeks ago

The crashes also seem to be associated with the warning emitted by this line:

https://github.com/wpilibsuite/allwpilib/blob/3d6b710293d1e78f2d43834068cf610f53440f0f/ntcore/src/main/native/cpp/net/ServerImpl.cpp#L212

I'm assuming that occurs because the unpublish is delayed in reaching the server (due to the limited wire flush rate) until after a subsequent publish has been made. I can reproduce the warnings in a simple test case. Let me know if that is of interest.
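
In outline, the reproduction is just a tight create/set/close loop against a local server. Here is a rough sketch (not the linked test case): the topic name, client identity, iteration count, and teardown are made up, and it assumes the server and client run in the same process.

```java
import edu.wpi.first.networktables.DoubleArrayPublisher;
import edu.wpi.first.networktables.DoubleArrayTopic;
import edu.wpi.first.networktables.NetworkTableInstance;
import edu.wpi.first.networktables.PubSubOption;

public class RapidPublishCloseSketch {
  public static void main(String[] args) throws InterruptedException {
    NetworkTableInstance server = NetworkTableInstance.create();
    server.startServer();

    NetworkTableInstance client = NetworkTableInstance.create();
    client.startClient4("rapid-pub-close");
    client.setServer("localhost");

    DoubleArrayTopic topic = client.getDoubleArrayTopic("/objects/crate");
    for (int i = 0; i < 10_000; i++) {
      // Create, set, and immediately close a publisher on the same topic.
      // Because the wire flush rate is limited, the unpublish from iteration i
      // can still be queued on the client when the publish for iteration i+1
      // is generated.
      try (DoubleArrayPublisher pub =
          topic.publish(
              PubSubOption.periodic(Double.MIN_VALUE),
              PubSubOption.keepDuplicates(true),
              PubSubOption.sendAll(true))) {
        pub.set(new double[] {i});
      }
    }

    Thread.sleep(1000); // let the last messages flush before tearing down
    client.stopClient();
    server.stopServer();
  }
}
```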

PeterJohnson commented 3 weeks ago

All of that should be done FIFO, so the unpublish should be processed first, then the publish. A test case for that would be great.

brettle commented 3 weeks ago

A test case that produces the warnings can be found here.

brettle commented 2 weeks ago

I have a theory about the cause of this bug and would like to know if I'm barking up the wrong tree.

In ClientImpl::HandleLocal():

https://github.com/wpilibsuite/allwpilib/blob/e2893fc1a36720b9c3986f2fc6c9607cea35c8fd/ntcore/src/main/native/cpp/net/ClientImpl.cpp#L121-L122

Unpublish() calls:

https://github.com/wpilibsuite/allwpilib/blob/e2893fc1a36720b9c3986f2fc6c9607cea35c8fd/ntcore/src/main/native/cpp/net/ClientImpl.cpp#L220

which "erases" the handle from m_handleMap. Then when HandleLocal() calls m_outgoing.SendMessage() it tries to retrieve the handle info from the map here:

https://github.com/wpilibsuite/allwpilib/blob/e2893fc1a36720b9c3986f2fc6c9607cea35c8fd/ntcore/src/main/native/cpp/net/NetworkOutgoingQueue.h#L80-L81

Since the handle is no longer in the map, it adds it with default handle info, specifically a queueIndex of 0, which refers to the default 100 ms queue. If the publisher was using that default period, SendMessage() will add the unpublish message to the correct queue and everything will be fine. But if the publisher is using a non-default period, SendMessage() will add the unpublish message to the 100 ms queue even though the publish and value messages may be in a different queue. This can result in duplicate publications if the pubid is reused by the client for a new publication and the associated publish message is sent before the unpublish message still sitting in the default queue. Make sense?
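
If it helps, here is a toy Java model of the ordering hazard I'm describing. None of this is ntcore code: the class, field, and method names are made up, and the two queues simply stand in for the default 100 ms queue and a faster periodic queue.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Toy model: per-handle info lives in a map, and each outgoing message is
// routed to a flush queue chosen by that info's queueIndex.
public class QueueIndexHazard {
  record HandleInfo(int queueIndex) {}

  static final int DEFAULT_QUEUE = 0; // stands in for the 100 ms queue
  static final int FAST_QUEUE = 1;    // stands in for a faster periodic queue

  final Map<Integer, HandleInfo> handleMap = new HashMap<>();
  final List<Queue<String>> queues =
      List.of(new ArrayDeque<String>(), new ArrayDeque<String>());

  void publish(int pubHandle) {
    handleMap.put(pubHandle, new HandleInfo(FAST_QUEUE)); // non-default period requested
    send(pubHandle, "publish");
  }

  void unpublish(int pubHandle) {
    handleMap.remove(pubHandle); // entry erased *before* the message is queued...
    send(pubHandle, "unpublish");
  }

  void send(int pubHandle, String msg) {
    // ...so this lookup falls back to default info (queueIndex 0, the 100 ms queue)
    HandleInfo info = handleMap.getOrDefault(pubHandle, new HandleInfo(DEFAULT_QUEUE));
    queues.get(info.queueIndex()).add(msg + "(handle=" + pubHandle + ")");
  }

  public static void main(String[] args) {
    QueueIndexHazard m = new QueueIndexHazard();
    m.publish(5);   // lands in the fast queue
    m.unpublish(5); // lands in the 100 ms queue instead
    m.publish(5);   // handle reused: lands in the fast queue again
    // If the fast queue flushes first, the server sees publish, publish,
    // unpublish: a duplicate publish, and then the publisher vanishing.
    System.out.println("fast queue:    " + m.queues.get(FAST_QUEUE));
    System.out.println("default queue: " + m.queues.get(DEFAULT_QUEUE));
  }
}
```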

Creating a simple test case that is guaranteed to reproduce the behavior using the API doesn't seem feasible, since it depends on the exact timing of the client thread, which is not exposed through the API.