wpilibsuite / allwpilib

Official Repository of WPILibJ and WPILibC
https://wpilib.org/

[ntcore] Segfault on RIO NT4 server #5004

Closed jwbonner closed 1 year ago

jwbonner commented 1 year ago

We observed the NT server on the RIO crash with a segfault (WPILib 2023.2.1). This has happened only once, and we haven't been able to reproduce it. It may be related to clients connecting and disconnecting. The crash log is below, and the full DSLOG is attached. We'll continue to monitor and post more information if it happens again.

2023_01_24 18_58_49 Tue.zip

NT: Got a NT4 connection from 10.63.28.143 port 51555 
NT: CONNECTED NT4 client 'outlineviewer' (from 10.63.28.143:51555) 
NT: Got a NT4 connection from 10.63.28.201 port 38036 
NT: CONNECTED NT4 client 'northstar@1' (from 10.63.28.201:38036) 
NT: DISCONNECTED NT4 client 'northstar@1' (from 10.63.28.201:49198): stream error: ECONNRESET 
NT: NT4 socket error: socket is not connected 
NT: NT4 socket error: connection reset by peer 
# 
# A fatal error has been detected by the Java Runtime Environment: 
# 
#  SIGSEGV (0xb) at pc=0xaa32a6e8, pid=2332, tid=2364 
# 
# JRE version: OpenJDK Runtime Environment (17.0.3.7) (build 17.0.3.7-frc+0-2023-17.0.5u7-1) 
# Java VM: OpenJDK Client VM (17.0.3.7-frc+0-2023-17.0.5u7-1, mixed mode, emulated-client, g1 gc, linux-arm) 
# Problematic frame: 
# C  [libntcore.so+0x9f6e8]  (anonymous namespace)::SImpl::SetValue((anonymous namespace)::ClientData*, (anonymous namespace)::TopicData*, nt::Value const&)+0x194 
# 
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again 
# 
# An error report file with more information is saved as: 
# /tmp/hs_err_pid2332.log 
*** Error in `/usr/local/frc/JRE/bin/java': corrupted double-linked list: 0xad9ebf48 *** 

PeterJohnson commented 1 year ago

Thanks, I'll do some code review of this area to see if I can find anything.

PeterJohnson commented 1 year ago

I'm not seeing anything obvious; everything should be getting cleaned up on a client disconnect. Combined with the "corrupted double-linked list" message, however, my suspicion is that a use-after-free is corrupting memory. Unfortunately, this could be occurring anywhere in the C++ level of wpilib (or vendor libraries), not just ntcore. Out of curiosity, what vendor libraries are you using?
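
For readers unfamiliar with this class of bug, here is a minimal sketch of how a disconnect can leave a stale reference behind, and one way to detect it rather than dereference freed memory. All of the types and functions here (`ClientData`, `TopicData`, `Server`, `SetValue`) are hypothetical stand-ins that only borrow names from the crash frame; this is not the actual ntcore code, and it is not a claim about where the bug is.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-ins; NOT the real ntcore types.
struct ClientData {
  std::string name;
};

struct TopicData {
  // If this were a raw ClientData*, it would dangle once the client is
  // erased, and a later dereference would be a use-after-free.
  // A weak_ptr lets the topic observe that its publisher is gone.
  std::weak_ptr<ClientData> lastPublisher;
  int value = 0;
};

struct Server {
  // The server owns the clients; erasing an entry frees the ClientData
  // as long as nobody else holds a shared_ptr to it.
  std::map<std::string, std::shared_ptr<ClientData>> clients;

  void Connect(const std::string& name) {
    clients[name] = std::make_shared<ClientData>(ClientData{name});
  }

  void Disconnect(const std::string& name) {
    std::size_t erased = clients.erase(name);
    assert(erased <= 1);  // std::map keys are unique
  }
};

// Stores the value only if the publishing client still exists.
bool SetValue(TopicData& topic, int value) {
  std::shared_ptr<ClientData> client = topic.lastPublisher.lock();
  if (!client) {
    return false;  // publisher already disconnected; do not touch it
  }
  topic.value = value;
  return true;
}
```

In this sketch, a `SetValue` call that arrives after `Disconnect` is rejected instead of reading freed memory. With a raw pointer in `TopicData`, the same sequence would be undefined behavior and could show up much later as heap corruption somewhere unrelated.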

jwbonner commented 1 year ago

We have Phoenix 5.30.3, REVLib 2023.1.2, and AdvantageKit. I'll check if there's anything obvious in the native AdvantageKit component too.