wpilibsuite / 2024Beta

Repository for Beta Testing of 2024 Software
32 stars 5 forks source link

Periodic high CPU usage on roboRIO #32

Closed bhall-ctre closed 8 months ago

bhall-ctre commented 9 months ago

As part of our periodic system testing, we found that the CPU usage of the roboRIO seems to periodically increase to very high levels with an empty TimedRobot project, shown below.

image

It takes 72 minutes for the issue to initially reproduce. Once it does, these cycles of increased CPU usage appear to occur with a period of around 72 minutes (36 minutes high, 36 minutes low). Our testing was performed with the roboRIO2_2024_v1.1 image and WPILib project information:

Project Version: 2024.1.1-beta-3
VS Code Version: 1.83.1
WPILib Extension Version: 2024.1.1-beta-3
C++ Extension Version: 1.17.5
Java Extension Version: 1.23.0
Java Debug Extension Version: 0.52.0
Java Dependencies Extension Version 0.23.0
Java Version: 17
Java Location: C:\Users\Public\wpilib\2024\jdk
Vendor Libraries:
    WPILib-New-Commands (1.0.0)

I've confirmed using top that the spike in CPU usage primarily comes from the robot program (/home/lvuser/frcUserProgram), although interestingly irq/61-atomiczy also seems to be elevated.

Normal CPU usage: image

High CPU usage: image

During high CPU usage, code deploy often fails. Restarting robot code resolves the period of high CPU usage.

calcmogul commented 9 months ago

Maybe the HAL notifier that triggers TimedRobot's periodic functions is tightlooping? The FPGA clock handling changed in https://github.com/wpilibsuite/allwpilib/pull/5499, so it's possible rollover broke.

bhall-ctre commented 9 months ago

I can put together a simple C++ program that logs the measured time delta of robot periodic to test that.

bhall-ctre commented 9 months ago

it's possible rollover broke

My original test for this had some interesting results. I used DataLog to log the measured time delta, forgetting that it uses the FPGA clock for timestamps. My wpilog ended up having a hard cutoff at around 72 minutes (when the issue started to reproduce). That does imply that either robotPeriodic wasn't being called at all after the issue started reproducing, or the FPGA timestamp rollover isn't being handled correctly (and my log actually just jumped to some other timestamp).

Once I'm back in the office after Thanksgiving, I can redo the test with our .hoot logger to get more conclusive results, since it uses a different monotonic timebase.

PeterJohnson commented 9 months ago

It appears the new timestamp function isn't getting the high part of the timestamp, so it's rolling over to zero instead of continuing to increment. It appears the FPGA may not be copying the high part of the timestamp to the memory buffer we grab the timestamp from.

bhall-ctre commented 9 months ago

Yep, that's what I now see in my logs as well. "FPGA Upper" is the value reported by global->readLocalTimeUpper(&status) from the HAL layer, and "FPGA Time" is the value reported by frc::RobotController::GetFPGATime(). Even though readLocalTimeUpper properly reports the upper portion of the timestamp, it looks like the HMB is not reporting it.

image

I also separately proved that all of the periodic functions stop running completely once the FPGA timestamp rolls over, so I don't know what's actually causing the heightened CPU usage. Regardless, fixing the FPGA timestamp should resolve this issue.

PeterJohnson commented 9 months ago

It’s almost certainly something in libuv (used in ntcore)—we use the FPGA timestamp for the libuv timers.

ThadHouse commented 8 months ago

This should be fixed by Beta 4 when it comes out.

bhall-ctre commented 8 months ago

I have confirmed that this issue no longer appears to reproduce with the 2024_v2.0 roboRIO image and WPILib beta-4.