wpilibsuite / 2024Beta

Repository for Beta Testing of 2024 Software

2024 Beta RIO 1 Out-Of-Memory's after some deploys #39

Open CoryNessCTR opened 11 months ago

CoryNessCTR commented 11 months ago

Describe the bug

After a couple of Java project deploys on a roboRIO 1, the DS reports an out-of-memory error.

To Reproduce

Steps to reproduce the behavior:

  1. Format/power cycle roboRIO 1
  2. Create a new Timed Robot Skeleton Java Project
  3. Construct a Talon object with PWM channel 0
  4. Deploy project to roboRIO 1
  5. Increment channel
  6. Repeat steps 4-5
  7. Eventually (in fewer than 10 repeats), get the following error:
    OpenJDK Client VM warning: INFO: os::commit_memory(0xb0000000, 4194304, 0) failed; error='Not enough space' (errno=12)
    #
    # There is insufficient memory for the Java Runtime Environment to continue.
    # Native memory allocation (mmap) failed to map 4194304 bytes for committing reserved memory.
    # An error report file with more information is saved as:
    # /tmp/hs_err_pid7540.log

Expected behavior

The out-of-memory error does not occur.


Additional context

I collected memory information before and after each deploy, available as a zip below. Deploy 0 was collected immediately after power cycling the roboRIO; Deploy 5 was collected after the out-of-memory error occurred. MemoryIssues.zip

I've also attached the log file of the out of memory error: hs_err_pid7540.log

I've also repeated this experiment on the 2023_v3.2 image for a comparison, and stopped my testing after 30 consecutive deploys without issue. This appears to be a new or worsened issue for the 2024 libraries.
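
The before-and-after memory collection described above could be scripted roughly as follows. This is only a sketch, not the method used to produce MemoryIssues.zip: the function names `meminfo_kb` and `snapshot` and the log path are hypothetical, and it assumes the roboRIO shell provides `awk` and `date` and exposes `/proc/meminfo` like a typical Linux system.

```shell
#!/bin/sh
# Print one field (in kB) from a meminfo-style file.
# $1: field name such as MemFree; $2: file to read (defaults to /proc/meminfo).
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# Append a labelled MemFree reading to a log file.
# $1: label (e.g. deploy0); $2: log path; $3: optional meminfo file.
snapshot() {
    printf '%s %s MemFree_kB=%s\n' \
        "$(date '+%Y-%m-%dT%H:%M:%S')" "$1" "$(meminfo_kb MemFree "$3")" >> "$2"
}
```

Running `snapshot deploy0 /tmp/mem.log` over SSH right after a power cycle, then `snapshot deploy1 /tmp/mem.log` and so on after each deploy, would give one line per deploy that can be compared across images.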

EyalKeysar commented 9 months ago

We have also encountered this issue, and we found a way to solve it temporarily until NI releases an update. I want to clarify that this solution is not official.

From what we understand, the issue is caused by multiple leftover processes that each take a lot of memory. To see which processes are currently running on the roboRIO, first connect to it over SSH (https://docs.wpilib.org/en/stable/docs/software/roborio-info/roborio-ssh.html). Once connected, you can view the running processes with the "top" command (https://man7.org/linux/man-pages/man1/top.1.html). You will see that a few processes take more memory than the others. These are the processes that start when you deploy. Ideally they should not keep running after you deploy again, but they do, and after a few deploys this leads to the error.

To work around this, we kill these processes when we get the error. To find the specific processes to kill, we filter the output of "top" with "grep":

top | grep "JRE"

The output is every process that has "JRE" in its "top" entry. Note the PIDs (process IDs) of the listed processes, then kill each one with the "kill" command (https://man7.org/linux/man-pages/man1/kill.1.html). For example, if the PID is 2230:

kill -9 2230

Run this command for every PID returned by the filtered top command (top | grep "JRE"). This should solve the problem.

[Screenshot of the filtered top output; in this example the PID is 4962.]
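
As a non-interactive alternative to scanning "top" by eye, the per-process memory figures can be read directly from `/proc`. The sketch below is a swapped-in technique, not what the comment above used: `rss_by_process` is a hypothetical helper, and it assumes the roboRIO exposes the usual Linux `/proc/<pid>/status` files with a `VmRSS` line.

```shell
#!/bin/sh
# List processes by resident memory (VmRSS), largest first.
# Output columns: RSS in kB, PID, process name (first word only).
# $1: proc directory to scan (defaults to /proc).
rss_by_process() {
    proc_root="${1:-/proc}"
    for status in "$proc_root"/[0-9]*/status; do
        # Processes can exit while we iterate; skip unreadable entries.
        [ -r "$status" ] || continue
        # Kernel threads have no VmRSS line and are left out.
        awk '/^Name:/  { name = $2 }
             /^Pid:/   { pid = $2 }
             /^VmRSS:/ { rss = $2 }
             END { if (rss != "") print rss, pid, name }' "$status"
    done | sort -rn
}
```

For example, `rss_by_process | head -n 5` would show the five largest processes, which should make any leftover robot-program processes easy to spot before deciding what to kill.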

calcmogul commented 9 months ago

You could use this instead to force-kill all processes with JRE in their name:

pgrep JRE | xargs kill -9

The following may work for remote kill, but I haven't tested whether ssh allows embedding pipes like that.

ssh admin@10.te.am.2 'pgrep JRE | xargs kill -9'
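
On the open question about pipes: ssh passes the quoted string to the remote shell as a single command, so the pipeline should run entirely on the roboRIO. The sketch below wraps the command above and fills in the 10.TE.AM.2 address from a team number. `team_ip` and `remote_kill_jre` are hypothetical helper names, the team number is assumed to have at most four digits and no leading zeros, and this has not been verified on a roboRIO.

```shell
#!/bin/sh
# Convert a team number into the 10.TE.AM.2 address pattern used above.
# $1: team number (1 to 4 digits, no leading zeros).
team_ip() {
    team="$1"
    printf '10.%d.%d.2\n' "$((team / 100))" "$((team % 100))"
}

# Run the kill pipeline from this thread on the roboRIO.
# The single quotes stop the local shell from interpreting the pipe,
# so both pgrep and xargs execute on the remote side.
remote_kill_jre() {
    ssh "admin@$(team_ip "$1")" 'pgrep JRE | xargs kill -9'
}
```

With these defined, `remote_kill_jre 7525` would be equivalent to typing `ssh admin@10.75.25.2 'pgrep JRE | xargs kill -9'`.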

EyalKeysar commented 9 months ago

When connected to the roboRIO over USB, the IP address to SSH to is 172.22.11.2 (https://docs.wpilib.org/he/stable/docs/software/roborio-info/roborio-ssh.html).

aaronleetw commented 9 months ago

We are also getting this issue. I'll test the remote kill ssh command.

Crossle86 commented 9 months ago

The problem in #40 is likely related, but the symptoms are not quite the same. Some observations: killing the JRE process just causes another to be started in its place. From what I can see, that restart sometimes helps and sometimes does not.

Download of code always seems to succeed; the problem appears to be in the startup of the code. Sometimes it starts OK. Most other times it starts with trash in the riolog, with an incomplete riolog, or with an apparently good startup that then logs lots of errors from CAN devices. Powering off and then on works.

aaronleetw commented 9 months ago

I'm not sure if it is related, but the "Restart Robot Code" option also does not work regardless of its state.

sciencewhiz commented 8 months ago

Does this still occur with the WPILib beta 4?

Crossle86 commented 8 months ago

Have not had a chance to test B4 yet. Not sure when I can do it now that xmas is here. Will try sometime next week.

stephenjust commented 8 months ago

I'm still reproducing this on the Kickoff release

aaronleetw commented 8 months ago

I am still reproducing this issue, albeit much less often, in the kickoff release. After three days of testing, it failed one time.

Crossle86 commented 7 months ago

Our team has not seen any problems with deployment since kickoff release.

JaiCode08 commented 6 months ago

Hello. This issue is still occurring for me. I'm not doing any heavy logging or heavy computing on the roboRIO. The memory leaks happened occasionally in WPILib 2024.2.1 but have gotten worse with 2024.3.1. The roboRIO is on the latest firmware.

nkalupahana commented 6 months ago

We're also having this issue whenever we add any sort of logging to our code: https://github.com/FRC-7525/2024-Robot

Crossle86 commented 6 months ago

An update on this for our team. We stopped having the fail-to-deploy issue and things seemed normal until we started loading Autos created with PathPlanner. With only a couple of Autos we started having out-of-memory errors, to the point that we bit the bullet and took a RIO2 out of last year's robot, and that solved the out-of-memory issue. I was being cheap trying to use a RIO1 for this year's robot.