xqms / rosmon

ROS node launcher & monitoring daemon

Something about bad addresses, signal 6, and non-empty directories #115

Closed mjsobrep closed 4 years ago

mjsobrep commented 4 years ago

I have an occasional problem with rosmon (which is otherwise incredible, thank you): on launch, roughly half of my nodes crash. It seems to be caused by an interaction with rosmon. Sometimes restarting the system solves it, sometimes it does not. My guess is that rosmon creates directories to keep track of something and either does not clean them up properly or does not tolerate messy shutdowns. Any thoughts on how to fix this?

Snippet of the printout after a failed launch:

         image_throttled: Could not execute /opt/ros/kinetic/lib/topic_tools/throttle messages image_raw 15 image_throttled __name:=image_throttled : Bad address
         image_throttled: image_throttled died from signal 6
         image_throttled: image_throttled left a core dump
         image_throttled: Determined pattern '/tmp/rosmon-node-zKMbN7/core'
         image_throttled: Found core file '/tmp/rosmon-node-zKMbN7/core'
         image_throttled: Could not remove process working directory '/tmp/rosmon-node-zKMbN7' after process exit: Directory not empty
             mobile_base: [loadNodelet]: Loading nodelet /mobile_base of type kobuki_node/KobukiNodelet to manager mobile_base_nodelet_manager with the following remappings:
             mobile_base: [loadNodelet]: /mobile_base/joint_states -> /joint_states
             mobile_base: [loadNodelet]: /mobile_base/odom -> /odom
             mobile_base: [service::exists]: waitForService: Service [/mobile_base_nodelet_manager/load_nodelet] has not been advertised, waiting...
       realsense2_camera: Could not execute /opt/ros/kinetic/lib/nodelet/nodelet load realsense2_camera/RealSenseNodeFactory realsense2_camera_manager __name:=realsense2_camera : Bad address
    realsense2_camera_manager: Could not execute /opt/ros/kinetic/lib/nodelet/nodelet manager __name:=realsense2_camera_manager : Bad address
       realsense2_camera: realsense2_camera died from signal 6
       realsense2_camera: realsense2_camera left a core dump
       realsense2_camera: Determined pattern '/tmp/rosmon-node-DiS9mQ/core'
       realsense2_camera: Found core file '/tmp/rosmon-node-DiS9mQ/core'
       realsense2_camera: Could not remove process working directory '/tmp/rosmon-node-DiS9mQ' after process exit: Directory not empty
    realsense2_camera_manager: realsense2_camera_manager died from signal 6
    realsense2_camera_manager: realsense2_camera_manager left a core dump
    realsense2_camera_manager: Determined pattern '/tmp/rosmon-node-j6bTbd/core'
    realsense2_camera_manager: Found core file '/tmp/rosmon-node-j6bTbd/core'
    realsense2_camera_manager: Could not remove process working directory '/tmp/rosmon-node-j6bTbd' after process exit: Directory not empty
              polly_node: [start]: polly running: rosrpc://192.168.1.25:42125
      keyop_vel_smoother: [loadNodelet]: Loading nodelet /keyop_vel_smoother of type yocs_velocity_smoother/VelocitySmootherNodelet to manager mobile_base_nodelet_manager with the following remappings:
      keyop_vel_smoother: [loadNodelet]: /keyop_vel_smoother/odometry -> /odom

xqms commented 4 years ago

Uhh, that looks like a memory bug, since we get EFAULT ("Bad address") from execvp()... Could you please specify which ROS + rosmon version you are using? Repository packages or built from source?

I recall that we fixed a nasty bug in #103, maybe this is related.
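
For readers decoding the log: "Bad address" is the message for the errno value `EFAULT`, and on Linux signal 6 is `SIGABRT`. A minimal Python sketch of that lookup follows. The helper names `describe_errno` and `describe_signal` are made up for illustration and are not part of rosmon.

```python
import errno
import os
import signal


def describe_errno(code: int) -> str:
    """Return 'NAME (message)' for an errno value such as the one execvp() sets."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({os.strerror(code)})"


def describe_signal(number: int) -> str:
    """Return the symbolic name for a signal number reported for a dead child."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # Not a signal number known on this platform.
        return f"unknown signal {number}"


# The two values that appear in the log above.
print(describe_errno(errno.EFAULT))
print(describe_signal(6))
```

This only translates the numbers into names. It does not explain why the launcher passed a bad pointer to `execvp()`, which is the actual bug discussed here.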

mjsobrep commented 4 years ago

This is on Kinetic coming from repository:

    Package: ros-kinetic-rosmon
    Status: install ok installed
    Priority: extra
    Section: misc
    Installed-Size: 29
    Maintainer: Max Schwarz <max.schwarz@uni-bonn.de>
    Architecture: amd64
    Version: 2.1.1-1xenial-20200229-053707+0000
    Depends: ros-kinetic-rosmon-core, ros-kinetic-rqt-rosmon

mjsobrep commented 4 years ago

A bit more information: all of the nodes that are having trouble seem to be either capturing from cameras or in the same namespace as a camera-capturing node. Other nodes using USB devices seem OK. This system is often shut down messily (the OS is shut down without regard for what is running).
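
To check whether working directories really pile up across launches, something like the following sketch could list them. It assumes the directories follow the `/tmp/rosmon-node-*` pattern seen in the log and that core files have names starting with `core`. The function name `find_leftover_dirs` is hypothetical, and the script only lists, it does not delete anything.

```python
import glob
import os


def find_leftover_dirs(tmp_root="/tmp", prefix="rosmon-node-"):
    """Return (directory, [core file names]) pairs for leftover working directories.

    The prefix is an assumption taken from the paths in the log output above.
    """
    results = []
    for path in sorted(glob.glob(os.path.join(tmp_root, prefix + "*"))):
        if not os.path.isdir(path):
            continue
        try:
            entries = os.listdir(path)
        except OSError:
            # Directory vanished or is not readable by this user; skip it.
            continue
        cores = sorted(name for name in entries if name.startswith("core"))
        results.append((path, cores))
    return results


for directory, cores in find_leftover_dirs():
    print(directory, cores if cores else "(no core file)")
```

In the log above, the "Directory not empty" message appears right after a core file is found in the same directory, so the leftover directories look like a consequence of the crash rather than its cause.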

mjsobrep commented 4 years ago

Looks the same as #103, which was fixed in v2.2.1. On Kinetic, 2.1.1-1 is the released version. Any chance we can get a bump in the released version, or should I just build from source?
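
A quick way to check whether an installed package already carries the fix is to compare the leading numeric part of the `Version:` field against 2.2.1. The sketch below is a simplification and not a full Debian version comparison: it looks only at the dotted number at the start, after an optional epoch. The names `upstream_version` and `has_fix` are made up for illustration.

```python
import re

FIXED_IN = (2, 2, 1)  # release that contains the fix, according to this thread


def upstream_version(package_version: str) -> tuple:
    """Extract the leading dotted number, e.g. '2.1.1-1xenial-...' -> (2, 1, 1)."""
    match = re.match(r"(?:\d+:)?(\d+(?:\.\d+)*)", package_version)
    if match is None:
        raise ValueError(f"cannot parse version: {package_version!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def has_fix(package_version: str) -> bool:
    """True if the package is at least the release that contains the fix."""
    return upstream_version(package_version) >= FIXED_IN


# Version string from the dpkg output earlier in the thread.
print(has_fix("2.1.1-1xenial-20200229-053707+0000"))  # -> False
```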

xqms commented 4 years ago

Yes, sure. I always delay the Kinetic release a bit to test things on Melodic first, but this time I simply forgot to do the Kinetic update.

I triggered a release right now. The rosdistro PR is here: https://github.com/ros/rosdistro/pull/24555

mjsobrep commented 4 years ago

Awesome, thanks

xqms commented 4 years ago

The packages have built, but are not pushed into the main repositories yet. If you like & have time, you could already test them:

xqms commented 4 years ago

The package sync is out => rosmon 2.2.1 including the fix for this issue can be installed from the repositories. I'll close this now, feel free to reopen if the issue still persists :)