Closed MCFurry closed 3 years ago
Hey, thanks for the positive feedback :)
Yes, I'd welcome a PR :+1:
Do you see a use case where a specific number of restarts is OK? Or should we just introduce a boolean attribute rosmon-expect-restarts=true/false
which completely disables the diagnostic warning?
Alternative name (with inverted semantics): rosmon-warn-on-crash=true/false
Alright, I'll be working on that soon!
I was thinking about a number since a big number of restarts still might indicate issues. Then a value of -1 could mean 'infinite restarts allowed', hence disabling the check?
Then a value of -1 could mean 'infinite restarts allowed', hence disabling the check?
Sounds good to me. To be clear, I assume we are talking about a new attribute on the <node>
tag, like
<node name="restarter" pkg="restarter" type="restarter" rosmon-crash-warn-threshold="-1">
</node>
Then this limit can be configured per node.
I would propose rosmon-restart-warn-threshold
or something similar. The rosbag node shuts down gracefully by design. So crash sounds a bit heavy.
Then a value of -1 could mean 'infinite restarts allowed', hence disabling the check?
Sounds good to me. To be clear, I assume we are talking about a new attribute on the
<node>
tag, like<node name="restarter" pkg="restarter" type="restarter" rosmon-crash-warn-threshold="-1"> </node>
Then this limit can be configured per node.
Configuration per node sounds nice! And I agree with Timple that the restarts don't necessarily mean a crash.
Well, from a process monitoring perspective the rosbag node does not shut down gracefully - it exits with a non-zero status code, which usually indicates some sort of failure. If it were to exit with status 0 (everything ok), rosmon would not restart it.
So in the end I'd say that remote_rosbag_record
uses the respawn mechanism in a way that was not intended. That doesn't necessarily mean it's a bad thing, though ;)
I'm fine with calling it rosmon-respawn-warn-threshold
(respawn is more consistent with the roslaunch XML attributes, see https://wiki.ros.org/roslaunch/XML/node).
Oh, sorry, I was completely wrong. The logic even respawns nodes that exited with status 0. Forget I said anything ;)
Ow dear... There goes a whole summary of testing to show it had exit status 0 :grin:
Well, that teaches me to look at the code before making bold statements :sweat_smile:
We really like using the rosmon tool and also the generated diagnostics are very useful to assure all nodes are running in a system.
However, on rare occasions we have nodes that are expected to restart in a normal workflow. (For example remote_rosbag_record). When this happens though, diagnostics are stuck in a warning state since number of restarts is hard coded compared to 0.
Would you welcome a PR to make the number of restarts before issuing a warning parametrized? Or do you have another suggestion?