Option to accept node respawns

MCFurry commented 3 years ago

We really like using the rosmon tool and also the generated diagnostics are very useful to assure all nodes are running in a system.

However, on rare occasions we have nodes that are expected to restart in a normal workflow. (For example remote_rosbag_record). When this happens though, diagnostics are stuck in a warning state since number of restarts is hard coded compared to 0.

Would you welcome a PR to make the number of restarts before issuing a warning parametrized? Or do you have another suggestion?

xqms commented 3 years ago

Hey, thanks for the positive feedback :)

Yes, I'd welcome a PR :+1: Do you see a use case where a specific number of restarts is OK? Or should we just introduce a boolean attribute rosmon-expect-restarts=true/false which completely disables the diagnostic warning?

xqms commented 3 years ago

Alternative name (with inverted semantics): rosmon-warn-on-crash=true/false

MCFurry commented 3 years ago

Alright, I'll be working on that soon!

I was thinking about a number since a big number of restarts still might indicate issues. Then a value of -1 could mean 'infinite restarts allowed', hence disabling the check?

xqms commented 3 years ago

Then a value of -1 could mean 'infinite restarts allowed', hence disabling the check?

Sounds good to me. To be clear, I assume we are talking about a new attribute on the <node> tag, like

<node name="restarter" pkg="restarter" type="restarter" rosmon-crash-warn-threshold="-1">
</node>

Then this limit can be configured per node.

Timple commented 3 years ago

I would propose rosmon-restart-warn-threshold or something similar. The rosbag node shuts down gracefully by design. So crash sounds a bit heavy.

MCFurry commented 3 years ago

Then a value of -1 could mean 'infinite restarts allowed', hence disabling the check?

Sounds good to me. To be clear, I assume we are talking about a new attribute on the <node> tag, like
<node name="restarter" pkg="restarter" type="restarter" rosmon-crash-warn-threshold="-1">
</node>
Then this limit can be configured per node.

Configuration per node sounds nice! And I agree with Timple that the restarts don't necessarily mean a crash.

xqms commented 3 years ago

Well, from a process monitoring perspective the rosbag node does not shut down gracefully - it exits with a non-zero status code, which usually indicates some sort of failure. If it were to exit with status 0 (everything ok), rosmon would not restart it.

So in the end I'd say that remote_rosbag_record uses the respawn mechanism in a way that was not intended. That doesn't necessarily mean it's a bad thing, though ;)

I'm fine with calling it rosmon-respawn-warn-threshold (respawn is more consistent with the roslaunch XML attributes, see https://wiki.ros.org/roslaunch/XML/node).

xqms commented 3 years ago

Oh, sorry, I was completely wrong. The logic even respawns nodes that exited with status 0. Forget I said anything ;)

Timple commented 3 years ago

Ow dear... There goes a whole summary of testing to show it had exit status 0 :grin:

xqms commented 3 years ago

Well, that teaches me to look at the code before making bold statements :sweat_smile:

xqms / rosmon

Option to accept node respawns #145