Fix issue where message is assumed to contain correct length byte (leading to process exit)

mnmr commented 1 year ago

Fixes #1049

The original code only verified the length after calling GetString, but never gets there if the subsequent call to encoding.GetString fails with an ArgumentOutOfRangeException. This change moves the verification to before the call to GetString.

The second commit modifies the Proactor.Loop method so that it does not trigger a process exit due to an unhandled exception when an error occurs. To this end, it no longer throws an ArgumentOutOfRangeException if an invalid completion status is encountered (incidentally, this is the same error thrown by the GetString method for the earlier bug), and it catches all exceptions to allow the loop to continue in case of a failure.

drewnoakes commented 1 year ago

The length check looks good, though adding a unit test for the IsCommand method seems like a good idea.

I'm less sure about the exception handling change. With this, failures go unreported and potentially undetected. Also, changing from catch (TerminatingException) to catching all exceptions also feels like a bad idea, at least by default. What is the actual scenario you're concerned about here?

Ideally this would be two PRs so we could discuss each change separately. They seem unrelated.

mnmr commented 1 year ago

The length check looks good, though adding a unit test for the IsCommand method seems like a good idea.

I have created some unit tests to validate the changes made. I started out by making tests to certify the old behavior, where NetMQ would throw an exception if the length byte in a command message was invalid. I then proceeded to verify that the new code works also in this scenario, without throwing an exception. I've included the test case proving the old behavior (uncommented) to make it easier for you to review.

I'm less sure about the exception handling change. With this, failures go unreported and potentially undetected. Also, changing from catch (TerminatingException) to catching all exceptions also feels like a bad idea, at least by default. What is the actual scenario you're concerned about here?

Any uncaught exception in the Proactor.Loop method will cause a process exit, even if it caused by something benign like being unable to correctly parse incoming data (like in this scenario with the IsCommand method). Unless the failure is truly catastrophic, it is very unlikely that this behavior is desirable. And when things truly are catastrophic, chances are that things will fail elsewhere too, making it less important for this code to tear down the process.

The solution I'm working on makes use of long-lived services and having them abruptly exit because they receive something unepected on the wire is quite problematic. This change was meant to protect against any other undiscovered bugs that might trigger an exception.

Ideally this would be two PRs so we could discuss each change separately. They seem unrelated.

I can split them up if you think it's better, but we are already using a custom-built NetMQ with the changes applied in production, so for me there is no immediate rush to get a new release out.

Irrespective of whether we split the PR or not, here are some possible paths forward to consider:

Revert the exception change and hope there isn't anything else that could potentially result in an exception.
Keep the change as-is, despite the mentioned flaws of catching everything.
Keep the change but add error reporting to surface any errors happening. Tthis would like imply referencing Microsoft.Extensions.Logging.Abstractions and passing an ILogger/ILogFactory on the ctor of various classes throughout NetMQ.
Keep the change but add a pluggable mechanism where users can inject or register an error handler (i.e. a method that consumes an Exception and reports back whether it has been handled or not; unhandled exceptions are then thrown). This allows us to keep the old behavior while still making exceptions catch-able for users that choose to provide a custom handler.

I think option 3 is the least attractive of these due to the number of files that likely have to be modified.

Option 4 is much less invasive, as this can be achieved with a static RegisterExceptionHandler on the Proactor, but it would still introduce new logic and thus potentially bugs. I'm not sure if there are other places in the code where a similar hook would be beneficial, but it seems like a thing you'd want to register once and have it apply everywhere relevant. Given my limited knowledge of the internals of NetMQ, going down this route would likely require someone to help me identify those places. This option likely also requires new documentation, so that users can find information on how it works and how to use it.

I'm biased towards option 2, obviously, with 1 being my second preference (mostly because options 3 and 4 seem like more work than justified for this), but let me know what you think and I will proceed accordingly.

drewnoakes commented 1 year ago

Thanks very much for adding tests.

Crashing due to a bug in NetMQ is expected. The proper course of action is to fix such bugs. Swallowing exceptions makes them harder to discover and fix, and can lead to incorrect program behaviour. For critical systems it's better to crash than continue in an undefined state.

Catching all exceptions can break all kinds of scenarios and isn't something we should be doing.

I did consider some designs to surface these exceptions but because Proactor is an internal class it's harder to plumb these through to public API.

mnmr commented 1 year ago

Thanks very much for adding tests.

It's me thanking you for helping to get the fix published.

Crashing due to a bug in NetMQ is expected. The proper course of action is to fix such bugs. Swallowing exceptions makes them harder to discover and fix, and can lead to incorrect program behaviour. For critical systems it's better to crash than continue in an undefined state.

I have reverted the commit that modified Proactor.

I did consider some designs to surface these exceptions but because Proactor is an internal class it's harder to plumb these through to public API.

It could be as simple as a public static NetMQExceptionHandler class, where people can register the error handler and anything inside NetMQ that needs it can easily grab it. Though there may be other concerns I haven't considered. Maybe not important now that we're opting for option 1 :)

drewnoakes commented 1 year ago

This looks great and I'll get it merged in.

If you wanted to create a new PR that explored having an opt-in way to handle exceptions that would otherwise crash the process, that would be great. As you mention, there could be a new static class. Or perhaps a static event on NetMQConfig. It'd have an event args with the exception, and a bool property to control whether to consider the exception handled or not (similar to CancelEventArgs).

mnmr commented 1 year ago

If you wanted to create a new PR that explored having an opt-in way to handle exceptions that would otherwise crash the process, that would be great. As you mention, there could be a new static class. Or perhaps a static event on NetMQConfig. It'd have an event args with the exception, and a bool property to control whether to consider the exception handled or not (similar to CancelEventArgs).

Using NetMQConfig would make sense for this. That said, I think we need to run into another crash-causing bug before I'd go any further with this (we've been using NetMQ for years already, and it's generally quite stable so don't expect much benefit from this work).

drewnoakes commented 1 year ago

Released in https://www.nuget.org/packages/NetMQ/4.0.1.11

Thanks again @mnmr!

zeromq / netmq

Fix issue where message is assumed to contain correct length byte (leading to process exit) #1050