zeromq / netmq

A 100% native C# implementation of ZeroMQ for .NET

PollerBase.ExecuteTimers() NullReferenceException #1010

Open valeriob opened 2 years ago

valeriob commented 2 years ago

Environment

NetMQ Version:    4.0.1.8
Operating System: Windows
.NET Version:     dotnet 6

Expected behaviour

Not killing the process :D

Actual behaviour

When this occurs, a NullReferenceException is thrown from PollerBase.ExecuteTimers() and the whole process crashes.

Steps to reproduce the behaviour

We frequently start and stop endpoints, each with a backend and a frontend socket plus a poller for async API execution. The crash happens more frequently if this code is used

            _poller.Remove(_frontend);
            _poller.Remove(_backend);
            _poller.Stop();

instead of

            _poller.Stop();
            _poller.Remove(_frontend);
            _poller.Remove(_backend);

It only happens if I enable heartbeat timeouts on the frontend socket:

_frontend.Options.HeartbeatInterval = TimeSpan.FromSeconds(5);
_frontend.Options.HeartbeatTimeout = TimeSpan.FromSeconds(1);

Is it possible that the code that handles timers is not really robust to this kind of connect/disconnect event? For example, what I think is happening is that the timer collection is modified while this method runs:

(three images, not preserved)

Thank you, Valerio

valeriob commented 2 years ago

I worked on this for a few days, but the best I could do was mitigate the problem via https://github.com/zeromq/netmq/pull/1011. The whole mechanism is not clear to me: it looks like more than one timer is created for the same Sink and Id, and timer cancellation is not very robust (it only cancels the first match it finds).
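
To illustrate the suspected failure mode, here is a minimal sketch of a timer registry where the same (Sink, Id) pair is registered twice and a cancel removes only the first match. This is not NetMQ's actual PollerBase code: `TimerRegistry`, `CancelFirst` and `CancelAll` are hypothetical names made up for the example, and the real data structure may differ.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified timer registry. NOT NetMQ's PollerBase;
// it only models the "duplicate timer + cancel first match" scenario.
public sealed class TimerRegistry
{
    private readonly List<(long Due, object Sink, int Id)> _timers = new();

    public int Count => _timers.Count;

    public void Add(long due, object sink, int id) => _timers.Add((due, sink, id));

    // Removes only the first entry matching (sink, id), as described in the comment above.
    public bool CancelFirst(object sink, int id)
    {
        int index = _timers.FindIndex(t => ReferenceEquals(t.Sink, sink) && t.Id == id);
        if (index < 0) return false;
        _timers.RemoveAt(index);
        return true;
    }

    // Removes every entry matching (sink, id) and returns how many were removed.
    public int CancelAll(object sink, int id) =>
        _timers.RemoveAll(t => ReferenceEquals(t.Sink, sink) && t.Id == id);
}

public static class Program
{
    public static void Main()
    {
        var sink = new object();
        var registry = new TimerRegistry();

        registry.Add(100, sink, 1);
        registry.Add(200, sink, 1); // duplicate registration for the same sink and id

        registry.CancelFirst(sink, 1);
        Console.WriteLine(registry.Count); // prints 1: a stale timer is still registered

        registry.CancelAll(sink, 1);
        Console.WriteLine(registry.Count); // prints 0
    }
}
```

If the real code behaves like `CancelFirst`, a leftover timer could still fire after its sink has been torn down, which would be consistent with the crash. That is a guess, not something confirmed here.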