Open awelzel opened 1 year ago
> I presume the broker websocket interface makes addressing this a bit moot
I'm not so sure about that. If we replace subscribers with WebSocket clients, we'll run into the same issues: each WebSocket connection still opens a socket at the end of the day. How much is a "large number of workers"? I suppose well below 1k?

IIRC, the default limit for open file handles is ~1k. I'm assuming that the number of workers is well below that. How did Zeek end up reaching that limit in the first place? Is it maybe using one consumer per peer/topic? Because consumers use flares (i.e., pipes), users aren't really supposed to have a large number of them.
Even if we changed `make_subscriber` to throw an exception: exhausting all file handles probably leads to something crashing (good, because it at least tells you what's wrong) or behaving in very strange and buggy ways (bad, because that's extremely hard to debug).
> How much is a "large number of workers"? I suppose well below 1k?
The configuration that triggered the error was 9 worker nodes with 16 workers each; fewer than that was okay. Doubling both might not be unrealistic in a very large multi-worker cluster, and that's then already ~640 workers (individual Zeek processes).
> How did Zeek end up reaching that limit in the first place? Is it maybe using one consumer per peer/topic?
It's the zeekctl Python process that reached the limit. That just peers with each of the individual processes on demand, IIUC.
I hope this helps a bit.
> Even if we changed `make_subscriber` to throw an exception: exhausting all file handles probably leads to something crashing (good, because it at least tells you what's wrong) or behaving in very strange and buggy ways (bad, because that's extremely hard to debug).
Yeah, crashing would definitely be preferred.
A failing `broker::endpoint::make_subscriber()` should not raise SIGABRT.

A user on Slack reported zeekctl coredumping when configuring a large number of workers. Decreasing the number of workers avoided the coredumps. The coredumps were truncated by systemd-coredump, making it difficult to figure out where the error occurred.
Running `zeekctl` under GDB showed the following:

When running zeekctl without `gdb`, the following message wasn't visible (might be a zeekctl thing). It would have helped otherwise.

Having that error log is good, but `libbroker.so` API functions should not raise SIGABRT or any other signal on errors, particularly errors that could be recovered from or reported better by the embedding code. An `abort()` takes down the whole process, making it hard to understand where the issue was. Instead, the error should be reported back to the caller as an exception or error code.

I presume the broker WebSocket interface makes addressing this a bit moot, as embedding libbroker will become rare in the future, but I'm creating the ticket for reference.