Open awelzel opened 1 year ago
> I presume the broker websocket interface makes addressing this a bit moot
I'm not so sure about that. If we replace subscribers with WebSocket clients, we'll run into the same issues: each WebSocket connection still opens a socket at the end of the day. How much is a "large number of workers"? I suppose well below 1k?

IIRC, the default limit for open file handles is ~1k. I'm assuming that the number of workers is well below that. How did Zeek end up reaching that limit in the first place? Is it maybe using one consumer per peer/topic? Because consumers use flares (i.e., pipes), users aren't really supposed to have a large number of them.
Even if we changed `make_subscriber` to throw an exception: exhausting all file handles probably leads to something crashing (good, because it at least tells you what's wrong) or behaving in very strange and buggy ways (bad, because that's extremely hard to debug).
> How much is a "large number of workers"? I suppose well below 1k?
The configuration that triggered the error was 9 worker nodes with 16 workers each; fewer than that was okay. Doubling both might not be unrealistic in a very large multi-worker cluster, and that's then already ~640 workers (individual Zeek processes).
> How did Zeek end up reaching that limit in the first place? Is it maybe using one consumer per peer/topic?
It's the zeekctl Python process that reached the limit. That just peers with each of the individual processes on demand, IIUC.
I hope this helps a bit.
> Even if we changed `make_subscriber` to throw an exception: exhausting all file handles probably leads to something crashing (good, because it at least tells you what's wrong) or behaving in very strange and buggy ways (bad, because that's extremely hard to debug).
Yeah, crashing would definitely be preferred.
A failing `broker::endpoint::make_subscriber()` should not raise SIGABRT.

A user on Slack reported zeekctl coredumping when configuring a large number of workers. Decreasing the number of workers avoided the coredumps. The coredumps were truncated by systemd-coredump, making it difficult to figure out where the error occurred.
Running `zeekctl` under GDB showed the following:

When running zeekctl without `gdb`, the following message wasn't visible (might be a zeekctl thing). It would have helped otherwise.

Having that error log is good, but `libbroker.so` API functions should not raise SIGABRT or any other signal on errors, particularly errors that could be recovered from or reported better by the embedding code. An `abort()` takes down the whole process, making it hard to understand where the issue was. Instead, the error should be reported back to the caller as an exception or error code.

I presume the broker WebSocket interface makes addressing this a bit moot, as embedding libbroker will become rare in the future, but I'm creating the ticket for reference.