zerotier / libzt

Encrypted P2P sockets over ZeroTier
https://zerotier.com

ENFILE (file table overflow) after running selftest for a few hours (unix port) #27

Open joseph-henry opened 6 years ago

joseph-henry commented 6 years ago

lwIP sometimes reports ENFILE (too many open files on the system), but the error appears to come only from lwIP's connection allocator and not from a system-wide limit.

I suspect the maximum number of descriptors is being issued for lwIP netconns and that they aren't being freed properly by free_socket().

robinsloan commented 2 years ago

I know this is an old issue, but I just ran into this running libzt overnight, and I'd appreciate some guidance, if you can spare it. To your knowledge, does this lwIP bug/state prevent new connections? (If so, maybe I could set up some logic to restart the ZT node when ENFILE appears in zts_errno?) Or can it be ignored?

joseph-henry commented 2 years ago

No problem. It's been a while since I've poked around in that area of the code, but I do believe that would limit the creation of new connections, so a restart would be necessary. Do you know what the number of the last successfully created fd was? I think a general limit of 1024 exists and can be adjusted by configuring the MEMP_NUM_* constants in src/lwipopts.h. If this does seem to be your issue, I can consider bumping that number up a bit.
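For reference, a rough sketch of what raising those pools might look like. The macro names are standard lwIP options, but the values are illustrative assumptions, not the numbers libzt actually ships in src/lwipopts.h. In lwIP's socket layer the socket table is sized from MEMP_NUM_NETCONN, so that is likely the one that matters most here; every entry costs memory, so larger pools mean a larger footprint.

```c
/* Illustrative values only (assumptions); check the current settings in
 * src/lwipopts.h before changing anything. */
#define MEMP_NUM_NETCONN        2048  /* netconn structs; lwIP sizes its socket table from this */
#define MEMP_NUM_UDP_PCB        2048  /* UDP protocol control blocks */
#define MEMP_NUM_TCP_PCB        2048  /* active TCP protocol control blocks */
#define MEMP_NUM_TCP_PCB_LISTEN 64    /* listening TCP protocol control blocks */
```

Note that if descriptors really are leaking, a bigger pool would only delay the ENFILE rather than prevent it.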

robinsloan commented 2 years ago

I just ran a test program this afternoon: zts_node_start followed by zts_net_join (an ad-hoc network) and then I let it sit, with a little loop that, once a minute, makes a zts_udp_server and closes it immediately. I log the returned fd, which is (as expected) always 0.

I saw my first one of these

socket(unix): Too many open files
socket: Too many open files

after about two hours. Then -- I don't know if this is interesting or not -- the program loops placidly again, without errors, until I get another Too many open files, exactly five minutes later. Then it repeats. An odd little cycle; something's on a timer inside libzt or lwIP?

Even in the loops that produce a Too many open files message, I get back an fd of 0 rather than an error code; of course, I don't know what would happen if I tried to use that socket.
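Since the returned fd can't be trusted here, the restart idea from earlier would have to key off the error variable instead. A minimal sketch of that bookkeeping in C, under some loud assumptions: `enfile_tracker` and `enfile_tracker_record` are hypothetical names (not part of libzt), the error value is assumed to use the same numbering as `<errno.h>`, and the caller is assumed to clear the error variable before each attempt so a stale value isn't counted twice. Whether zts_errno is actually set when these messages print is not something I've confirmed.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical helper; none of these names come from libzt. */
typedef struct {
    int consecutive_enfile;  /* ENFILE results seen in a row */
    int restart_threshold;   /* suggest a restart once this many are seen */
} enfile_tracker;

/*
 * Record the outcome of one socket-creation attempt.
 * fd  : value returned by the call (ignored on purpose, see below)
 * err : error variable read right after the call; assumed to follow
 *       <errno.h> numbering and to have been cleared before the call
 * Returns true when the caller should consider restarting the node.
 */
bool enfile_tracker_record(enfile_tracker *t, int fd, int err)
{
    (void)fd;  /* the fd was observed to be 0 even when the error printed */

    if (err == ENFILE) {
        t->consecutive_enfile++;
    } else {
        t->consecutive_enfile = 0;  /* any other outcome resets the streak */
    }
    return t->consecutive_enfile >= t->restart_threshold;
}
```

The threshold exists because the errors seem to come and go on a cycle; restarting on the very first ENFILE might be more aggressive than needed, but that is a judgment call.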

(I also get bursts of recv: Connection reset by peer messages on a five-minute cadence, but I have always seen those & figured they were benign libzt diagnostic messages. Just mentioning for the sake of completeness.)