Closed richard-powers closed 3 months ago
Set workers to 1. If you set it to 0, facil.io will start roughly #cpus / 2 worker processes, so you can't share state between them. I haven't looked too closely at what you're doing, but this seems to me the most likely cause.
Zap runs forever, no matter how long you wait between requests.
Yeah, you're sharing at least the global allocator. For multiple workers you'd need one allocator per worker process.
I've tried one worker and many threads, with the same issue
OK, I am convinced this is an issue with your implementation: I have had many zap servers in production for nearly a year now, some with weeks between requests (bulk jobs), and they run fine. The GPA is probably not thread-safe by default, so you have to set its config option `.thread_safe = true`. And set workers to 1, or else you cannot share any allocator, thread-safe or not.
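A minimal sketch of that suggestion, assuming a single worker process (`workers = 1`) whose threads all share one allocator:

```zig
const std = @import("std");

// GeneralPurposeAllocator explicitly configured with .thread_safe = true,
// so one instance can be shared by all threads of a single worker process.
var gpa = std.heap.GeneralPurposeAllocator(.{ .thread_safe = true }){};

pub fn main() !void {
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // Any thread may now allocate and free through `allocator` safely.
    const buf = try allocator.alloc(u8, 32);
    defer allocator.free(buf);

    std.debug.print("allocated {d} bytes\n", .{buf.len});
}
```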
Is there a way to provide an allocator per thread?
You could define a thread-local optional defaulting to null and set it on first use if it's still null, for example.
I would suggest trying with the GPA set to thread-safe first.
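A hypothetical sketch of the thread-local pattern described above (names are illustrative): a `threadlocal` optional GPA that starts as null and is initialized lazily by each thread on first use, so no allocator state is shared across threads.

```zig
const std = @import("std");

// One optional GPA per thread; null until that thread first asks for it.
threadlocal var thread_gpa: ?std.heap.GeneralPurposeAllocator(.{}) = null;

fn threadAllocator() std.mem.Allocator {
    if (thread_gpa == null) {
        // Lazily initialize this thread's private allocator.
        thread_gpa = std.heap.GeneralPurposeAllocator(.{}){};
    }
    return thread_gpa.?.allocator();
}

pub fn main() !void {
    const allocator = threadAllocator();
    const buf = try allocator.alloc(u8, 16);
    defer allocator.free(buf);

    std.debug.print("allocated {d} bytes\n", .{buf.len});
}
```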
I am closing this as it doesn't seem to be caused by zap. You can check out the examples and see that they run forever.
Mind you, even with one allocator per thread keeping your allocations and deallocations within a thread, sharing state between worker processes will not work. This has nothing to do with zap or Zig; it is basic common sense, since workers are separate processes. If you really know what you're doing, you could use shared memory. An easier option is an external process (a database, a keyed hashmap service, redis, ...), but all of that depends highly on your specific requirements.
I've set the allocator to be thread safe, set workers and threads to 1, and this issue still happens. Shouldn't this mean there's only 1 process and 1 thread accessing the data at any point in time? And why would this only be a problem after 60 seconds?
Apologies if it's still unrelated to the framework, but I still can't find a reason why this would be happening.
If you actually look at the stack trace you get, you can clearly see that the error occurs in the pg module, which explicitly states that it maintains a connection pool (so it likely does some connection management, which could explain the 60-second idle behavior). So why you think the error is located in zap is beyond me.
I had a brief look at pg.zig, and based on the error trace you're getting: on release of the connection, pg.zig checks the connection and concludes that it is not idle (which, according to the comments, indicates an error, likely caused by how you're using it), so it tries to reconnect. However, in initDatabase() you destroy the envMap via defer. Hence your host string pointer is invalid, causing a segfault when the std.net code tries to resolve the TCP address for the new connection. You should dupe() the strings before destroying their underlying memory. You're welcome.
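A sketch of that fix (the `DB_HOST` variable name is illustrative, not taken from the repro): dupe() the values you need out of the env map before its deferred deinit frees the memory they point into.

```zig
const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    var env_map = try std.process.getEnvMap(allocator);
    defer env_map.deinit(); // frees all keys and values when this scope exits

    // BAD: env_map.get() returns a slice into env_map's own memory; after
    // deinit, a later reconnect would dereference freed memory (the segfault).
    // GOOD: copy the value into memory you own before the map goes away.
    const host = try allocator.dupe(u8, env_map.get("DB_HOST") orelse "localhost");
    defer allocator.free(host);

    std.debug.print("host: {s}\n", .{host});
}
```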
Another thing I've now seen for the second time: creating an arena for every request. I don't understand why people do that. It works against the performance gains you could get from an arena, and you potentially even serialize all threads on the allocator that has to create the arena. I would recommend a thread-local arena in your endpoint that you reset() when done.
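A sketch of that thread-local arena pattern (the handler name and signature are illustrative, not zap's actual API): one arena per thread, reset after each request instead of created and destroyed per request.

```zig
const std = @import("std");

// One arena per thread; null until the first request on that thread.
threadlocal var arena: ?std.heap.ArenaAllocator = null;

fn onRequest() !void {
    if (arena == null) {
        arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    }
    // Free all per-request allocations at once, but keep the arena's
    // buffers around for the next request on this thread.
    defer _ = arena.?.reset(.retain_capacity);

    const a = arena.?.allocator();
    const scratch = try a.alloc(u8, 256); // per-request scratch space
    _ = scratch;
}

pub fn main() !void {
    try onRequest();
    try onRequest(); // second call reuses the retained arena memory
}
```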
I am not very amused that you apparently didn't check the error trace, or else you wouldn't have opened an issue here. I don't like the "it must be your bug because I can't find a reason..." attitude (when not even bothering to read the error trace), nor the insistence on "you fix my code because I can't find the bug". I would prefer that people check the errors that point them to the source. I would at least have understood if you had opened an issue at pg.zig or the Zig standard library, based on the stack trace.
Other people's time is valuable, too.
Now that I got that off my chest: kudos to you for trying such a project with zig and zap. Both are relatively new and not so widespread. Oh, and unless you have super high loads, don't bother with the whole thread-local thing I like to advocate for until you're comfortable with it. Your code is likely to be fast enough. Might be thrashing memory though. But such optimizations can actually be done later, when you're happy with the overall functionality.
I didn't intend to suggest it was solely zap's "fault", I apologize if it came off that way. I saw the error was happening inside the pg.zig library, but thought it may be because of a misunderstanding I had with threads and workers with Zap, after reading an earlier comment of yours.
Thanks for looking into this, and for your suggestions! I'll be working to study and implement what you've mentioned.
I have a server running that works totally fine, unless you wait 60+ seconds between queries, then you get a segfault:
Stack trace:
I've tried changing the `timeout` option for `zap.Endpoint.Listener.init` (tried 0, 255, and unset), but the result is the same. Maybe it's something else? I've created a reproduction here: https://github.com/richard-powers/server-error-repro After the server is up and running, I run:
http 'http://127.0.0.1:3333/areas?companyId=1'; sleep 1m; http 'http://127.0.0.1:3333/areas?companyId=1';
The segfault appears after the 2nd query