zigzap / zap

blazingly fast backends in zig
MIT License

Segfault after 60 seconds of inactivity #86

Closed: richard-powers closed this issue 3 months ago

richard-powers commented 3 months ago

I have a server that works totally fine unless you wait 60+ seconds between queries, at which point it segfaults.

Stack trace:

```
5432
Listening on 0.0.0.0:3333
127.0.0.1 - - [Thu, 28 Mar 2024 17:38:39 GMT] "GET /areas HTTP/1.1" 200 210b 64900us
Segmentation fault at address 0x790716562000
/home/user/zig/0.11.0/files/lib/compiler_rt/memcpy.zig:19:21: 0x667660 in memcpy (compiler_rt)
            d[0] = s[0];
                    ^
/home/user/zig/0.11.0/files/lib/std/mem/Allocator.zig:327:20: 0x3eed9c in dupeZ__anon_12285 (server)
    @memcpy(new_buf[0..m.len], m);
                   ^
/home/user/zig/0.11.0/files/lib/std/net.zig:858:43: 0x381891 in getAddressList (server)
        const name_c = try allocator.dupeZ(u8, name);
                                          ^
/home/user/zig/0.11.0/files/lib/std/net.zig:709:36: 0x2aa0e8 in tcpConnectToHost (server)
    const list = try getAddressList(allocator, name, port);
                                   ^
/home/user/.cache/zig/p/122008c4873ab9964cd1b0df4244a24a36602723aca86d2d601a4a23a7d06bc433d6/src/conn.zig:102:40: 0x2a9656 in open (server)
    break :blk try net.tcpConnectToHost(allocator, host, port);
                                       ^
/home/user/.cache/zig/p/122008c4873ab9964cd1b0df4244a24a36602723aca86d2d601a4a23a7d06bc433d6/src/pool.zig:218:20: 0x3786be in newConnection (server)
 conn.* = Conn.open(allocator, opts.connect) catch |err| {
                   ^
/home/user/.cache/zig/p/122008c4873ab9964cd1b0df4244a24a36602723aca86d2d601a4a23a7d06bc433d6/src/pool.zig:120:31: 0x4b5490 in release (server)
   conn_to_add = newConnection(self, true) catch {
                              ^
/home/user/.cache/zig/p/122008c4873ab9964cd1b0df4244a24a36602723aca86d2d601a4a23a7d06bc433d6/src/conn.zig:176:15: 0x446c24 in release (server)
  pool.release(self);
              ^
/home/user/programming/zig/server-repro/src/endpoints/areas.zig:56:23: 0x448162 in queryAreas (server)
    defer conn.release();
                      ^
/home/user/programming/zig/server-repro/src/endpoints/areas.zig:48:40: 0x448be6 in _get (server)
    const productCodes = try queryAreas(arenaAlloc, companyId);
                                       ^
/home/user/programming/zig/server-repro/src/endpoints/areas.zig:34:9: 0x3ba1a0 in get (server)
    _get(self, r) catch |err| {
        ^
/home/user/.cache/zig/p/1220d4802fb09d4e99c0e7265f90d6f3cfdc3e5e31c1b05f0924ee2dd26d9d6dbbf4/src/endpoint.zig:64:36: 0x43d122 in onRequest (server)
        .GET => self.settings.get.?(self, r),
                                   ^
/home/user/.cache/zig/p/1220d4802fb09d4e99c0e7265f90d6f3cfdc3e5e31c1b05f0924ee2dd26d9d6dbbf4/src/endpoint.zig:320:32: 0x3b9d50 in onRequest (server)
                    e.onRequest(r);
                               ^
/home/user/.cache/zig/p/1220d4802fb09d4e99c0e7265f90d6f3cfdc3e5e31c1b05f0924ee2dd26d9d6dbbf4/src/zap.zig:224:27: 0x37b2be in theOneAndOnlyRequestCallBack (server)
                on_request(req);
                          ^
/home/user/.cache/zig/p/1220d4802fb09d4e99c0e7265f90d6f3cfdc3e5e31c1b05f0924ee2dd26d9d6dbbf4/facil.io/lib/facil/http/http_internal.c:53:3: 0x645b7b in http_on_request_handler______internal (/home/user/.cache/zig/p/1220d4802fb09d4e99c0e7265f90d6f3cfdc3e5e31c1b05f0924ee2dd26d9d6dbbf4/facil.io/lib/facil/http/http_internal.c)
  settings->on_request(h);
  ^
/home/user/.cache/zig/p/1220d4802fb09d4e99c0e7265f90d6f3cfdc3e5e31c1b05f0924ee2dd26d9d6dbbf4/facil.io/lib/facil/http/http1.c:553:3: 0x654011 in http1_on_request (/home/user/.cache/zig/p/1220d4802fb09d4e99c0e7265f90d6f3cfdc3e5e31c1b05f0924ee2dd26d9d6dbbf4/facil.io/lib/facil/http/http1.c)
  http_on_request_handler______internal(&http1_pr2handle(p), p->p.settings);
  ^
/home/user/.cache/zig/p/1220d4802fb09d4e99c0e7265f90d6f3cfdc3e5e31c1b05f0924ee2dd26d9d6dbbf4/facil.io/lib/facil/http/parsers/http1_parser.h:859:9: 0x65360a in http1_parse (/home/user/.cache/zig/p/1220d4802fb09d4e99c0e7265f90d6f3cfdc3e5e31c1b05f0924ee2dd26d9d6dbbf4/facil.io/lib/facil/http/http1.c)
    if (((parser->state.reserved & HTTP1_P_FLAG_RESPONSE)
        ^
/home/user/.cache/zig/p/1220d4802fb09d4e99c0e7265f90d6f3cfdc3e5e31c1b05f0924ee2dd26d9d6dbbf4/facil.io/lib/facil/http/http1.c:689:9: 0x652fc3 in http1_consume_data (/home/user/.cache/zig/p/1220d4802fb09d4e99c0e7265f90d6f3cfdc3e5e31c1b05f0924ee2dd26d9d6dbbf4/facil.io/lib/facil/http/http1.c)
    i = http1_parse(&p->parser, p->buf + (org_len - p->buf_len), p->buf_len);
        ^
/home/user/.cache/zig/p/1220d4802fb09d4e99c0e7265f90d6f3cfdc3e5e31c1b05f0924ee2dd26d9d6dbbf4/facil.io/lib/facil/http/http1.c:775:3: 0x650d45 in http1_on_data_first_time (/home/user/.cache/zig/p/1220d4802fb09d4e99c0e7265f90d6f3cfdc3e5e31c1b05f0924ee2dd26d9d6dbbf4/facil.io/lib/facil/http/http1.c)
  http1_consume_data(uuid, p);
  ^
/home/user/.cache/zig/p/1220d4802fb09d4e99c0e7265f90d6f3cfdc3e5e31c1b05f0924ee2dd26d9d6dbbf4/facil.io/lib/facil/fio.c:2213:3: 0x611755 in deferred_on_data (/home/user/.cache/zig/p/1220d4802fb09d4e99c0e7265f90d6f3cfdc3e5e31c1b05f0924ee2dd26d9d6dbbf4/facil.io/lib/facil/fio.c)
  pr->on_data((intptr_t)uuid, pr);
  ^
/home/user/.cache/zig/p/1220d4802fb09d4e99c0e7265f90d6f3cfdc3e5e31c1b05f0924ee2dd26d9d6dbbf4/facil.io/lib/facil/fio.c:1011:3: 0x61096a in fio_defer_perform_single_task_for_queue (/home/user/.cache/zig/p/1220d4802fb09d4e99c0e7265f90d6f3cfdc3e5e31c1b05f0924ee2dd26d9d6dbbf4/facil.io/lib/facil/fio.c)
  task.func(task.arg1, task.arg2);
  ^
/home/user/.cache/zig/p/1220d4802fb09d4e99c0e7265f90d6f3cfdc3e5e31c1b05f0924ee2dd26d9d6dbbf4/facil.io/lib/facil/fio.c:1049:10: 0x6108f4 in fio_defer_perform (/home/user/.cache/zig/p/1220d4802fb09d4e99c0e7265f90d6f3cfdc3e5e31c1b05f0924ee2dd26d9d6dbbf4/facil.io/lib/facil/fio.c)
         fio_defer_perform_single_task_for_queue(&task_queue_normal) == 0)
         ^
/home/user/.cache/zig/p/1220d4802fb09d4e99c0e7265f90d6f3cfdc3e5e31c1b05f0924ee2dd26d9d6dbbf4/facil.io/lib/facil/fio.c:3793:5: 0x616d8b in fio_worker_startup (/home/user/.cache/zig/p/1220d4802fb09d4e99c0e7265f90d6f3cfdc3e5e31c1b05f0924ee2dd26d9d6dbbf4/facil.io/lib/facil/fio.c)
    fio_defer_perform();
    ^
/home/user/.cache/zig/p/1220d4802fb09d4e99c0e7265f90d6f3cfdc3e5e31c1b05f0924ee2dd26d9d6dbbf4/facil.io/lib/facil/fio.c:3942:3: 0x616a0b in fio_start (/home/user/.cache/zig/p/1220d4802fb09d4e99c0e7265f90d6f3cfdc3e5e31c1b05f0924ee2dd26d9d6dbbf4/facil.io/lib/facil/fio.c)
  fio_worker_startup();
  ^
/home/user/.cache/zig/p/1220d4802fb09d4e99c0e7265f90d6f3cfdc3e5e31c1b05f0924ee2dd26d9d6dbbf4/src/zap.zig:94:18: 0x37c37a in start (server)
    fio.fio_start(args);
                 ^
/home/user/programming/zig/server-repro/src/main.zig:60:18: 0x37bde8 in main (server)
        zap.start(.{
                 ^
/home/user/zig/0.11.0/files/lib/std/start.zig:574:37: 0x37ca6e in main (server)
            const result = root.main() catch |err| {
                                    ^
???:?:?: 0x790716394ccf in ??? (libc.so.6)
Unwind information for `libc.so.6:0x790716394ccf` was not available, trace may be incomplete
```

I've tried changing the timeout option for zap.Endpoint.Listener.init (tried 0, 255, and unset), but the result is the same. Maybe it's something else?

I've created a reproduction here: https://github.com/richard-powers/server-error-repro

After the server is up and running, I run:

```
http 'http://127.0.0.1:3333/areas?companyId=1'; sleep 1m; http 'http://127.0.0.1:3333/areas?companyId=1'
```

The segfault appears after the 2nd query.

renerocksai commented 3 months ago

Set workers to 1. If you set it to 0, facil.io will start roughly #cpus / 2 workers, and you can't share state between worker processes. I haven't looked too closely at what you're doing, but it seems to me that this is the most likely cause.

Zap runs forever, no matter how long you wait between requests.
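
For the workers setting, a minimal sketch (listener setup elided; the thread count is arbitrary):

```zig
// One worker process: in-process state (allocators, globals) is not
// duplicated across forked workers. Threads share the process memory.
zap.start(.{
    .threads = 4,
    .workers = 1,
});
```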

renerocksai commented 3 months ago

Yeah, you're sharing at least the global allocator. For multiple workers you'd need one allocator per worker process.

richard-powers commented 3 months ago

I've tried one worker and many threads, with the same issue

renerocksai commented 3 months ago

OK: having run many zap servers in production for nearly a year now, some with weeks between requests (bulks), I am convinced this is an issue with your implementation. The GPA is probably not thread-safe by default, so you have to set its config option .thread_safe = true. And set workers to 1, or else you cannot share any allocator, thread-safe or not.
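
A minimal sketch of explicitly enabling that option (assuming Zig 0.11's std.heap API):

```zig
const std = @import("std");

// Explicitly opt in to the thread-safe GPA variant so multiple request
// threads can allocate through the same instance without data races.
var gpa = std.heap.GeneralPurposeAllocator(.{ .thread_safe = true }){};

pub fn main() !void {
    const allocator = gpa.allocator();
    _ = allocator; // pass this to your pool / endpoints
}
```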

richard-powers commented 3 months ago

Is there a way to provide an allocator per thread?

renerocksai commented 3 months ago

You could define a thread-local optional that defaults to null and set it on first use if it's null, for example.
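
Something along these lines (a sketch only; the names are made up):

```zig
const std = @import("std");

// Lazily create one allocator per thread; `threadlocal` gives each
// thread its own instance of this variable.
threadlocal var gpa: ?std.heap.GeneralPurposeAllocator(.{}) = null;

fn threadAllocator() std.mem.Allocator {
    if (gpa == null) {
        gpa = std.heap.GeneralPurposeAllocator(.{}){};
    }
    return gpa.?.allocator();
}
```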

I would suggest trying with the GPA set to thread-safe first.

renerocksai commented 3 months ago

I am closing this as it doesn't seem to be caused by zap. You can check out the examples and see that they run forever.

renerocksai commented 3 months ago

Mind you, even if you have one allocator per thread, keeping your allocations and deallocations within one thread, sharing state between worker processes will not work. This has nothing to do with zap or zig; it is basic common sense. If you really know what you're doing, you could use shared memory. An easier option is an external process (a DB, redis, some message-based hashmap, ...), but all of that depends highly on your specific requirements.

richard-powers commented 3 months ago

I've set the allocator to be thread-safe, set workers and threads to 1, and the issue still happens. Shouldn't that mean there's only 1 process and 1 thread accessing the data at any point in time? And why would this only be a problem after 60 seconds?

Apologies if it's still unrelated to the framework, but I still can't find a reason why this would be happening.

renerocksai commented 3 months ago

If you (actually) look at the stack trace you get, you can clearly see that the error occurs in the pg module, which explicitly states that it maintains a connection pool (so it likely does some connection management, which could explain the 60-seconds-idle thing). So why you think the error is located in zap is beyond me.

I had a brief look at pg.zig, and based on the error trace, here is what you're getting: on release of the connection, the pool checks the connection and concludes that it is not idle (which, according to the comments, is supposed to be an error, likely caused by how you're using it), so it tries to reconnect. However, in initDatabase() you destroy the envMap via defer. Hence, your host string pointer is invalid, causing a segfault when the std.net code tries to figure out the TCP address for the new connection. You should dupe() the strings before destroying their underlying memory. You're welcome.
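
In other words, roughly (a sketch, not your actual code; the env variable name is illustrative and an `allocator` is assumed in scope):

```zig
var env = try std.process.getEnvMap(allocator);
defer env.deinit(); // frees the buffers every returned slice points into

// BAD: `host` points into the env map and dangles once deinit runs.
// const host = env.get("DB_HOST").?;

// GOOD: copy the value so it outlives the env map.
const host = try allocator.dupe(u8, env.get("DB_HOST").?);
```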

Another thing I've now seen for the second time: creating an arena for every request. I don't get why people do that. It works against the performance gains you could get from an arena, and you potentially even serialize all threads on the allocator that has to create the arena. I would recommend a thread-local arena in your endpoint that you reset() when done.
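
Sketched out (the handler name and shape are illustrative, not zap's exact API):

```zig
const std = @import("std");

// One arena per thread, reused across requests. reset() frees this
// request's allocations but can retain the buffers for the next one.
threadlocal var req_arena: ?std.heap.ArenaAllocator = null;

fn handleRequest() void {
    if (req_arena == null) {
        // c_allocator is fine here: zap already links libc via facil.io.
        req_arena = std.heap.ArenaAllocator.init(std.heap.c_allocator);
    }
    const alloc = req_arena.?.allocator();
    defer _ = req_arena.?.reset(.retain_capacity);

    // ... do the per-request work with `alloc` ...
    _ = alloc;
}
```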

I am not very amused that you apparently didn't check the error trace, or else you wouldn't have opened an issue here. I don't like the "it must be your bug because I can't find a reason" attitude (when not even bothering to read the error trace), nor the insistence on "you fix my code because I can't find the bug". I would prefer if people checked the errors that point them to the source. I would somehow have understood if you had raised an issue at pg.zig or the zig standard library, based on the stack trace.

Other people's time is valuable, too.

renerocksai commented 3 months ago

Now that I've got that off my chest: kudos to you for trying such a project with zig and zap; both are relatively new and not so widespread. Oh, and unless you have super high loads, don't bother with the whole thread-local thing I like to advocate for until you're comfortable with it. Your code is likely to be fast enough, though it might be thrashing memory. Such optimizations can be done later, once you're happy with the overall functionality.

richard-powers commented 3 months ago

I didn't intend to suggest it was solely zap's "fault"; I apologize if it came off that way. I saw the error was happening inside the pg.zig library, but thought it might stem from a misunderstanding on my part about threads and workers in zap, after reading an earlier comment of yours.

Thanks for looking into this, and for your suggestions! I'll study and implement what you've mentioned.