zigzap / zap

blazingly fast backends in zig

TLS - Random requests stalling & occasional segfault [raspi aarch64] #64

Closed: Vemahk closed this issue 4 months ago

Vemahk commented 6 months ago

Yo. I'm still looking into this and am going to start digging into zap and facil to see if I can figure out what's going on, but I wanted to drop what I'm seeing here first so I don't lose it. I'm also going to try to set up a mini version of this to narrow down what's happening. I'll be back here to update with more info, but until then...

(reproducible version @ https://github.com/Vemahk/zap-issue-64)

After plugging in TLS and getting the certs squared away, I immediately noticed maybe a third of the requests made to the server would stall for about 6 seconds, and there was an accompanying "UUID error". The client gets the right response after the stall, but it looks like on the server side, it fails and retries internally:

192.168.0.100 - - [Sat, 06 Jan 2024 03:59:21 GMT] "GET / HTTP/1.1" 200 416b 577us
192.168.0.100 - - [Sat, 06 Jan 2024 03:59:27 GMT] "GET / HTTP/1.1" 200 416b 873us
192.168.0.100 - - [Sat, 06 Jan 2024 03:59:27 GMT] "GET /time HTTP/1.1" 200 33b 649us
UUID error: 0x810 (0)
No errno handler: Success

Above, the first request to / was replayed 6 seconds later. I think the UUID error is just late to the party, and the /time call snuck in.

I looked around in zap and found that the UUID error is probably coming from fio.c:3035, which looks to be something to do with actually writing to the response stream...

Some extra notes: it seems like which requests stall is relatively random. I can spam refresh and some make it through, some stall. Eventually, I segfaulted it by continuing to spam refresh from the browser:

192.168.0.100 - - [Sat, 06 Jan 2024 03:49:28 GMT] "GET / HTTP/1.1" 200 416b 671us
Segmentation fault at address 0xfff06bf95fb0
???:?:?: 0xffff9d9d28d8 in ??? (libcrypto.so.3)
Unwind information for `libcrypto.so.3:0xffff9d9d28d8` was not available, trace may be incomplete

???:?:?: 0x0 in ??? (???)
run http-test: error: the following command terminated unexpectedly:
cd /home/vemahk/coding/zig/http-test/zig-out && /home/vemahk/coding/zig/http-test/zig-out/bin/http-test
Build Summary: 7/9 steps succeeded; 1 failed (disable with --summary none)
run transitive failure
└─ run http-test failure
error: the following build command failed with exit code 1:
/home/vemahk/coding/zig/http-test/zig-cache/o/f6f8278f7afc1bca5b5ecd21b7918d65/build /home/vemahk/.apps/zig/zig /home/vemahk/coding/zig/http-test /home/vemahk/coding/zig/http-test/zig-cache /home/vemahk/.cache/zig run

Which is when I thought I should come back here and start writing things down.

The segfault happened inside libcrypto, which makes me think that it could possibly be something screwy with the version of libcrypto that I have? Geeze, I hope not.

Anyway, gonna go back to investigating. I'll come back to update with anything else I find.

Vemahk commented 6 months ago

Changed nothing and got a different error after spam refreshing...

192.168.0.100 - - [Sat, 06 Jan 2024 04:10:29 GMT] "GET / HTTP/1.1" 200 416b 692us
malloc(): unsorted double linked list corrupted
run http-test: error: the following command terminated unexpectedly:

... not what I was hoping to see, honestly.

Edit: adding another one:

free(): corrupted unsorted chunks

There aren't any line numbers or files with these, I'm not omitting them :(

Vemahk commented 6 months ago

Right, so, here's what I got before I get some zzz's.

https://github.com/Vemahk/zap-issue-64

Three commits: the first to start, the second to add the https sample (which does not have the issue), and the third to break it (adding .public_folder with a favicon).

My conclusion so far is it either...

  1. Has something weird to do with how facil/zap is pulling from the public_folder, or
  2. Has something weird to do with handling concurrent requests?

Not terribly sure to be honest.

Either way, I'm kinda tired and am gonna go to sleep. Will probably be staring at this tomorrow some, too.

Edit: well, I also went and made the change directly to the https example in zap, ran it from zap, and it worked just fine. ... what? So could it be a build issue? I'm so confused at this point 😢

Vemahk commented 6 months ago

I ended up being able to work around the issue by implementing a (crude) custom .public_folder check. So it's looking like this issue is specific to facil's .public_folder implementation?

https://github.com/Vemahk/zap-issue-64/tree/user-fix-1

renerocksai commented 5 months ago

Please describe the system on which you're seeing the above errors: Linux / macOS / ..? What architecture?

Also, what's the status now with your custom public folder implementation? Do you still see stalling / segfaults? Were the segfaults real segmentation faults or "just" crashes / exits?

Vemahk commented 5 months ago

Running Ubuntu 22.04.3 LTS aarch64 on my Raspberry Pi 4 (it's the only Linux machine I have up and running atm).

After implementing the custom public_folder implementation, I could no longer cause the stalling and segfaulting to occur.

I believe it was actually a segfault:

Segmentation fault at address 0xfff06bf95fb0
???:?:?: 0xffff9d9d28d8 in ??? (libcrypto.so.3)
Unwind information for `libcrypto.so.3:0xffff9d9d28d8` was not available, trace may be incomplete

???:?:?: 0x0 in ??? (???)

I did get a couple other memory related errors from malloc and free, though, on different runs:

malloc(): unsorted double linked list corrupted
free(): corrupted unsorted chunks

Vemahk commented 5 months ago

Well, I just had a test run with my public_folder changes that crashed with the free(): ... message from above. It doesn't seem to have the stalling issue, though, so the two may be different issues.

The stalling also doesn't seem to be happening with cURL, just my browser, so that might also be another bad lead T.T

renerocksai commented 5 months ago

Thx for the update. I have never seen public-folder-related issues in zap before, but I haven't used them extensively. Just for "fun", have you tried disabling openssl to see if the erroneous behaviors still get triggered?

Just some ideas for reproducibility and gathering more info: you could use some (comptime) consts to configure your example (debug/release mode(s), with/without openssl, with builtin/custom public_folder, etc.), add a curl-based rapid-fire client script to try to trigger the behavior, and then report the configurations that others could try on their systems; see the sketch below.
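For illustration, a minimal sketch of such comptime switches, assuming the SimpleHttpListener API used elsewhere in this thread (the const names and port are hypothetical, and TLS wiring is omitted):

const zap = @import("zap");

// Hypothetical comptime switches; flip these and rebuild to test each combination.
const use_builtin_public_folder: bool = true; // builtin vs. custom file serving
const thread_count: i16 = 200; // raise to stress concurrency

fn onRequest(r: zap.SimpleRequest) void {
    r.sendBody("<html><body>hello</body></html>") catch return;
}

pub fn main() !void {
    var listener = zap.SimpleHttpListener.init(.{
        .port = 4443,
        .on_request = onRequest,
        // toggle between facil.io's builtin public folder and none (custom serving)
        .public_folder = if (use_builtin_public_folder) "public" else null,
        .log = true,
    });
    try listener.listen();
    zap.start(.{
        .threads = thread_count,
        .workers = 1,
    });
}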

It seems like your crypto libs are strange. Or something somewhere passes in rubbish data.

If it's browser-only: which browser? from what machine? maybe all that can help.

Vemahk commented 5 months ago

have you tried disabling openssl and see if the erroneous behaviors still get triggered?

Yes. More specifically, I was working on my project prior to the TLS changes, and I never saw this. Even on 0.2.5 without TLS enabled, there was no stalling or segfaulting (granted, I hadn't tried my spam-curl script on it since then, so I went and did that: no issues).

It seems like your crypto libs are strange. Or sth somewhere passes in rubbish data.

It's possible they're borked somehow. Though I should note: sometimes the segfault reports a different .so than libcrypto. Including, but not limited to, libc and ????? 😅

If it's browser-only: which browser? from what machine? maybe all that can help.

Brave connecting over the (local) network. My raspberry pi is terminal only, so I'm developing/running from the pi and connecting to it from my main windows PC using Brave.

I'll tinker some more with the reproducibility ideas. Every time I've had a theory so far, it's kinda had a few holes in it. I'm not convinced of the public_folder stuff anymore 😬, though it did seem to alleviate the stalling phenomenon. Thanks for the ideas.

renerocksai commented 5 months ago

BTW, here is my implementation, taken from my FrontendEndpoint. It's a custom thing (using the Endpoint API, but you can just re-use the get method impl). It makes sure you can't play relative-path tricks to escape out of the public folder, and it uses facil.io's sendfile implementation, disabling caching:

const std = @import("std");
const zap = @import("zap");

const Self = @This();

allocator: std.mem.Allocator,
endpoint: zap.SimpleEndpoint,
debug: bool = true,
settings: Settings,
www_root_cage: []const u8,
frontend_dir_absolute: []const u8,

fn log(self: *Self, comptime fmt: []const u8, args: anytype) void {
    if (self.debug) {
        std.debug.print("[frontend endpoint] - " ++ fmt, args);
    }
}

pub const Settings = struct {
    allocator: std.mem.Allocator,
    endpoint_path: []const u8,
    www_root: []const u8,
    index_html: []const u8,
};

pub fn init(settings: Settings) !Self {
    // endpoint path = frontend_dir
    if (settings.endpoint_path[0] != '/') {
        return error.FrontendDirMustStartWithSlash;
    }

    var ret: Self = .{
        .allocator = settings.allocator,
        .endpoint = zap.SimpleEndpoint.init(.{
            .path = settings.endpoint_path,
            .get = getFrontend,
            .post = null,
            .put = null,
            .delete = null,
        }),
        .settings = settings,
        .frontend_dir_absolute = undefined,
        .www_root_cage = undefined,
    };

    // remember abs path of www_root
    ret.www_root_cage = try std.fs.cwd().realpathAlloc(ret.allocator, settings.www_root);
    std.log.info("Frontend: using www_root: {s}", .{ret.www_root_cage});

    // check for endpoint_path within www_root_cage
    const root_dir = try std.fs.cwd().openDir(ret.www_root_cage, .{});

    // try to find the frontend subdir = endpoint_path without leading /
    const frontend_dir_stat = try root_dir.statFile(settings.endpoint_path[1..]);
    if (frontend_dir_stat.kind != .directory) {
        return error.NotADirectory;
    }

    // create frontend_dir_absolute for later
    ret.frontend_dir_absolute = try root_dir.realpathAlloc(ret.allocator, settings.endpoint_path[1..]);
    std.log.info("Frontend: using frontend root: {s}", .{ret.frontend_dir_absolute});

    // check if frontend_dir_absolute starts with www_root_cage,
    // to avoid weird symlinking leading outside the root dir
    if (!std.mem.startsWith(u8, ret.frontend_dir_absolute, ret.www_root_cage)) {
        return error.FrontendDirNotInRootDir;
    }

    return ret;
}

pub fn deinit(self: *Self) void {
    self.allocator.free(self.frontend_dir_absolute);
    self.allocator.free(self.www_root_cage);
}

pub fn getEndpoint(self: *Self) *zap.SimpleEndpoint {
    return &self.endpoint;
}

fn getFrontend(e: *zap.SimpleEndpoint, r: zap.SimpleRequest) void {
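    // recover the enclosing Self instance from its embedded endpoint field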
    const self = @fieldParentPtr(Self, "endpoint", e);
    self.getFrontendInternal(r) catch |err| {
        r.sendError(err, 505);
    };
}

fn getFrontendInternal(self: *Self, r: zap.SimpleRequest) !void {
    var fn_buf: [2048]u8 = undefined;
    if (r.path) |p| {
        var html_path: []const u8 = undefined;
        var is_root: bool = false;

        // check if we have to serve index.html
        if (std.mem.eql(u8, p, self.settings.endpoint_path)) {
            html_path = self.settings.index_html;
            is_root = true;
        } else if (p.len == self.settings.endpoint_path.len + 1 and p[p.len - 1] == '/') {
            html_path = self.settings.index_html;
            is_root = true;
        } else {
            // no
            html_path = p;
        }

        if (std.fmt.bufPrint(&fn_buf, "{s}{s}", .{ self.www_root_cage, html_path })) |fp| {
            // now check if the absolute path starts with the frontend cage
            if (std.mem.startsWith(u8, fp, self.frontend_dir_absolute)) {
                try r.setHeader("Cache-Control", "no-cache");
                try r.sendFile(fp);
                return;
            } // else 404 below
        } else |err| {
            std.debug.print("Error: {}\n", .{err});
            // continue with 404 below
        }
    }

    r.setStatus(.not_found);
    try r.setHeader("Cache-Control", "no-cache");
    try r.sendBody("<html><body><h1>404 - File not found</h1></body></html>");
}

I've used that successfully in production in many projects, with thousands of clients and hundreds of concurrent clients.

renerocksai commented 5 months ago

One more thing: openssl, the command-line tool, can be used as a server. You could check whether you see similar behavior with that:

Serving the current dir:

$ openssl s_server -accept 4443 -cert mycert.pem -key mykey.pem -WWW
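You could then fetch a file that exists in the served directory over TLS (index.html here is just a placeholder name):

$ curl -k https://127.0.0.1:4443/index.html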

(...and it's bedtime for me )

renerocksai commented 4 months ago

Closing this since it seems more like a raspi / aarch64 / FS kind of problem. Feel free to re-open.

richard-powers commented 1 month ago

I don't like necroing a thread, but I'm seeing the exact same issue, only while using TLS for our web server.

I've reproduced this on our live VPS (Ubuntu 22.04.4 LTS, openssl 3.0.10) as well as on my local machine (Arch Linux, openssl 3.3.0); both are x86_64, zig version 0.12.0.

Segmentation fault at address 0x30
???:?:?: 0x740f2c26ed64 in ??? (libcrypto.so.3)
Unwind information for `libcrypto.so.3:0x740f2c26ed64` was not available, trace may be incomplete

As well as the previously described errors.


Additional info:

It seems to only crash when using multiple threads (with only 1 worker). We normally use 200 threads with 1 worker, and while doing this I was able to crash the server after just short of 500 queries.

After limiting the server to 1 thread and 1 worker, I could no longer cause a crash. (It also seems like fewer threads = less likely to crash? This is difficult to verify.) I would normally believe this to be a problem somewhere in my code, but the server will not crash if TLS is disabled.
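For reference, a minimal sketch of the configuration that stayed stable for me (the same zap.start call as in the https example):

zap.start(.{
    .threads = 1,
    .workers = 1,
});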


I'm soon going to attempt to create a minimal repro so we can further investigate what may be causing this problem.

renerocksai commented 1 month ago

This is most interesting. Please keep investigating. I've had stuff in production for months, with relatively high loads but never more threads than cores: like 16 or fewer worker threads on the server (fast enough) per app. Also, the servers are zig 0.11.0 ones (older zap versions), and I have nginx dealing with SSL.

I just recalled having done some load tests for the SYCL app last year, so I might want to port those to current and use them as a basis for more elaborate tests (note to self).

richard-powers commented 1 month ago

It's just about the end of the day Friday for me, so I'll be working on that min-repro Monday morning. We're using zig 0.12.0 on our servers, just FYI. I'll keep you posted.

renerocksai commented 1 month ago

Thx. I'm on my way to SYCL in Milan, so I'll remain in CET. I doubt I'll have much time next week as it's a conference week. Just FYI.

richard-powers commented 1 month ago

I was able to recreate it with the https example simply by changing this:

zap.start(.{
    .threads = 200,
    .workers = 1,
});

I then ran a bit of bash to flood the endpoint with curl:

i=1
while true; do
  curl -k -X 'GET' 'https://0.0.0.0:4443' || break
  echo "$i"
  i=$((i+1))
  sleep 0.05;
done

The first time I tried, i was 1858 when it crashed. It isn't consistent, so you may need to let it run a little while, but it will crash eventually.

richard-powers commented 1 month ago

Should we make a new issue to track this problem, or perhaps reopen this one? I suspect it has something to do with facil.io under the hood, but haven't found anything concrete, yet.

renerocksai commented 1 month ago

Yes pls. You choose

richard-powers commented 1 month ago

Created https://github.com/zigzap/zap/issues/107