ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License

Unexpected error occurs when running a Zig program on CentOS 5.8 #20959

Open gaogaogoo opened 1 month ago

gaogaogoo commented 1 month ago

Zig Version

0.13.0

Steps to Reproduce and Observed Behavior

The code only contains a single simple output line:

const std = @import("std");

pub fn main() !void { std.debug.print("hello!\n", .{}); }

The error message is as follows:

unexpected errno: 38
C:\Users\Cheney\.zvm\0.13.0\lib\std/debug.zig:197:31: 0x1037c2f in dumpCurrentStackTrace (abc)
C:\Users\Cheney\.zvm\0.13.0\lib\std/posix.zig:7320:40: 0x103874e in unexpectedErrno (abc)
C:\Users\Cheney\.zvm\0.13.0\lib\std/posix.zig:6991:45: 0x1036fcc in getrlimit (abc)
C:\Users\Cheney\.zvm\0.13.0\lib\std/start.zig:448:51: 0x1035116 in expandStackSize (abc)
C:\Users\Cheney\.zvm\0.13.0\lib\std/start.zig:435:24: 0x1034710 in posixCallMainAndExit (abc)
C:\Users\Cheney\.zvm\0.13.0\lib\std/start.zig:266:5: 0x10344b1 in _start (abc)
???:?:?: 0x0 in ??? (???)
hello!

Expected Behavior

There is no error.

Rexicon226 commented 1 month ago

Zig supports a minimum linux kernel of 4.19, and CentOS 5.8 would be using something like 2.8 or 3.5?

gaogaogoo commented 1 month ago

I checked with uname -r, and the version displayed is 2.6.18-308.el5. Well, it’s a bit frustrating that this version is not supported.

gaogaogoo commented 1 month ago

With the same output, I built one using Rust and didn't encounter any errors. I hope that the Linux kernels supported by Rust can also be supported by Zig.

daurnimator commented 1 month ago

@gaogaogoo you need to target ancient kernels explicitly in zig:

$ cat 20959.zig 
const std = @import("std");

pub fn main() !void {
    std.debug.print("hello!\n", .{});
}
$ zig build-exe -target x86_64-linux.2.6.18-gnu 20959.zig 
$ ./20959 
hello!

note that you may also have to explicitly target an old lib depending on your system. https://distrowatch.com/table.php?distribution=centos tells me centos 5.11 uses glibc 2.5, so you probably want a target of x86_64-linux.2.6.18-gnu.2.5


Aside, why on earth are you running/writing new software to work on a system as old as centos 5.8? Centos 5.11 had an EOL in 2017, over 7 years ago!

gaogaogoo commented 1 month ago

Thank you for your reply. Since some of our legacy services can only run on CentOS 5, and these services do not have source code and cannot be updated, we cannot rewrite them. Additionally, we need to develop applications that extend these services on CentOS 5. Unfortunately, when I used zig build-exe -target x86_64-linux.2.6.18-gnu main.zig to build the program, I encountered the same error. The issue persists even when using x86_64-linux.2.6.18-gnu.2.5.

rootbeer commented 1 month ago

Note that in your example, the hello! got printed, so your program worked. So the bug is more that you're getting a backtrace about an (ignored) error than that the error is causing real problems, correct?

Zig's lib/std/start.zig (which is invoking getrlimit) says:

                // Silently fail if we are unable to get limits.
                const limits = std.posix.getrlimit(.STACK) catch break;

So maybe that comment needs to be updated (it's not silent) --- or I'm wrong about where the failing invocation of getrlimit is....

Alternatively, looking at start.zig, this getrlimit() call is only happening if the ELF binary contains a PT_GNU_STACK directive. Maybe you can disable that as a hack-around?

In any event, you might try editing start.zig to comment out this call. If that works, we should be able to get Zig to skip the getrlimit call if it's not implemented. (Or get it to be more silent like the comment suggests.)
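For reference, the "silently fail" pattern from the start.zig excerpt above can be sketched as a standalone program. This is only an illustration, assuming Zig 0.13 and a Linux target; `stackSoftLimit` is a hypothetical helper, not something in the standard library.

```zig
const std = @import("std");

/// Hypothetical helper (not part of std): returns the soft stack limit,
/// or null if the limit cannot be queried (e.g. the syscall is missing).
fn stackSoftLimit() ?u64 {
    // Same shape as the start.zig excerpt: swallow the error instead of
    // propagating it to the caller.
    const limits = std.posix.getrlimit(.STACK) catch return null;
    return limits.cur;
}

pub fn main() void {
    if (stackSoftLimit()) |cur| {
        std.debug.print("stack soft limit: {d} bytes\n", .{cur});
    } else {
        std.debug.print("could not query the stack limit; continuing\n", .{});
    }
}
```

The `catch` only controls what the program does with the error. It does not suppress any diagnostics that the standard library itself prints before returning the error.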

daurnimator commented 1 month ago

So maybe that comment needs to be updated (it's not silent) --- or I'm wrong about where the failing invocation of getrlimit is....

That log only happens when:

pub const unexpected_error_tracing = builtin.zig_backend == .stage2_llvm and builtin.mode == .Debug;

so it should be silent in e.g. release-safe builds. I note that errno 38 is ENOSYS, i.e. the syscall is unimplemented. I assume this is due to zig requiring the non-broken rlimit syscalls. You can find the following note in man 2 getrlimit on a modern system:

   C library/kernel ABI differences
       Since glibc 2.13, the glibc getrlimit() and setrlimit() wrapper functions no longer invoke the corresponding
       system calls, but instead employ prlimit(), for the reasons described in BUGS.

       The name of the glibc wrapper function is prlimit(); the underlying system call is prlimit64().

BUGS
       In  older  Linux  kernels, the SIGXCPU and SIGKILL signals delivered when a process encountered the soft and
       hard RLIMIT_CPU limits were delivered one (CPU) second later than they should have been.  This was fixed  in
       Linux 2.6.8.

       In  Linux  2.6.x kernels before Linux 2.6.17, a RLIMIT_CPU limit of 0 is wrongly treated as "no limit" (like
       RLIM_INFINITY).  Since Linux 2.6.17, setting a limit of 0 does have an effect, but is actually treated as  a
       limit of 1 second.

       A kernel bug means that RLIMIT_RTPRIO does not work in Linux 2.6.12; the problem is fixed in Linux 2.6.13.

       In Linux 2.6.12, there was an off-by-one mismatch between the priority ranges returned by getpriority(2) and
       RLIMIT_NICE.   This  had  the  effect  that  the  actual  ceiling  for  the  nice  value  was  calculated as
       19 - rlim_cur.  This was fixed in Linux 2.6.13.

       Since Linux 2.6.12, if a process reaches its soft RLIMIT_CPU limit and has a handler installed for  SIGXCPU,
       then,  in  addition to invoking the signal handler, the kernel increases the soft limit by one second.  This
       behavior repeats if the process continues to consume CPU time, until the hard limit  is  reached,  at  which
       point  the process is killed.  Other implementations do not change the RLIMIT_CPU soft limit in this manner,
       and the Linux behavior is probably not standards conformant; portable applications should avoid  relying  on
       this  Linux-specific  behavior.   The Linux-specific RLIMIT_RTTIME limit exhibits the same behavior when the
       soft limit is encountered.

       Kernels before Linux 2.4.22 did not diagnose the  error  EINVAL  for  setrlimit()  when  rlim->rlim_cur  was
       greater than rlim->rlim_max.

       Linux doesn't return an error when an attempt to set RLIMIT_CPU has failed, for compatibility reasons.

   Representation of "large" resource limit values on 32-bit platforms
       The  glibc getrlimit() and setrlimit() wrapper functions use a 64-bit rlim_t data type, even on 32-bit plat‐
       forms.  However, the rlim_t data type used in the getrlimit() and setrlimit() system calls is a (32-bit) un‐
       signed long.  Furthermore, in Linux, the kernel represents resource limits on 32-bit platforms  as  unsigned
       long.  However, a 32-bit data type is not wide enough.  The most pertinent limit here is RLIMIT_FSIZE, which
       specifies  the  maximum  size to which a file can grow: to be useful, this limit must be represented using a
       type that is as wide as the type used to represent file offsets—that is, as wide as a 64-bit off_t (assuming
       a program compiled with _FILE_OFFSET_BITS=64).

       To work around this kernel limitation, if a program tried to set a resource limit to a value larger than can
       be represented in a 32-bit unsigned long, then the glibc setrlimit() wrapper function silently converted the
       limit value to RLIM_INFINITY.  In other words, the requested resource limit setting was silently ignored.

       Since glibc 2.13, glibc works around the limitations of the getrlimit() and setrlimit() system calls by  im‐
       plementing setrlimit() and getrlimit() as wrapper functions that call prlimit().

Try adding -lc to link libc, as that makes zig take a different approach to the getrlimit call.
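A quick way to check both points (the numeric value of ENOSYS and whether the trace is enabled for a given build) is a sketch like the following. It assumes Zig 0.13 on an x86_64 Linux target, and that `unexpected_error_tracing` is reachable as `std.posix.unexpected_error_tracing`, as the declaration quoted above suggests.

```zig
const std = @import("std");

pub fn main() void {
    // On x86_64 Linux, ENOSYS ("function not implemented") has the value 38.
    const enosys: u16 = @intFromEnum(std.os.linux.E.NOSYS);
    std.debug.print("ENOSYS = {d}\n", .{enosys});

    // Per the declaration quoted above, this is true only for Debug builds
    // that use the LLVM backend, so other build modes should print false.
    std.debug.print("unexpected_error_tracing = {}\n", .{std.posix.unexpected_error_tracing});
}
```

Building the same file with `-O ReleaseSafe` and comparing the second line against a default (Debug) build should show the difference.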

gaogaogoo commented 1 month ago

@daurnimator The "hello" was indeed printed correctly, so the program is working. It's just that the error message made me think there might be areas where Zig could be optimized. I just tried your method of linking libc, and the unexpected error disappeared. This has been very helpful to me—thank you very much!

gaogaogoo commented 1 month ago

@daurnimator I have added code to run commands on top of the 'hello' output, and am encountering an unexpected error once again. I linked libc during the build process and hope to get further assistance. The code is as follows:

const std = @import("std");

pub fn main() !void {
    std.debug.print("hello!\n", .{});

    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    const allocator = gpa.allocator();

    const result = try std.process.Child.run(.{
        .allocator = allocator,
        .argv = &[_][]const u8{"ls"},
        .max_output_bytes = 1024 * 1024,
    });
    defer allocator.free(result.stdout);
    defer allocator.free(result.stderr);

    std.debug.print("{s}\n", .{result.stdout});
}

The error message is as follows:

hello!
unexpected errno: 38
C:\Users\Cheney\.zvm\0.13.0\lib\std/debug.zig:197:31: 0x1074eb2 in dumpCurrentStackTrace (abc)
C:\Users\Cheney\.zvm\0.13.0\lib\std/posix.zig:7320:40: 0x1075836 in unexpectedErrno (abc)
C:\Users\Cheney\.zvm\0.13.0\lib\std/posix.zig:4016:45: 0x108f372 in eventfd (abc)
C:\Users\Cheney\.zvm\0.13.0\lib\std\process/Child.zig:641:41: 0x1078ace in spawnPosix (abc)
C:\Users\Cheney\.zvm\0.13.0\lib\std\process/Child.zig:242:31: 0x1043a49 in spawn (abc)
C:\Users\Cheney\.zvm\0.13.0\lib\std\process/Child.zig:396:20: 0x103ae6e in run (abc)
C:\Users\Cheney\Desktop\Me\rs-df-deploy\abc\src/main.zig:14:45: 0x103aa18 in main (abc)
C:\Users\Cheney\.zvm\0.13.0\lib\std/start.zig:524:37: 0x103bd17 in main (abc)
C:\Users\Cheney\.zvm\0.13.0\lib\libc\musl\src\env/__libc_start_main.c:95:7: 0x10ef559 in libc_start_main_stage2 (C:\Users\Cheney\.zvm\0.13.0\lib\libc\musl\src\env/__libc_start_main.c)
Unwind error at address `exe:0x10ef559` (error.AddressOutOfRange), trace may be incomplete

error: Unexpected
C:\Users\Cheney\.zvm\0.13.0\lib\std/posix.zig:7322:5: 0x107583f in unexpectedErrno (abc)
C:\Users\Cheney\.zvm\0.13.0\lib\std/posix.zig:4016:23: 0x108f37f in eventfd (abc)
C:\Users\Cheney\.zvm\0.13.0\lib\std\process/Child.zig:641:24: 0x1078ee8 in spawnPosix (abc)
C:\Users\Cheney\.zvm\0.13.0\lib\std\process/Child.zig:242:9: 0x1043a78 in spawn (abc)
C:\Users\Cheney\.zvm\0.13.0\lib\std\process/Child.zig:396:5: 0x103aead in run (abc)
C:\Users\Cheney\Desktop\Me\rs-df-deploy\abc\src/main.zig:14:20: 0x103aa42 in main (abc)

alexrp commented 1 week ago

As @daurnimator noted, error 38 is ENOSYS, i.e. the eventfd() syscall is missing from the kernel. If you're still linking libc, then it means even the libc wrapper can't provide a meaningful fallback.

In general, the Zig standard library has been written assuming a reasonably modern Linux kernel version, so you're likely to hit more and more of these issues. (For another example, we use statx() instead of stat(), fstat(), etc.) Your kernel version is just far too old - it's from 2006.

FWIW, though, non-invasive patches to support old systems are welcome.

daurnimator commented 1 week ago

As @alexrp notes, your kernel is so old it doesn't have eventfd. You'll continue to run into all sorts of issues due to the age of your kernel, though it's not impossible to slog through.

In this case, you'll want to add an extra bit around https://github.com/ziglang/zig/blob/6d2945f1fe387c55eff003ada6e72146daff10f2/lib/std/process/Child.zig#L648 so that, e.g., when the try fails, it falls back to the pipe2 case.
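A rough sketch of what such a fallback could look like, written as a standalone program rather than as a patch to Child.zig. It assumes Zig 0.13 on a Linux target; `openErrPipe` is a hypothetical name, and the real change would have to fit the surrounding code in spawnPosix.

```zig
const std = @import("std");
const posix = std.posix;

/// Hypothetical helper (not part of std): returns a { read, write } fd pair.
/// Tries eventfd first and falls back to pipe2 if eventfd fails for any reason.
fn openErrPipe() ![2]posix.fd_t {
    const fd = posix.eventfd(0, std.os.linux.EFD.CLOEXEC) catch {
        // e.g. ENOSYS on a kernel without the syscall: use a pipe instead.
        return posix.pipe2(.{ .CLOEXEC = true });
    };
    // An eventfd has no separate read and write ends, so use it for both.
    return [2]posix.fd_t{ fd, fd };
}

pub fn main() !void {
    const fds = try openErrPipe();
    defer {
        posix.close(fds[0]);
        // Avoid closing the same descriptor twice in the eventfd case.
        if (fds[1] != fds[0]) posix.close(fds[1]);
    }
    std.debug.print("read fd = {d}, write fd = {d}\n", .{ fds[0], fds[1] });
}
```

Two caveats: in Debug builds the "unexpected errno" trace is still printed when eventfd fails, even though the error is handled, and pipe2 itself was only added in Linux 2.6.27, so a 2.6.18 kernel would likely need a further fallback to plain pipe.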