zlib-ng / zlib-ng

zlib replacement with optimizations for "next generation" systems.
zlib License

SPARC64 crash with Ubuntu 22.04 and gcc 11.3.0 #1379

Open mtl1979 opened 1 year ago

mtl1979 commented 1 year ago

All tests fail or crash in Ubuntu 22.04 when using gcc 11.3.0 and qemu. When building with gcc 9.4.0, all tests pass.

nmoinvaz commented 1 year ago

I confirmed the issue. It appears to crash even with a simple main() that just does return 0;. I upgraded my WSL to Ubuntu 22 and tried GCC 11.3.0, 10.4.0 and 9.5.0, and they all segfault.

nmoinvaz commented 1 year ago

@mtl1979 do you want to close this issue?

mtl1979 commented 1 year ago

@nmoinvaz Like with the little-endian 64-bit PowerPC, this might be more useful converted to a discussion, so we don't need to create another one when qemu is eventually fixed... I'm not sure yet whether the underlying issue is exactly the same block of code for both architectures or whether there is just some overlap.

Like I said to @Dead2 elsewhere, downgrading to the "ubuntu-20.04" runner should be a temporary solution -- to avoid delaying the next stable version for too long -- until the real issue is fixed and new packages have been uploaded.

KungFuJesus commented 1 year ago

If there is a particular issue you want me to test or explore, I do have an UltraSPARC T4 with Solaris 11.2 (or maybe it was 11.3) that I can test this on. I'm fairly certain the T4 is EOL from Oracle at the moment, but it's probably a close enough approximation of newer variants.

mtl1979 commented 1 year ago

@KungFuJesus If it's a qemu bug or regression, testing on real hardware doesn't make sense. If it's a gcc issue, we need to know which flag is missing or incorrect.

KungFuJesus commented 1 year ago

FWIW gtests pass with flying colors on OpenIndiana on an ancient Sun Fire V240. Though, the symbol versioning doesn't quite seem to be supported with those arguments to the linker, even with GNU's ld. We should probably look into that.

mtl1979 commented 1 year ago

> FWIW gtests pass with flying colors on OpenIndiana on an ancient Sun Fire V240. Though, the symbol versioning doesn't quite seem to be supported with those arguments to the linker, even with GNU's ld. We should probably look into that.

You might want to create another issue about the symbol versioning issue with all the relevant logs.

KungFuJesus commented 1 year ago

So, a somewhat interesting revelation: tests are passing, but when I build with -mcpu=native, I do get segfaults. To my surprise, the segfaults are in the benchmark library, in some C++ string allocations:

Loading modules: [ libc.so.1 ld.so.1 ]
> ::stack
libc.so.1`realfree+0x38(100287890, 0, 0, ffffffff7fffea68, 1, 0)
libc.so.1`cleanfree+0x5c(0, 0, 10028ad90, 10028adb0, 0, c00)
libc.so.1`_malloc_unlocked+0x80(1002874d0, 10028ad50, 10028ad70, 10028ad90, 10028adb0, 0)
libc.so.1`malloc+0x3c(1f, 10028ad30, 10028ad50, 10028ad70, 10028ad90, 0)
libstdc++.so.6.0.29`_Znwm+0x18(1f, ffffffff7fffd638, 1e, 10028ad50, 10028ad70, 1f)
libstdc++.so.6.0.29`_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE7reserveEm+0x30(ffffffff7fffd858, 1b, 10028ad10, 10028ad30, 10028ad50, ffffffff7fffd868)
_ZN9benchmark12_GLOBAL__N_14joinIJNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_S7_S7_S7_S7_S7_S7_EEES7_cDpRKT_+0x6c(ffffffff7fffd858, 2f, 10028acd0, 10028acf0, 10028ad10, 10028ad30)
_ZNK9benchmark13BenchmarkName3strB5cxx11Ev+0x74(ffffffff7fffd858, 10028acd0, 10028ad30, 10028ad10, 10028acf0, 10028acd0)
_ZN9benchmark8internal15BenchmarkRunner13DoNIterationsEv+0xc0(ffffffff7fffdc70, 100286820, 0, ffffffff7fffea68, 1, 10026ac10)
_ZN9benchmark8internal15BenchmarkRunner15DoOneRepetitionEv+0xd0(100286820, 12, ffffffff7fffea68, 3ce7f2, 0, 100287440)
_ZN9benchmark8internal12_GLOBAL__N_113RunBenchmarksERKSt6vectorINS0_17BenchmarkInstanceESaIS3_EEPNS_17BenchmarkReporterES9_+0xa08(ffffffff7ffff6e0, 100285410, 1000, 1000, 100285410, 100131290)
_ZN9benchmark22RunSpecifiedBenchmarksEPNS_17BenchmarkReporterES1_NSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x528(0, 0, ffffffff7ffffa40, 0, 0, 0)
_ZN9benchmark22RunSpecifiedBenchmarksEv+0x34(ffffffff7ffffb90, ffffffff7ffffc78, 1000c281c, 1000ba418, ffffffff6f745f28, ffffffff6f7460f8)
main+0x3c(0, ffffffff7ffffc78, ffffffff7ffffc88, ffffffff6f745660, 18, 145998)
_start_crt+0x7c(1, ffffffff7ffffc78, ffffffff6f61d960, 0, 0, 0)
_start+0x14(0, 0, 0, 0, 0, 0)

I should also mention this is on an illumos-derived OpenIndiana with a barely supported configuration at the moment, not Solaris proper. I do wonder what we'd see on Solaris 11 on the T4. It almost looks like a glibc C++ bug.

mtl1979 commented 1 year ago

@KungFuJesus I would assume heap corruption... I've seen that happen when an unknown or bad instruction doesn't trap and instead gets decoded as an unrelated instruction. If I'm correct, the real issue happens earlier than the function on the stack trace just below libstdc++.so.6.0.29.
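The point that the corruption happens earlier than the frame where the crash is reported can be illustrated with a small guard-byte wrapper. This is only a sketch under assumptions: the helpers guarded_alloc, guarded_check and guarded_free are hypothetical names invented here, not part of zlib-ng, libc or the benchmark library. It shows how a one-byte overrun goes unnoticed at the time of the write and is only detected when the guard region is inspected later.

```c
#include <stdlib.h>
#include <string.h>

#define GUARD_SIZE 16     /* bytes of padding on each side of the user area */
#define GUARD_BYTE 0xA5   /* pattern we expect to find untouched */

/* Hypothetical helper: layout is [size_t size][front guard][user bytes][back guard]. */
static unsigned char *guarded_alloc(size_t size) {
    unsigned char *raw = malloc(sizeof(size_t) + GUARD_SIZE + size + GUARD_SIZE);
    if (raw == NULL)
        return NULL;
    memcpy(raw, &size, sizeof(size));                    /* remember the user size */
    unsigned char *front = raw + sizeof(size_t);
    unsigned char *user = front + GUARD_SIZE;
    memset(front, GUARD_BYTE, GUARD_SIZE);               /* guard before the buffer */
    memset(user + size, GUARD_BYTE, GUARD_SIZE);         /* guard after the buffer */
    return user;
}

/* Returns 1 if both guard regions are intact, 0 if something wrote outside the buffer. */
static int guarded_check(const unsigned char *user) {
    const unsigned char *front = user - GUARD_SIZE;
    size_t size;
    memcpy(&size, front - sizeof(size_t), sizeof(size));
    const unsigned char *back = user + size;
    for (size_t i = 0; i < GUARD_SIZE; i++) {
        if (front[i] != GUARD_BYTE || back[i] != GUARD_BYTE)
            return 0;
    }
    return 1;
}

/* Releases the whole block, including the size header and both guards. */
static void guarded_free(unsigned char *user) {
    if (user != NULL)
        free(user - GUARD_SIZE - sizeof(size_t));
}
```

A real allocator keeps its own bookkeeping where this sketch keeps guard bytes, which is why damage done by one function can plausibly surface much later inside malloc or free called from unrelated code.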

KungFuJesus commented 1 year ago

And of course, when I use umem as the allocator to track the issue down, it fails to reproduce at all. Could it be an issue in glibc's allocator on SPARC?

mtl1979 commented 1 year ago

@KungFuJesus I would assume either gcc generated a bad instruction, or glibc was built targeting too new a processor and doesn't detect that the current processor lacks an instruction it assumes is available. It might be possible to force gcc to target older processors to see which ones still run.
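When trying different -mcpu values, it may help to confirm what the compiler thinks it is targeting. The sketch below reports that from predefined macros. The function name sparc_target_summary is hypothetical, and the macro names (__sparc__, __sparc_v9__, __sparcv9, __VIS__) are what GCC is assumed to define for SPARC targets; it is worth verifying them for a given toolchain with something like `gcc -mcpu=ultrasparc -mvis -dM -E - < /dev/null`.

```c
/* Hypothetical helper: summarize the SPARC target selected at compile time.
 * The macro names are assumptions about GCC's SPARC backend; check them with
 * `gcc <flags> -dM -E - < /dev/null` before relying on the result. */
static const char *sparc_target_summary(void) {
#if defined(__sparc__) || defined(__sparc)
#  if defined(__VIS__) || defined(__VIS)
    return "sparc, VIS enabled";            /* e.g. built with -mvis */
#  elif defined(__sparc_v9__) || defined(__sparcv9)
    return "sparc v9, VIS not enabled";
#  else
    return "sparc, pre-v9 or unknown";
#  endif
#else
    return "not a SPARC target";            /* e.g. when compiled on the host */
#endif
}
```

Printing this string from a test binary built with each candidate flag set gives a quick record of which configurations were actually exercised, though it says nothing about which instructions libc itself was built to use.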

KungFuJesus commented 1 year ago

This is running on a sun4v, so I doubt we're seeing an illegal instruction. The sparcv9 ABI hasn't moved a ton, and it's certainly not generating VIS (which is what I set out to make this thing do, initially). I'm seeing some other evidence that it could be an issue somewhere in the allocator, but I have no smoking gun. I've emailed the distribution maintainer with a link to this thread; hopefully he can shed some light.

mtl1979 commented 1 year ago

I'm seeing heap corruption on PPC64LE too, so it might be just a buffer overrun or something similar. I tried switching to clang, but it still uses libraries from gcc 11 by default, unless I force it to use LLVM libc instead.

thesamesam commented 1 year ago

LLVM libc is barely a thing yet; you probably mean libc++, which is an implementation of the C++ standard library. But there are a few other libraries Clang will try to use from GCC, like the runtime lib & unwinding.

mtl1979 commented 1 year ago

@thesamesam I noticed that when I tried installing clang on a clean system, it just wouldn't work... I had to install quite a few packages from gcc to get it to behave.

https://github.com/zlib-ng/zlib-ng/blob/develop/.github/workflows/cmake.yml#L256

klausz65 commented 1 year ago

For what it's worth:

OpenIndiana on SPARC is still built using gcc-4.4.4:
CFLAGS: -mcpu=ultrasparc -mvis and the long list from Makefile.master, for both 32- and 64-bit code.
ASFLAGS (note: at present this must be the SunAS assembler configured with gcc):
32-bit: -xarch=v8plusa -xarch=sparcvis
64-bit: -xarch=v9 -xarch=sparcvis

All the oi-userland stuff is compiled using gcc-11.3.0 with GAS 2.39 or 2.40:
CFLAGS: -O3 -mcpu=ultrasparc -mvis -mfsmuld, and just recently again with -mno-app-regs to be on the safer side. I didn't really notice any performance degradation by not using this option.

mtl1979 commented 1 year ago

@klausz65 We would need a "CI run" to test which CFLAGS and CXXFLAGS allow Ubuntu 22.04 to compile usable binaries on 64-bit SPARC... Then we can determine whether the issue is in gcc or in qemu's SPARC64 emulation. As it works on Ubuntu 20.04, and there are known issues with qemu versions at least from the 6.x series upward (the 4.x series is known to work), we already have some information to narrow the research.

Personally I've worked with some compiler bugs, but I'm not familiar enough with the compiler options for SPARC64 and I don't have SPARC64 hardware to test, so I can't help much further...

glaubitz commented 1 year ago

On Jul 5, 2023, at 5:11 PM, Mika Lindqvist wrote:

> @klausz65 We would need a "CI run" to test with what CFLAGS and CXXFLAGS allows Ubuntu 22.04 to compile usable binaries on 64-bit SPARC... Then we can rule if the issue is in gcc or qemu's SPARC64 emulation. As it works on Ubuntu 20.04, and there is known issues with qemu versions at least from 6.x series upward (4.x series is known to work), we already have some information to narrow the research.
>
> Personally I've worked with some compiler bugs, but I'm not familiar enough with the compiler options for SPARC64 and I don't have SPARC64 hardware to test, so I can't help much further...

SPARC hardware for testing can be accessed through the GCC compile farm. Just ask for an account and you will be able to get SSH access to various architectures.

See: https://gcc.gnu.org/wiki/CompileFarm

mtl1979 commented 1 year ago

@glaubitz Compile farms are nice, but it's usually hard to get specific package versions on them. Like I already said, we need specific versions of both gcc and qemu. This allows us to "emulate" older hardware than what the compile farm actually uses.