Closed pabloyoyoista closed 1 year ago
It's annoying that this only happens occasionally... It would help to generate a backtrace for this; you could run the generator under GDB all the time with debug symbols present and automatically generate a backtrace on error. Or use systemd-coredump for debugging, which is an awesome tool for issues like this!
Also, make sure you are on the latest appstream-generator version, 0.8.7.
I also don't see how you could get a range violation there, as the associative array access is guarded by a synchronized statement, but you could try to replace the line
synchronized (this) iconTarFiles[iconSize.toString] ~= path;
with
synchronized iconTarFiles[iconSize.toString] ~= path;
and see if that makes a difference (it changes the synchronization from being tied to just the current object to being a global lock, so nothing else will run in parallel while the following statement is executed).
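To illustrate the difference in lock scope being suggested here, below is a minimal sketch in C with pthreads. It is not the generator's D code: `struct store`, `run_workers` and `global_lock` are hypothetical names made up for this example. The per-object mutex plays the role of `synchronized (this)`, and the single process-wide mutex plays the role of the bare `synchronized` statement. As far as I understand the D specification, a bare `synchronized` uses a mutex tied to that particular statement, so treat the analogy as approximate.

```c
#include <pthread.h>

/* Hypothetical stand-in for an object that owns some shared state. */
struct store {
    pthread_mutex_t lock;   /* per-object lock, similar in spirit to synchronized (this) */
    long count;             /* shared state that every worker updates */
};

/* One process-wide lock, loosely analogous to a bare synchronized statement. */
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

struct job {
    struct store *target;
    int iterations;
    int use_global;         /* 0: lock the object, 1: lock globally */
};

static void *worker(void *arg)
{
    struct job *j = arg;
    pthread_mutex_t *m = j->use_global ? &global_lock : &j->target->lock;

    for (int i = 0; i < j->iterations; i++) {
        pthread_mutex_lock(m);
        j->target->count++;          /* the guarded update */
        pthread_mutex_unlock(m);
    }
    return NULL;
}

#define MAX_THREADS 16

/* Run up to MAX_THREADS workers against one store and return the final count. */
long run_workers(int nthreads, int iterations, int use_global)
{
    struct store s;
    struct job j = { &s, iterations, use_global };
    pthread_t tids[MAX_THREADS];
    int started = 0;

    s.count = 0;
    pthread_mutex_init(&s.lock, NULL);
    if (nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;

    for (int i = 0; i < nthreads; i++)
        if (pthread_create(&tids[started], NULL, worker, &j) == 0)
            started++;
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    pthread_mutex_destroy(&s.lock);
    return s.count;
}
```

With a single shared object both modes give the same result. The difference only shows up when several objects exist: per-object locks let updates to different objects proceed in parallel, while the global lock serializes all of them.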
Ok, thank you! systemd isn't really available in alpine, but I'll try to figure out a way to add debug symbols and get a coredump or GDB to extract a backtrace. I will report my findings! If it doesn't work, then I guess I'll follow up on the synchronization changes you mention.
Could this issue actually have been related to https://github.com/ximion/appstream-generator/issues/101 ? Can you check if that patch fixes your issue?
I have tried updating to 0.8.8, but it looks like https://github.com/ximion/appstream-generator/commit/922c2108af881c0580af169953b5359ba4544bc0 introduces a subtle test dependency on appstream >= 0.15.3. We don't have meson 0.62 in alpine, so I wonder if disabling tests would be the recommended way to go here?
I would either 1) get Meson 0.62+ into Alpine, or 2) revert the test fix for now and fix it once you have AppStream 0.15.3.
That's a bit better than disabling tests and forgetting that they are disabled - you can remove the reverted patch once AppStream 0.15.3 has landed.
People at Alpine got 0.62, so I have done some testing. I have been trying to get a backtrace with gdb, but it doesn't seem to be getting the symbols right. I have seen the problem at least once since the upgrade, though, so it might not be totally gone. I will try to keep testing this, but I am quite slow in the process due to other tasks and my lack of experience debugging something like this. Sorry for that.
Issue #101 mentions a stack size problem; this is one of the differences of musl compared to glibc: https://wiki.musl-libc.org/functional-differences-from-glibc.html#Thread-stack-size
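For reference, this is roughly what requesting a larger thread stack looks like at the pthread level, instead of relying on the libc default (which, per the page linked above, is much smaller on musl than on glibc). It is only a sketch in C: `spawn_with_stack`, `default_stack_size` and `touch_stack` are hypothetical helpers written for this example, not anything from appstream-generator or the D runtime, and the 256 KiB buffer is an arbitrary size picked for illustration.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical helper: start fn on a thread with an explicit stack size.
   Returns 0 on success or a pthread error number. */
int spawn_with_stack(pthread_t *tid, size_t stack_bytes,
                     void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    rc = pthread_attr_setstacksize(&attr, stack_bytes);
    if (rc == 0)
        rc = pthread_create(tid, &attr, fn, arg);

    pthread_attr_destroy(&attr);
    return rc;
}

/* Report the stack size a freshly initialised attribute object carries,
   i.e. what a thread gets when no size is requested. Returns 0 on failure. */
size_t default_stack_size(void)
{
    pthread_attr_t attr;
    size_t size = 0;

    if (pthread_attr_init(&attr) == 0) {
        pthread_attr_getstacksize(&attr, &size);
        pthread_attr_destroy(&attr);
    }
    return size;
}

/* Example worker with a deliberately large stack frame (illustrative size). */
void *touch_stack(void *arg)
{
    volatile char buf[256 * 1024];

    /* Touch one byte per 4 KiB so the whole frame is actually used. */
    for (size_t i = 0; i < sizeof buf; i += 4096)
        buf[i] = 1;
    return arg;
}
```

Calling `spawn_with_stack(&tid, 8u * 1024 * 1024, touch_stack, NULL)` and then joining the thread should behave the same on both libcs, whereas starting `touch_stack` with default attributes may overflow the stack wherever the default is smaller than the frame. Printing `default_stack_size()` on the affected system is a quick way to see what the default actually is.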
Just to follow up: I have had the generator running under gdb with the following script for more than a week, and still no crashes. I am sharing it here because I am not sure whether I am doing something wrong...
handle SIGUSR1 nostop noprint
handle SIGUSR2 nostop noprint
catch signal SIGSEGV
command 1
backtrace full
shell touch /cache/export/logs/$(date "+%Y%m%d").fail
end
run
quit
This looks pretty much like what I was doing a long time ago on Ubuntu, so I think your gdb commands are fine - it's just weird that the crashes are gone then!
Ok, thank you! Let's see if I manage to capture it. Otherwise, I guess blindly increasing the stack size as Alexei pointed out could be an option...
I'm getting random crashes too, here's a part of the backtrace:
Thread 14 received signal SIGBUS, Bus error.
Object-specific hardware error.
[Switching to LWP 107787 of process 91986]
0x0000000800d8dad9 in ?? () from /usr/local/lib/libarchive.so.13
(gdb) bt
#0 0x0000000800d8dad9 in () at /usr/local/lib/libarchive.so.13
#1 0x0000000800d88268 in () at /usr/local/lib/libarchive.so.13
#2 0x0000000800d87bf5 in () at /usr/local/lib/libarchive.so.13
#3 0x0000000800d87c7a in () at /usr/local/lib/libarchive.so.13
#4 0x00000000004e040b in _D5asgen8zarchive19ArchiveDecompressor8readDataMFAyaZAxh (this=0x2b, fname=...) at ../src/asgen/zarchive.d:292
this=0x2b looks suspicious; however, going a frame up and printing the address of the ArchiveDecompressor variable shows the correct one.
Just a note: the crash doesn't happen during the Scanned ... phase, only during the Processing ... phase.
It's been a while (around a year, I see now), and I finally got access to a highly-parallel container to troubleshoot the crashes we were seeing in alpine. It has now run 3 or 4 complete iterations, including one where most of the data had to be re-generated. Part of the fix was certainly https://github.com/ximion/appstream/pull/484, but also #114.
So I'm closing this, thanks a lot everybody for the help. I'll open a new bug if we start seeing this again.
Very neat! Thank you a lot for working on this and looking into it!
I have a testing setup for generating the appstream data using appstream-generator in alpine, and I am seeing some crashes during parallel operations. These errors happen rarely, and I don't have a good reproducer or a clear idea of which packages were being processed by the generator when this happened. I know that opening issues of the kind "this isn't working!" is really not good, so my goal is more to ask how I could debug this, or what would be needed to narrow the error down. I am also happy to help debugging in any way possible.
Error output looks like this:
Edit: there seems to be another variation of the crash. Unfortunately, I still didn't manage to get a core dump: