ximion / appstream-generator

A fast AppStream metadata generator
GNU Lesser General Public License v3.0

Crashes during parallel tasks #100

Closed: pabloyoyoista closed this issue 1 year ago

pabloyoyoista commented 2 years ago

I have a testing setup that generates the AppStream data with appstream-generator on Alpine, and I am seeing occasional crashes during parallel operations. These errors happen rarely, I don't have a good reproducer, and I have no clear idea of which packages the generator was processing when they occurred. I know that opening issues of the kind "this isn't working!" is really not helpful, so my goal is rather to ask how I could debug this, or what would be needed to narrow the error down. I am also happy to help with debugging in any way possible.

The error output looks like this:

core.exception.RangeError@../src/asgen/engine.d(532): Range violation
----------------
??:? onRangeError [0x7fa7dcd9cc90]
??:? _d_arraybounds [0x7fa7dcd9d370]
??:? /usr/bin/appstream-generator [0x55e77d5783e0]
??:? void std.parallelism.ParallelForeach!(asgen.backends.interfaces.Package[]).ParallelForeach.opApply(scope int delegate(ref asgen.backends.interfaces.Package)).doIt() [0x55e77d57ed30]
??:? void std.parallelism.TaskPool.executeWorkLoop() [0x7fa7dd0bcbd0]
??:? thread_entryPoint [0x7fa7dcdc2d60]

edit: there seems to be another variation of the crash. Unfortunately, I still didn't manage to get a core dump:

core.exception.RangeError@../src/asgen/engine.d(532): Range violation
----------------
??:? onRangeError [0x7f5709254c90]
??:? _d_arraybounds [0x7f5709255370]
??:? /usr/bin/appstream-generator [0x55e16500a3f0]
??:? void std.parallelism.ParallelForeach!(asgen.backends.interfaces.Package[]).ParallelForeach.opApply(scope int delegate(ref asgen.backends.interfaces.Package)).doIt() [0x55e165010d40]
??:? void std.parallelism.submitAndExecute(std.parallelism.TaskPool, void delegate()) [0x7f5709575ea0]
??:? int std.parallelism.ParallelForeach!(asgen.backends.interfaces.Package[]).ParallelForeach.opApply(scope int delegate(ref asgen.backends.interfaces.Package)) [0x55e165008d60]
??:? void asgen.engine.Engine.exportIconTarballs(asgen.config.Suite, immutable(char)[], asgen.backends.interfaces.Package[]) [0x55e165009ea0]
??:? bool asgen.engine.Engine.processSuiteSection(asgen.config.Suite, const(immutable(char)[]), asgen.reportgenerator.ReportGenerator) [0x55e16500b1c0]
??:? void asgen.engine.Engine.run(immutable(char)[]) [0x55e16500c360]
??:? _Dmain [0x55e164fbee00]
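
(For context, the ParallelForeach frames in both traces come from D's std.parallelism parallel foreach. A minimal sketch of the pattern, not the actual engine code:)

import std.parallelism : parallel;
import std.stdio : writeln;

struct Package { string name; } // illustrative stand-in for asgen's Package

void main()
{
    auto pkgs = [Package("a"), Package("b"), Package("c")];

    // Each iteration may run on a different TaskPool worker thread,
    // so any shared state touched in the body must be synchronized.
    foreach (ref pkg; parallel(pkgs))
        writeln(pkg.name);
}
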
ximion commented 2 years ago

It's annoying that this only happens occasionally... A backtrace would help: you could run this under GDB all the time with debug symbols present and have a backtrace generated automatically on error. Or use systemd-coredump for debugging, which is an awesome tool for issues like this! Also, make sure you are on the latest appstream-generator version, 0.8.7.

I also don't see how you could get a range violation there, as the associative-array access is guarded by a synchronized statement. But you could try to replace the line `synchronized (this) iconTarFiles[iconSize.toString] ~= path;` with `synchronized iconTarFiles[iconSize.toString] ~= path;` and see if that makes a difference: it changes the synchronization from being tied to just the current object to a global lock, so nothing else will run in parallel while the following statement is executed.
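
(To illustrate the difference between the two forms, a minimal sketch with illustrative names, not the real asgen code:)

class IconExporter
{
    string[][string] iconTarFiles; // icon size -> tarball paths

    void addScoped(string size, string path)
    {
        // Lock tied to this object: two different instances can
        // still run the append concurrently.
        synchronized (this)
            iconTarFiles[size] ~= path;
    }

    void addGlobal(string size, string path)
    {
        // Statement-level lock: only one thread at a time can execute
        // this statement, regardless of which instance it runs on.
        synchronized
            iconTarFiles[size] ~= path;
    }
}
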

pabloyoyoista commented 2 years ago

Ok, thank you! systemd isn't really available on Alpine, but I'll try to figure out a way to add debug symbols and get a core dump or GDB to extract a backtrace. I will report my findings! If that doesn't work, then I guess I'll follow up on the synchronization changes you mention.

ximion commented 2 years ago

Could this issue actually have been related to https://github.com/ximion/appstream-generator/issues/101 ? Can you check if that patch fixes your issue?

pabloyoyoista commented 2 years ago

I have tried updating to 0.8.8, but it looks like https://github.com/ximion/appstream-generator/commit/922c2108af881c0580af169953b5359ba4544bc0 introduces a subtle test dependency on appstream >= 0.15.3. We don't have meson 0.62 in Alpine, so I wonder if disabling tests would be the recommended way to go here?

ximion commented 2 years ago

> I have tried updating to 0.8.8, but it looks like 922c210 introduces a subtle test dependency on appstream >= 0.15.3. We don't have meson 0.62 in Alpine, so I wonder if disabling tests would be the recommended way to go here?

I would either:
1) get Meson 0.62+ into Alpine, or
2) revert the test fix for now and re-apply it once you have AppStream 0.15.3.

That's a bit better than disabling the tests and forgetting that they are disabled; you can remove the reverted patch once AppStream 0.15.3 has landed.

pabloyoyoista commented 2 years ago

The Alpine folks got Meson 0.62 in, so I have done some testing. I have been trying to get a backtrace with gdb, but it doesn't seem to pick up the symbols correctly. I have seen the problem at least once since the upgrade, though, so it might not be totally gone. I will try to keep testing this, but I am quite slow in the process due to other tasks and my lack of experience debugging something like this. Sorry for that.

minlexx commented 2 years ago

Issue #101 mentions a stack size problem; this is one of the differences of musl compared to glibc: https://wiki.musl-libc.org/functional-differences-from-glibc.html#Thread-stack-size
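
(For what it's worth, D's core.thread lets you request a larger stack per thread explicitly, which sidesteps musl's small default. A minimal sketch; the 16 MiB figure is just an illustrative choice:)

import core.thread : Thread;
import std.stdio : writeln;

void deepWork()
{
    // placeholder for work that needs a deep stack
    writeln("running with a large stack");
}

void main()
{
    // musl defaults to roughly 128 KiB thread stacks vs. glibc's 8 MiB,
    // so recursion that is fine on glibc can overflow on musl.
    enum stackSize = 16 * 1024 * 1024; // request 16 MiB explicitly

    auto t = new Thread(&deepWork, stackSize);
    t.start();
    t.join();
}

(Whether std.parallelism's TaskPool workers can be given a bigger stack this way is a separate question; the sketch only shows the raw druntime knob.)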

pabloyoyoista commented 2 years ago

Just to follow up: I have had the generator running under gdb with the following script for more than a week, and still no crashes. I am sharing it here in case I am doing something wrong...

# Don't stop on the signals druntime's GC uses to suspend/resume threads
handle SIGUSR1 nostop noprint
handle SIGUSR2 nostop noprint

# Catchpoint 1: on a segfault, dump a full backtrace and drop a marker file
catch signal SIGSEGV
commands 1
  backtrace full
  shell touch /cache/export/logs/$(date "+%Y%m%d").fail
end

run
quit
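
(For reference, a script like this is usually driven non-interactively, along these lines; the asgen subcommand and the script path are assumptions:)

gdb --batch -x /etc/asgen/gdb-script --args appstream-generator process <suite>
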
ximion commented 2 years ago

This looks pretty much like what I was doing a long time ago on Ubuntu, so I think your gdb commands are fine - it's just weird that the crashes are gone then!

pabloyoyoista commented 2 years ago

Ok, thank you! Let's see if I manage to capture it. Otherwise, I guess blindly increasing the stack size, as Alexei pointed out, could be an option...

arrowd commented 1 year ago

I'm getting random crashes too; here's part of the backtrace:

Thread 14 received signal SIGBUS, Bus error.
Object-specific hardware error.
[Switching to LWP 107787 of process 91986]
0x0000000800d8dad9 in ?? () from /usr/local/lib/libarchive.so.13
(gdb) bt
#0  0x0000000800d8dad9 in  () at /usr/local/lib/libarchive.so.13
#1  0x0000000800d88268 in  () at /usr/local/lib/libarchive.so.13
#2  0x0000000800d87bf5 in  () at /usr/local/lib/libarchive.so.13
#3  0x0000000800d87c7a in  () at /usr/local/lib/libarchive.so.13
#4  0x00000000004e040b in _D5asgen8zarchive19ArchiveDecompressor8readDataMFAyaZAxh (this=0x2b, fname=...) at ../src/asgen/zarchive.d:292

this=0x2b looks suspicious; however, going a frame up and printing the address of the ArchiveDecompressor variable shows the correct one.
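
(For the record, the cross-check in gdb looks like this; `decompressor` is a hypothetical name for the caller's variable:)

(gdb) frame 4              # the readData frame in zarchive.d
(gdb) info args            # reports this=0x2b from the debug info
(gdb) up                   # switch to the calling frame
(gdb) print &decompressor  # prints the expected, sane address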

arrowd commented 1 year ago

Just a note: the crash doesn't happen during the Scanned ... phase, only during the Processing ... phase.

pabloyoyoista commented 1 year ago

It's been a while (around a year, I see now...), and I finally got access to a highly parallel container to troubleshoot the crashes we were seeing on Alpine. It has now run 3 or 4 complete iterations, including one where most of the data had to be regenerated. Part of the fix was certainly https://github.com/ximion/appstream/pull/484, but also #114

So I'm closing this. Thanks a lot, everybody, for the help. I'll open a new bug if we start seeing this again.

ximion commented 1 year ago

Very neat! Thank you a lot for working on this and looking into it!