yast / yast-yast2

YaST module yast2
http://en.opensuse.org/Portal:YaST
GNU General Public License v2.0
54 stars 44 forks source link

Preload libsuseconnect.so if available (bsc#1194996) #1236

Closed skazi0 closed 2 years ago

skazi0 commented 2 years ago

On aarch64 installer/YaST sometimes failed to load libsuseconnect.so with "cannot allocate memory in static TLS block" error. Loading the library before others solves the problem until a better solution is found.

skazi0 commented 2 years ago

@dgdavid done (and sorry for not reading the contributing guide before).

coveralls commented 2 years ago

Coverage Status

Coverage remained the same at 41.594% when pulling 6a14acda1b94e98da6659896e07b73de96fcbd82 on skazi0:libsuseconnect-aarch64-preload into ad8d240b5e6a76773dc4a367aa02b9ba9feb3457 on yast:master.

dgdavid commented 2 years ago

@dgdavid done

Thanks!

(and sorry for not reading the contributing guide before).

No problem at all :smiley:

yast-bot commented 2 years ago

:heavy_check_mark: Public Jenkins job #353 successfully finished :heavy_check_mark: Created OBS submit request #948941

yast-bot commented 2 years ago

:heavy_check_mark: Internal Jenkins job #190 successfully finished :heavy_check_mark: Created IBS submit request #263042

mvidner commented 2 years ago

BTW from https://github.com/pytorch/builder/issues/758#issuecomment-889974328 I guess that a possible solution is to not preallocate memory within libsuseconnect. But I don't know its implementation details.

skazi0 commented 2 years ago

@mvidner see my comments on https://bugzilla.suse.com/show_bug.cgi?id=1194996 The TLS variable is from Go runtime. I don't think we can change this part.

kobliha commented 2 years ago

It seems this PR is causing quite some issues, e.g., https://bugzilla.suse.com/show_bug.cgi?id=1196326

shundhammer commented 2 years ago

Since this causes follow-up problems, we decided to revert this.

https://bugzilla.suse.com/show_bug.cgi?id=1196326

https://bugzilla.suse.com/show_bug.cgi?id=1194996#c11

The workaround for this was to set LD_PRELOAD in the YaST start script if libsuseconnect.so was found, but that causes new problems: This is inherited by child processes, and now every binary started from YaST also gets this LD_PRELOAD, so they are now also force-linked to libsuseconnect.so.

YaST uses external programs extensively; among lots of other things, for storage probing. That means that in many cases, storage probing now fails because of this; mostly due to out of memory because libsuseconnect.so (and the Go runtime libs that it uses?) consumes a lot more RAM.

Not only are we always on the verge of OOM anyway (and we are at the end of the food chain after the kernel, kernel modules, kernel firmware, glibc, libQt and dozens of others), but who knows what other problems this causes. This needs to stop.

It has been suggested that we should clean up LD_PRELOAD again after the YaST start. But after an internal discussion, we decided against that: It would be adding a hack on top of a hack; a workaround for a workaround. And that might create even more problems.

The original problem appears in a low-memory scenario only on aarch64; and now the first hack spread the problem to other architectures as well.

This clearly needs a real solution, not adding hacks on top of hacks. The original problem has been known since early 2019; it even made it to Stack Overflow, where they suggest do set LD_PRELOAD as a workaround.

After almost 3 years (maybe longer), there must be a real fix for this. It really can't be an acceptable solution to spread those workarounds all over all kinds of software; other projects will have the same problem as well.

LD_PRELOAD is a debugging tool for desperate situations, a last-ditch effort; it can't be the regular way to go. But recommendations on the web seem to indicate just that for the aarch64 platform.

It doesn't help the Open Source world if we keep fighting the symptoms and disguise the real problem; this basically forces everyone who ever uses any Golang-based lib to do the same hacky approach, and most of them won't even be aware of follow-up problems for child processes.

That is why we are going to revert the LD_PRELOAD change. This needs a true fix.