ysbaddaden / gc

A garbage collector for Crystal
95 stars 6 forks source link

Crash on a loop allocating big strings #12

Closed beta-ziliani closed 2 years ago

beta-ziliani commented 2 years ago

The following loop core dumps (and works fine with libgc).

require "immix"

MIN =  1000
MAX = 32510

(MIN..MAX).each { |i|
  s = "string" * (i / 1_000).to_i # if I replace the calculation by `i`, it works
}

crystal 1.4.0 LLVM 12.0.1 Default target: x86_64-pc-linux-gnu

lib/immix built with make -B CUSTOM="-DNDEBUG"

ysbaddaden commented 2 years ago

@beta-ziliani it may be related to https://github.com/ysbaddaden/gc/issues/11#issuecomment-979162619 ; something changed in crystal at some point and this GC doesn't work anymore.

ysbaddaden commented 2 years ago

This is unrelated with #11. The GC is working, but there is a bug while allocating:

0x000055555562ac71 in GC_LocalAllocator_allocateSmall (self=0x555555849410, size=160, atomic=0) at src/local_allocator.c:121
121     void *GC_LocalAllocator_allocateSmall(LocalAllocator *self, size_t size, int atomic) {

I have to set MIN=10000 for the bug to trigger. It's forcing the allocation of lots of small objects (hundreds of the same size, then it grows by 1 byte, and so on). Removing the division indeed makes the program pass (neat, we can see the GC allocating and collecting), but we now allocate one object of a size, then another of 6 bytes larger.

GC: malloc object=0x68cfea0 size=217 actual=240 atomic=1 ptr=0x68cfeb0
==17586== Conditional jump or move depends on uninitialised value(s)
==17586==    at 0x1DEFE3: LocalAllocator_findNextHole (local_allocator.c:31)
==17586==    by 0x1DEFE3: LocalAllocator_tryAllocateSmall (local_allocator.c:115)
==17586==    by 0x1DEFE3: GC_LocalAllocator_allocateSmall (local_allocator.c:126)
==17586==    by 0x1DD713: GC_malloc_with_atomic (immix.c:74)
==17586==    by 0x1DD713: GC_malloc (immix.c:95)
==17586==    by 0x17F6D5: *GC::malloc<UInt64>:Pointer(Void) (immix.cr:22)
==17586==    by 0x14890D: __crystal_malloc64 (gc.cr:24)
==17586==    by 0x19BFB4: *Fiber::new<String, &Proc(Nil)>:Fiber (fiber.cr:88)
==17586==    by 0x157CEB: *spawn:name<String, &Proc(Nil)>:Fiber (concurrent.cr:61)
==17586==    by 0x17F77A: *GC::collect:Nil (immix.cr:40)
==17586==    by 0x158885: GC_collect (immix.cr:4)
==17586==    by 0x1DE052: GlobalAllocator_tryCollect (global_allocator.c:147)
==17586==    by 0x1DE052: GC_GlobalAllocator_nextBlock (global_allocator.c:164)
==17586==    by 0x1DEFEC: LocalAllocator_initCursor (local_allocator.c:10)
==17586==    by 0x1DEFEC: GC_LocalAllocator_allocateSmall (local_allocator.c:135)
==17586==    by 0x1DD7A3: GC_malloc_with_atomic (immix.c:74)
==17586==    by 0x1DD7A3: GC_malloc_atomic (immix.c:99)
==17586==    by 0x17F495: *GC::malloc_atomic<UInt64>:Pointer(Void) (immix.cr:26)
==17586==  Uninitialised value was created by a heap allocation
==17586==    at 0x4C31B0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==17586==    by 0x1DD4AC: GC_init (immix.c:40)
==17586==    by 0x17F705: *GC::init:Nil (immix.cr:12)
==17586==    by 0x1DD23A: *Crystal::main<Int32, Pointer(Pointer(UInt8))>:Int32 (main.cr:35)
==17586==    by 0x155D95: main (main.cr:127)
==17586==
==17586== Stack overflow in thread #1: can't grow stack to 0x1ffe801000
Invalid memory access (signal 11) at address 0x1ffe801ff8

Aaaand... this is again caused by starting the collector loop in a Fiber (see the backtrace), which is indeed a very bad moment to create this fiber, because it requires to allocate memory, which can pass sometimes (if there are some memory left), but... what if we filled it?

There is still a bug: self->next is uninitialized when LocalAllocator_findNextHole() tries to access it, when the memory is filled, and there is no more block (yet). I don't recall the immix algorithm, and the block, chunk, and hole terms are totally foreign to me, now :sob:

Maybe the fiber is completely wrong, and Immix GC itself should start a thread (completely distinct from Crystal), and switch execution to it, but if I recall correctly I switched fibers to make sure that the current context was saved on the stack (again, if I recall correctly).

ysbaddaden commented 2 years ago

I think understand what's the issue with the Fiber (not the details but the why): we can't do anything in GC.init except calling external functions that don't expect the Crystal runtime, because Crystal calls it before starting up the "runtime" (main user code) which initializes constants, globals, and everything.

Delaying is the good solution but we can't delay it until we start collecting (as HEAD is currently doing), otherwise we'd get a Fiber.new which is trying to allocate, but we're collecting, so memory may be full, so we must do it while the runtime is starting up.

Just doing @@collector = spawn { collector_loop } fixes the spawn issue (trying to collect won't try to allocate memory anymore), except that the program eventually hangs in a forever sleep (I assume @@pending isn't properly set).

ysbaddaden commented 2 years ago

The following commits fixed the issue:

There are more issues in Crystal and/or the GC, but this reproducible issue is fixed.