wren-lang / wren

The Wren Programming Language. Wren is a small, fast, class-based concurrent scripting language.
http://wren.io
MIT License
6.91k stars 552 forks source link

wrenEnsureSlots corruption/crash #510

Closed sponge closed 5 years ago

sponge commented 6 years ago

I'm running into a weird crash/memory corruption issue where calling I'm calling a class constructor from C++, which calls into Wren, which then calls into a native function. This native function is then calling wrenEnsureSlots with a large enough amount that it triggers a reallocation. After that, Bad Stuff seems to happen, and I either end up with a crash in a random unrelated place, or the VM thinks it has a random negative number amount of slots, and fails.

I don't have a standalone example here, but I can probably try and create one if needed. Here's roughly what I'm doing:

    // make sure we can find a new Game class
    wrenEnsureSlots(vm, 1);
    wrenGetVariable(vm, "main", "Game", 0);
    WrenHandle *gameClass = wrenGetSlotHandle(vm, 0);

    if (gameClass == nullptr) {
        trap->Error(ERR_FATAL, "couldn't find Game class");
        return nullptr;
    }

    // instantiate a new Game
    wrenEnsureSlots(vm, 2);
    wrenSetSlotHandle(vm, 0, gameClass);
    wrenSetSlotString(vm, 1, constructorStr);
// wrenGetSlotCount returns 2
    wrenCall(vm, newHnd);
// wrenGetSlotCount returns a random negative number
    wrenReleaseHandle(vm, newHnd);
    wrenReleaseHandle(vm, gameClass);

The Game.new constructor then calls into a native function that returns a large amount of data: var objs = TileMap.objectsInLayer(objectLayer)

The handler for objectsInLayer is a complex function that returns a potential large amount of data, however I've simplified it down to just a wrenEnsureSlots call that will still break things:

void wren_map_getobjectsinlayer(WrenVM *vm) {
    wrenEnsureSlots(vm, 20);
}

I followed this into wrenEnsureStack, and as long as it returns out at if (fiber->stackCapacity >= needed) return; everything works OK, but as soon as I cross that threshold, I start running into problems.

If I change the initial call to wrenEnsureSlots(vm, 2); to a larger number, avoiding the reallocation once I'm inside wrenCall, everything seems to be OK.

ruby0x1 commented 6 years ago

I'm not sure it's related, but I had to up WREN_MAX_TEMP_ROOTS which runs out, and then collections happen mid-import statement (and at other various times). Sometimes the VM uses it's own systems (like a map of modules) in the C side, and if that gets collected because it failed to become a root, the memory would have been used for something else and it pulls the rug, so you get random corruption and bogus values.

Try pushing this way up like 100 as a quick test, and also check the amount of default memory allocated, i've seen it happen where the gc is over zealous during initial construction phase, changing how much memory and when to collect are a good test for whether it's GC getting in there.

munificent commented 6 years ago

Ooh, nasty. It definitely sounds like something is getting collected that shouldn't be. One thing you can try to help diagnose this is in wren_common.h, set this:

#define WREN_DEBUG_GC_STRESS 1

Then run your program. That flag makes the allocator run a full GC before every allocation, so it might help find the failure sooner. You can also enable this:

#define WREN_DEBUG_TRACE_GC 0

That logs every freed object during a GC, which might help you track down which object is being collected that shouldn't be. It tends to be really chatty, though, so this might not be super helpful.

I assume you have asserts enabled, right? If not, definitely enable those. That will tell you if WREN_MAX_TEMP_ROOTS is too low and you're overflowing it. It shouldn't be — the VM is designed such that there is a maximum number of temporary roots that can ever be needed at one point in time — but there may be bugs.

If you put together a repro, I can take a look.

sponge commented 6 years ago

Here's a one file that is triggering the behavior I'm seeing.

When FOREIGN_ENSURE_SLOTS (5) is greater than CONSTRUCTOR_ENSURE_SLOTS (2), I get output like:

slot count before call: 2
slot count after call: -1560

but when CONSTRUCTOR > FOREIGN, I get:

slot count before call: 5
slot count after call: 1

This sample is probably too simple to cause the crashes and weird behavior I was seeing, but the slot issue is definitely happening

#include "vm/wren.h"
#include <cassert>

#define FOREIGN_ENSURE_SLOTS 5
#define CONSTRUCTOR_ENSURE_SLOTS 2

void wren_foreign(WrenVM *vm) {
    wrenEnsureSlots(vm, FOREIGN_ENSURE_SLOTS);
    wrenSetSlotNewList(vm, 0);
    for (int i = 1; i < FOREIGN_ENSURE_SLOTS; i++) {
        wrenSetSlotDouble(vm, i, i);
        wrenInsertInList(vm, 0, -1, i);
    }
}

WrenForeignMethodFn wren_bindForeignMethodFn(WrenVM* vm, const char* module, const char* className, bool isStatic, const char* signature) {
    return wren_foreign;
}

static void wren_error(WrenVM* vm, WrenErrorType type, const char* module, int line, const char* message) {
    printf("%s:%i - %s\n", module, line, message);
}

int main()
{
    WrenConfiguration config;
    wrenInitConfiguration(&config);
    config.errorFn = wren_error;
    config.bindForeignMethodFn = wren_bindForeignMethodFn;

    WrenVM *vm = wrenNewVM(&config);

    WrenInterpretResult res = wrenInterpret(vm,
        "class Engine { \
            foreign static testfn() \
        }");
    assert(res == WREN_RESULT_SUCCESS);

    res = wrenInterpret(vm,
        "class Game { \
            construct new(mapName) { \
                Engine.testfn() \
            } \
        }");

    assert(res == WREN_RESULT_SUCCESS);

    // make sure we can find a new Game class
    wrenEnsureSlots(vm, 1);
    wrenGetVariable(vm, "main", "Game", 0);
    WrenHandle *gameClass = wrenGetSlotHandle(vm, 0);

    if (gameClass == NULL) {
        abort(0);
    }

    WrenHandle *newHnd = wrenMakeCallHandle(vm, "new(_)");

    // instantiate a new Game
    wrenEnsureSlots(vm, CONSTRUCTOR_ENSURE_SLOTS);
    wrenSetSlotHandle(vm, 0, gameClass);
    wrenSetSlotString(vm, 1, "constructor string");

    // wrenGetSlotCount returns 2
    printf("slot count before call: %i\n", wrenGetSlotCount(vm));
    wrenCall(vm, newHnd);
    // wrenGetSlotCount returns a random number
    printf("slot count after call: %i\n", wrenGetSlotCount(vm));

    wrenReleaseHandle(vm, newHnd);
    wrenReleaseHandle(vm, gameClass);

    return 0;
}
mhermier commented 6 years ago

I think you hit a fragility in the code I'm aware of. In current form the vm doesn't like the user to resize the stack from foreign. https://github.com/munificent/wren/blob/7487eeab49841de59faf293409d4e3a6098295ed/src/vm/wren_vm.c#L370 and https://github.com/munificent/wren/blob/7487eeab49841de59faf293409d4e3a6098295ed/src/vm/wren_vm.c#L665 makes the usage of wrenEnsureSlots unrelyable from within a foreign. I would suggest you to call wrenEnsureSlots from main with a huge value before any wrenCall until @munificent fixes it.

mhermier commented 6 years ago

At first I was thinking to let @munificent bite the bullet and come with its solution (because I have patch for that, required for reentrancy). But thinking at it a second time, maybe its time to hit the bullet and explain what is happening behind the scene and get things properly done, even if it might impact speed, and force to revise the API to get speed back.

So what is WrenVM::apiStack: it is a trick to avoid to pass/compute stackStart (see wrenInterpret or CallFrame) all around. The problem here is that we try to do a little of reantrancy here and we get a corrupt stack as a result. And there is another possibility to mess the things even more if we managed the vm to change of fiber.

The correct way to solve this would be to move/rename WrenVM::apiStack to ObjFiber::stackStart, that way we can fix the stackStart wren calling wrenEnsureSlot. Doing this has 2 flows:

  1. It impact speed. Since now to reach the apiStack we need to do vm->fiber->stackStart instead of vm->apiStack this extra indirection become quite costly in the public API.
  2. It doesn't solve anything, because we still can't remember previous stack base. But there is a solution, changing CallFrame::stackStart to CallFrame::oldStackStart, so that it remembers the previous stackStart and not the current one since it is now available in the fiber. That way we only have to do a CallFrame push and pop before these problematic spots fixed.

It is a quite costly solution, this is why I did not wanted to pollute @munificent ideas with it and see what solution he can came up with. But with me having toyed so much with the code, it sounds to me to be the only proper solution.

Now if we go that way, and want to restore speed back, we have to expose ObjFiber to the public API, but we have to change the semantic of some API calls. Incidentally Fibers should somehow now about their VM, so that we don't have to pass VM pointers all around the C stack. But it resolve the indirection speed issue, and even Fiber::stackTop become more cheap to reach, so it means that frame security* might be done cheaply and be done even in release, and not in debugging only.

munificent commented 6 years ago

@mhermier, your understanding of the problem is (as usual!) correct.

I'm hoping to avoid re-entrancy in the VM for a number of reasons:

  1. It makes it harder for host app developers to reason about what it means to "return" from calling Wren code. They may need to return through multiple levels of C calls before the host app actually regains control.

  2. In particular, 1 interacts in really confusing ways with fibers in Wren where the VM wants to return control to the host app while Wren calls are themselves suspended.

  3. In particular, I'm worried 2 will end up with users overflowing the C stack without realizing it by using Wren fibers in ways that would be fine in pure Wren but will continue to build up C stack frames if they aren't careful.

Overall, I think it's just a simpler, cleaner mental model if Wren manages its own stacks entirely independently of the C stack and you don't have multiple interleaved native C function calls and wrenInterpret() calls on the stack.

But, looking at this issue, I think it's pretty clear that reasonable users expect to be able to:

  1. Use wrenCall() to start the VM executing some code. This is the main expected way to pass control to Wren from a host application.

  2. In that Wren code, create new instances of classes, some of which may be foreign classes, which in turn:

  3. Call into the foreign constructor.

  4. Which can then use the host API as normal even though we are in the middle of a wrenCall() as well.

So I think some level of re-entrancy will be required. 😧

Like you note, I'm really worried about the speed impact of changing how stackTop and apiStack work. The solution you suggest sounds reasonable to me and is probably the right way to go about it. I need to think this through and reinvent the solution myself to make sure I fully understand the implications, but my hunch is I'll end up at the same place your are already at.

Now if we go that way, and want to restore speed back, we have to expose ObjFiber* to the public API, but we have to change the semantic of some API calls. Incidentally Fibers should somehow now about their VM, so that we don't have to pass VM pointers all around the C stack.

I'm not sure about this. The VM does know the current fiber, so we can already get from a WrenVM to the relevant fiber. I think you're really asking for trouble if you try to switch fibers in the middle of re-entrant use of the VM, so the current setup may be adequate where we pass around a VM and from there get to the fiber.

It's been a little while since I've had all of this code loaded into my head but I'll put some time into this and try to get it sorted it out. The vm->apiStack always felt a little inelegant to me, so this might be a good opportunity to do something better.

mhermier commented 6 years ago

I just thought about a extra solution to try to mitigate the problem while still going ObjFiber::stackStart.

When entering the foreign method, a copy of ObjFiber::stackStart (and also possibly ObjFiber::stackTop) could be cached in WrenVM. Since theses value should change on slow paths, refreshing the cached values should not impact much, while we still remove the extra indirection.

Speaking of that wrenEnsureSlots issues, we forget cleanup the garbage from the stack before growing it. It means there is a possibitly that the gc gets called with possibly destroyed values references. So we need to do:

   wrenEnsureStack(vm, vm->fiber, needed);

+  for (Value* i = fiber->stackTop; i < fiber->apiStack + numSlots; i++)
+    *i = NULL_VAL;

   vm->fiber->stackTop = vm->apiStack + numSlots;
 }

(Possibly broken since I manually edited it live to look like a diff)