wren-lang / wren

The Wren Programming Language. Wren is a small, fast, class-based concurrent scripting language.
http://wren.io
MIT License

Tighter object representation in Wren? #583

Closed · crides closed this 5 years ago

crides commented 6 years ago

This isn't a bug report about an existing problem in Wren; I just want to discuss NaN tagging a bit. I've had some thoughts about how Wren could arrange its values more tightly, and I hope they help.

NaN tagging is a great technique for stuffing different object types into one double value. I noticed that the comments here said that if the highest mantissa bit is not set, then that NaN is a "signaling" one and is used to halt execution. But I tried it out myself using the code below, and also checked Wikipedia. It turns out that a double with just all the exponent bits set (and a zero mantissa) is Infinity, and it seems that, in C at least, arithmetic on these values doesn't stop execution.

The code:

#include <stdio.h>

// Type-pun a 64-bit integer into a double so we can build specific bit patterns.
typedef union {
    unsigned long long i;
    double f;
} test_t;

double double_with_bits(unsigned long long i) {
    test_t t;
    t.i = i;
    return t.f;
}

int main() {
    // All exponent bits set, mantissa zero: +Infinity.
    double test1 = double_with_bits(0x7FF0000000000000ULL);
    // All exponent bits set, highest mantissa bit set: a quiet NaN.
    double test2 = double_with_bits(0x7FF8000000000000ULL);

    printf("%f\n", test1 + 1);
    printf("%f\n", test2 + 1);
    return 0;
}

And the corresponding output:

1.#INF00
1.#QNAN0

So what do I make of that? It means that in NaN values we have everything except the exponent bits to play with, which is 53 bits (the sign bit plus the 52 mantissa bits). Leaving the least significant 48 bits for a pointer, you still have 5 bits, which is probably enough to encode all the different object types Wren has. Even if you want the tag bits to be consecutive, you still have 4. Putting the types directly in the values could speed up type checking, since you wouldn't need to dereference the pointer.
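For illustration, here is a minimal sketch of the kind of encoding I mean (the masks and tag layout are made up for this example, not taken from Wren's source; it conservatively keeps the quiet bit set and uses just 3 mantissa bits for the tag):

#include <stdint.h>

// All exponent bits plus the quiet bit: any value with these bits set is a
// quiet NaN, so the rest of the bits are free payload.
#define QNAN_BITS ((uint64_t)0x7FF8000000000000ULL)
// The low 48 bits carry the pointer (enough on today's x86_64/ARM64).
#define PTR_MASK  ((uint64_t)0x0000FFFFFFFFFFFFULL)
// Mantissa bits 48..50 hold a 3-bit type tag; the sign bit could provide a 4th.
#define TAG_SHIFT 48
#define TAG_MASK  ((uint64_t)0x7 << TAG_SHIFT)

static inline uint64_t encodeObj(void* ptr, unsigned tag)
{
  return QNAN_BITS
       | ((uint64_t)tag << TAG_SHIFT)
       | ((uint64_t)(uintptr_t)ptr & PTR_MASK);
}

static inline unsigned decodeTag(uint64_t value)
{
  return (unsigned)((value & TAG_MASK) >> TAG_SHIFT);
}

static inline void* decodePtr(uint64_t value)
{
  return (void*)(uintptr_t)(value & PTR_MASK);
}

With the tag in the value itself, checking whether something is, say, a list becomes a mask-and-compare on the value instead of a load through the pointer.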

These are just some thoughts I had while skimming through Wren's source code; I hope they help.

munificent commented 6 years ago

We already store Booleans and the singleton null value directly in the NaN value, along with the type tag to indicate that's what we're doing. We can't fit any of the other built-in types because they have additional data they need to carry with them. Range is the smallest value and it still needs room for two full doubles.
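For anyone following along, that encoding looks roughly like this; the constants below are illustrative, and the real masks and tags are defined in wren_value.h:

#include <stdint.h>

typedef uint64_t Value;

// Illustrative only; see wren_value.h for the actual definitions.
#define QNAN      ((uint64_t)0x7FFC000000000000ULL)
#define TAG_NULL  1u
#define TAG_FALSE 2u
#define TAG_TRUE  3u

#define NULL_VAL  ((Value)(QNAN | TAG_NULL))
#define FALSE_VAL ((Value)(QNAN | TAG_FALSE))
#define TRUE_VAL  ((Value)(QNAN | TAG_TRUE))

// Checking a singleton is a single comparison; no heap object is involved.
#define IS_NULL(v) ((v) == NULL_VAL)
#define IS_BOOL(v) ((v) == TRUE_VAL || (v) == FALSE_VAL)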

We could potentially store small strings inline — many VMs do this optimization — but it's not clear that that will give us enough performance to be worth the complexity. It might actually be better to simply intern all strings like Lua does.
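As a rough illustration of the interning idea (toy code, not Wren's implementation; a real VM would hang the table off the VM struct, use ObjString, and hash rather than scan):

#include <stdlib.h>
#include <string.h>

// Toy intern table: a linear list of previously seen strings. A real
// implementation (Lua's, for instance) uses a hash set, but the idea is the
// same: equal strings share one allocation, so string equality checks
// become pointer comparisons.
#define INTERN_CAPACITY 1024
static const char* internTable[INTERN_CAPACITY];
static int internCount = 0;

static const char* intern(const char* text)
{
  for (int i = 0; i < internCount; i++)
  {
    if (strcmp(internTable[i], text) == 0) return internTable[i];
  }

  char* copy = strdup(text);
  if (internCount < INTERN_CAPACITY) internTable[internCount++] = copy;
  return copy;
}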

crides commented 6 years ago

Okay, I probably haven't made my point clear. I know about the singleton storage, and that the objects themselves (maps, lists, ranges, etc.) are pretty big compared to a double. What I meant is that you can store the type of the Obj* (maps, lists, ranges, class objects) into the remaining 4 or 5 bits.

munificent commented 6 years ago

What I meant is that you can store the type of the Obj* (maps, lists, ranges, class objects) into the remaining 4 or 5 bits.

Ah, you mean get rid of ObjType and merge that with ValueType? That might be possible, but I think it will cause more problems than it solves. When the GC iterates through the object list, it only has the Obj, not a Value. It still needs to be able to tell what kind of object it has. So if we hoist the object type out of Obj itself, we have to figure out how it can access that. We might be able to change the links from Obj* to Value to carry that around, but it's not clear that actually solves any problems.
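To make that concrete, the sweep phase only ever sees the intrusive list of Obj headers, something like the simplified sketch below (not the actual wren_vm.c code):

#include <stdbool.h>
#include <stddef.h>

typedef enum { OBJ_STRING, OBJ_LIST, OBJ_MAP /* ... */ } ObjType;

typedef struct sObj
{
  ObjType type;       // what kind of object this is
  bool isDark;        // GC mark bit
  struct sObj* next;  // intrusive list of every allocated object
} Obj;

// Dispatches on obj->type to release type-specific storage (stubbed here).
static void freeObject(Obj* obj) { (void)obj; }

// The sweep walks raw Obj*, so the type must be recoverable from the Obj
// itself (or from wherever the link lives), not from the NaN-boxed Value
// that pointed at it.
static void sweep(Obj** head)
{
  Obj** obj = head;
  while (*obj != NULL)
  {
    if (!(*obj)->isDark)
    {
      Obj* unreached = *obj;
      *obj = unreached->next;
      freeObject(unreached);
    }
    else
    {
      (*obj)->isDark = false;  // clear the mark for the next GC cycle
      obj = &(*obj)->next;
    }
  }
}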

ruby0x1 commented 6 years ago

Smaller footprint might improve performance? But also, say, if you had to track the Value by storing it in the Obj, that adds weight back, so it might not be a win... the only way to know is to measure. It's an interesting idea.

munificent commented 6 years ago

Smaller footprint might improve performance?

It might, yes. But Obj already needs at least a couple of bits in the header for GC marking, so any benefit of removing ObjType is probably lost to padding anyway. Still worth investigating though. :)

mhermier commented 6 years ago

ObjType and the GC flags should be merged into a bit field if we really are desperate for object space.
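Something like this, for instance (sketch only; other Obj fields are omitted):

#include <stdint.h>

// Pack the object type and the GC mark into one 32-bit word so they don't
// each occupy their own (padded) field in the header.
typedef struct sObj
{
  uint32_t type   : 8;  // plenty of room for every ObjType value
  uint32_t isDark : 1;  // GC mark bit
  struct sObj* next;    // intrusive all-objects list
} Obj;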

crides commented 6 years ago

Actually, today I did a simple test. I added the following to src/cli/main.c just to print the sizes of the primitive object structs:

#include "../vm/wren_value.h"

#define print_size(o) printf("sizeof(%s) -> %zu\n", #o, sizeof(o))

int main(int argc, const char* argv[])
{
  print_size(Obj);
  print_size(ObjClass);
  print_size(ObjClosure);
  print_size(ObjFiber);
  print_size(ObjFn);
  print_size(ObjForeign);
  print_size(ObjInstance);
  print_size(ObjList);
  print_size(ObjMap);
  print_size(ObjModule);
  print_size(ObjRange);
  print_size(ObjString);
  print_size(ObjType);
  print_size(ObjUpvalue);
...

And when I run make again and run wren, I get the following (on Ubuntu 16.04, x86_64):

sizeof(Obj) -> 24
sizeof(ObjClass) -> 64
sizeof(ObjClosure) -> 32
sizeof(ObjFiber) -> 96
sizeof(ObjFn) -> 88
sizeof(ObjForeign) -> 24
sizeof(ObjInstance) -> 24
sizeof(ObjList) -> 40
sizeof(ObjMap) -> 40
sizeof(ObjModule) -> 64
sizeof(ObjRange) -> 48
sizeof(ObjString) -> 32
sizeof(ObjType) -> 4
sizeof(ObjUpvalue) -> 48

So it does seem like we have a lot to improve...

mhermier commented 6 years ago

Considering it is x86_64, it is not that bad.

minirop commented 6 years ago

@crides, don't forget those are in bytes, not bits (a double is 8 bytes). How would you improve those?

ruby0x1 commented 6 years ago

as is a pointer, which is mhermier's point too; a pointer is already 8 bytes

mhermier commented 6 years ago

We can optimise the bigger objects a little by rearranging their members. But because of modern CPU design and alignment issues, I'm not sure we can gain enough, or that the gains won't be too CPU specific. For a first test, maybe we can test struct packing so we have some numbers.
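For example, GCC and Clang can pack a single struct without the global -fpack-struct switch; here is a quick, self-contained way to see the size difference (the struct is hypothetical, just to show the mechanism and the unaligned-access caveat):

#include <stdbool.h>
#include <stdio.h>

typedef enum { OBJ_STRING, OBJ_LIST /* ... */ } ObjType;

// Unpacked: the compiler pads `type` and `isDark` up to pointer alignment.
typedef struct sObjA
{
  ObjType type;
  bool isDark;
  struct sObjA* next;
} ObjA;

// Packed: no padding, at the risk of unaligned pointer loads on some CPUs.
typedef struct __attribute__((packed)) sObjB
{
  ObjType type;
  bool isDark;
  struct sObjB* next;
} ObjB;

int main(void)
{
  // On x86_64 this typically prints: unpacked 16 bytes, packed 13 bytes.
  printf("unpacked: %zu bytes, packed: %zu bytes\n", sizeof(ObjA), sizeof(ObjB));
  return 0;
}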

crides commented 6 years ago

For a first test, maybe we can test struct packing so we have some numbers.

I don't know how much it would improve things either, so maybe I'll just do a little test by packing the structs some time later and compare the results.

mhermier commented 6 years ago

There is one complex way to save 2 × 8 bytes in Obj.

The idea is that the memory allocator already needs to know the size of the current object and the location of the next chunk of allocated memory in order to operate. So given a valid pointer, we can ask the memory allocator for that information back. But there are a huge number of limitations and portability considerations.

The allocator can report the wrong size: a slab/slub allocator, for instance, will only report power-of-two sizes, and the query can be expensive. The allocator might also return pointers to other allocated objects if it is a system-wide allocator.

So a custom allocator can still be a good solution if it mmaps huge chunks of memory and manages object lifetimes itself, etc. Or even if it is just a fake allocator, the memory footprint is equivalent to what we have now. But it comes with some complexity added to the memory interface, since we also need to make the allocator responsible for garbage collection.
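For reference, glibc's malloc_usable_size() is an existing example of this kind of size query, and it also shows the over-reporting problem mentioned above, since it returns the size of the chunk the allocator actually handed out:

#include <malloc.h>  // malloc_usable_size() (glibc-specific, not portable)
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  void* p = malloc(24);
  // Typically prints a value >= 24, because the allocator reports the usable
  // size of the chunk it handed out, not the size that was requested.
  printf("requested 24 bytes, usable %zu\n", malloc_usable_size(p));
  free(p);
  return 0;
}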

So in addition to the current Wren reallocate function, we would need something like:

typedef void*  (*WrenAllocatorAllocate)(void* allocator, size_t size);
typedef size_t (*WrenAllocatorAllocated)(void* allocator, void* ptr);
typedef size_t (*WrenAllocatorCollectGarbage)(void* allocator);
typedef void   (*WrenAllocatorSetRoot)(void* allocator, void* ptr);
typedef void   (*WrenAllocatorUnsetRoot)(void* allocator, void* ptr);

In effect it means the memory management should be extracted from the current VM. It hides some parts of the allocation bookkeeping outside of Obj. In the worst case we still have the same memory footprint as now; at best it saves 2 pointers per Obj (if used with a custom allocator). But it has some significant cons.

All in all, it could be a good change, but it would make debugging so much more painful...

crides commented 6 years ago

Okay, I've done my little test. It seems like we don't have to go as far as @mhermier suggested. Basically I just made some small tweaks in wren_value.h (the diff is here) and tested the results using this awk script. The Wren script under test just calculates the 34th Fibonacci number recursively.

And the output of the test script is:

./wren__fpack-struct.exe : 4.1537s (total: 83.074s)
./wren__manual_pack.exe : 4.14375s (total: 82.875s)
./wren__orig.exe : 4.627s (total: 92.54s)

Note:

  1. wren__orig is built from the latest master branch; wren__manual_pack is built with the small tweak I made; wren__fpack-struct is built with the tweak plus the -fpack-struct option in GCC.
  2. If you don't want to read the test script yourself: it just runs the 3 builds 20 times each, in an interleaving pattern, then sums the times and averages them.
  3. This test is very simple, so the difference in performance is probably just due to the changes toward the top of wren_value.h. The rest of the changes might not be what the language intends, but they aren't relevant to this test anyway.

So it seems like packing the structs does give a visible boost to efficiency (a little more than 10%). But we still need more comprehensive test cases to draw a conclusion.

mhermier commented 6 years ago

Could you do a test with just the Obj change? And try changing the order of the type and isDark members.

crides commented 6 years ago

Sure, and here are the results (I left the executables from yesterday just for comparison):

./wren__fpack-struct.exe : 4.41645 (total: 88.329)
./wren__manual_pack.exe : 4.4341 (total: 88.682)
./wren__obj_only.exe : 4.5117 (total: 90.234)
./wren__obj_only_fps.exe : 4.2708 (total: 85.416)
./wren__orig.exe : 4.64305 (total: 92.861)

I don't know why, but it seems like the gap isn't as big as yesterday. Note: wren__obj_only is built with only the change to the Obj struct (the member order in the source is the same as the original), and wren__obj_only_fps is that build with the -fpack-struct switch.

mhermier commented 6 years ago

Could you show the test? There might be some flaws in it.

Anyway, did you try to adapt the Wren benchmarks to run these? (See util/benchmark.py.)

Oddly the orig seems stable, so I wonder what went wrong...

crides commented 6 years ago

This is what I put in my test script:

var fibo
fibo = Fn.new {|n|
    if (n <= 1) {
        return 1
    } else {
        return fibo.call(n - 1) + fibo.call(n - 2)
    }
}

var start = System.clock
fibo.call(34)
System.print(System.clock - start)

As for the benchmark system in Wren, I haven't done anything with it yet, but I'll try to adapt to it. The results also seem weird to me, because I just added the wren__obj_only* ones today without changing the others...

mhermier commented 6 years ago

Just to be sure, (re)move the new tests to see if the numbers stay the same. Just a stupid idea...

munificent commented 6 years ago

This is very interesting data, thanks! I like the idea of using smaller numbers where it makes sense to do so. I was a little worried that that could harm performance since CPUs sometimes end up having to pad out to a full word when operating on them, but that seems like it might not be the case.

crides commented 6 years ago

@mhermier I did the tests. I don't know what happened on the first day that made the gaps that big, but here are the results:

./wren__fpack-struct.exe : 3.9655 (total: 79.31)
./wren__manual_pack.exe : 4.0299 (total: 80.598)
./wren__orig.exe : 4.156 (total: 83.12)

And just in case you also want to see how it goes if I add the new ones back in, here are those:

./wren__fpack-struct.exe : 3.95765 (total: 79.153)
./wren__manual_pack.exe : 4.0327 (total: 80.654)
./wren__obj_only.exe : 4.09605 (total: 81.921)
./wren__obj_only_fps.exe : 3.9255 (total: 78.51)
./wren__orig.exe : 4.15815 (total: 83.163)

I don't know why they all run better than yesterday, but the gaps seem to be just as big (or as small). So I guess today's tests and yesterday's should be a decent representation of how well the optimization works.

crides commented 6 years ago

@munificent I think it would just be better if you can make the structures smaller without making the code that stores/accesses the data more complicated. And it's not just about CPU words (especially if you mostly pass pointers around); it might eventually come down to memory (de)allocation and the GC.

mhermier commented 6 years ago

Well, the situation is a little more complex than that. While modern CPUs allow unaligned access at almost zero cost, on older ones it is a huge bottleneck. So packing is one way to see if we can get some improvement, but it has to be thought through before massively moving members around in structs.