crides closed this issue 5 years ago
We already store Booleans and the singleton null
value directly in the NaN value, along with the type tag to indicate that's what we're doing. We can't fit any of the other built-in types because they have additional data they need to carry with them. Range is the smallest value and it still needs room for two full doubles.
We could potentially store small strings inline — many VMs do this optimization — but it's not clear that that will give us enough performance to be worth the complexity. It might actually be better to simply intern all strings like Lua does.
Okay, I probably haven't made my point clear. I know about the singleton storage, and the objects themselves (maps, lists, ranges, etc.) are pretty big compared to a double. What I meant is that you can store the type of the Obj* (maps, lists, ranges, class objects) in the remaining 4 or 5 bits.
Ah, you mean get rid of ObjType and merge that with ValueType? That might be possible, but I think it will cause more problems than it solves. When the GC iterates through the object list, it only has the Obj, not a Value. It still needs to be able to tell what kind of object it has. So if we hoist the object type out of Obj itself, we have to figure out how it can access that. We might be able to change the links from Obj* to Value to carry that around, but it's not clear that actually solves any problems.
Smaller footprint might improve performance? But also, if you had to track the Value by storing it in the Obj, that adds weight back, so it might not be a win... the only way to know is to measure. It's an interesting idea.
It might, yes. But Obj already needs at least a couple of bits in the header for GC marking, so any benefit of removing ObjType is probably lost to padding anyway. Still worth investigating though. :)
ObjType and the GC flags should be merged into a bit field if we really are desperate for object space.
Actually, today I did a simple test. I added the following to src/cli/main.c just to check the sizes of the primitive objects:
#include "../vm/wren_value.h"
#define print_size(o) printf("sizeof(%s) -> %zu\n", #o, sizeof(o))
int main(int argc, const char* argv[])
{
print_size(Obj);
print_size(ObjClass);
print_size(ObjClosure);
print_size(ObjFiber);
print_size(ObjFn);
print_size(ObjForeign);
print_size(ObjInstance);
print_size(ObjList);
print_size(ObjMap);
print_size(ObjModule);
print_size(ObjRange);
print_size(ObjString);
print_size(ObjType);
print_size(ObjUpvalue);
...
And when I make again and run wren, I get the following (on my Ubuntu 16.04, x86_64):
sizeof(Obj) -> 24
sizeof(ObjClass) -> 64
sizeof(ObjClosure) -> 32
sizeof(ObjFiber) -> 96
sizeof(ObjFn) -> 88
sizeof(ObjForeign) -> 24
sizeof(ObjInstance) -> 24
sizeof(ObjList) -> 40
sizeof(ObjMap) -> 40
sizeof(ObjModule) -> 64
sizeof(ObjRange) -> 48
sizeof(ObjString) -> 32
sizeof(ObjType) -> 4
sizeof(ObjUpvalue) -> 48
So it does seem like we have a lot to improve...
Considering it is x86_64 it is not that bad.
Crides, don't forget those are in bytes, not bits (a double is 8 bytes). How would you improve those? As is a pointer, which is mhermier's point too; a pointer is already 8.
We can optimise the bigger objects a little by rearranging the fields. But because of modern CPU design and alignment issues, I'm not sure we can gain enough, and it may end up too CPU-specific. For a first test, maybe we can try struct packing so we have some numbers.
I don't know how much it would improve either, so maybe I'll just do a little test by packing the structs some time later and compare the results.
There is one complex way to save 2 × 8 bytes in Obj.
The idea is that the memory allocator needs to know the size of the current object and the next chunk of allocated memory in order to operate. So given a valid pointer, we can ask the memory allocator for this information back. But there are a huge number of limitations and portability considerations.
The allocator can report the wrong size if it is a slab/slub allocator, which will only report power-of-2 sizes, at some cost. The allocator might also return pointers to other allocated objects if it is a system-wide allocator.
So an allocator can still be a good solution if it mmaps huge chunks of memory and manages object lifetimes, etc. Or even if it is just a fake allocator, the memory footprint is equivalent to what we have now. But it comes with some added complexity in the memory interface, since we also need to make the allocator responsible for garbage collection.
So in addition to the current wren reallocate function, we need to have something like:
void* (*WrenAllocatorAllocate)(void* allocator, size_t size);
size_t (*WrenAllocatorAllocated)(void* allocator, void* ptr);
size_t (*WrenAllocatorCollectGarbage)(void* ptr);
void (*WrenAllocatorSetRoot)(void* allocator, void* ptr);
void (*WrenAllocatorUnsetRoot)(void* allocator, void* ptr);
In effect it means the memory management should be extracted from the current VM. It hides some parts of the allocations outside of Obj. In the worst case we still have the same memory footprint as now; in the best case it saves two pointers per Obj (if used with a custom allocator). But the biggest cons are:
All in all, it could be a good change, but it will make debugging so much more painful...
Okay, I've done my little test. It seems like we don't have to go as far as @mhermier suggested. Basically I just made some small tweaks in wren_value.h (the diff is here) and tested the results using this awk script. The tested wren script just calculates the 34th Fibonacci number, using the recursive method.
And the output of the test script is:
./wren__fpack-struct.exe : 4.1537s (total: 83.074s)
./wren__manual_pack.exe : 4.14375s (total: 82.875s)
./wren__orig.exe : 4.627s (total: 92.54s)
Note: wren__orig is built from the latest master branch; wren__manual_pack is built with the small tweaks I made; wren__fpack-struct is built with the tweaks plus the -fpack-struct option in GCC. The relevant changes are in wren_value.h; the rest of the changes might not be what the language intended, but they are not related to this test anyway. So it seems like packing the structs does give a visible boost to performance (a little more than 10%). But we still need more comprehensive test cases to draw a conclusion.
Could you do a test with just the Obj change? And try changing the order of the type and isDark members.
Sure, and here are the results (I left the executables from yesterday just for comparison):
./wren__fpack-struct.exe : 4.41645 (total: 88.329)
./wren__manual_pack.exe : 4.4341 (total: 88.682)
./wren__obj_only.exe : 4.5117 (total: 90.234)
./wren__obj_only_fps.exe : 4.2708 (total: 85.416)
./wren__orig.exe : 4.64305 (total: 92.861)
I don't know why, but the gap doesn't seem to be as big as yesterday.
Note: wren__obj_only is built with only the change to the Obj struct (the order of the structs in the source is the same as the original), and wren__obj_only_fps is built with the -fpack-struct switch.
Could you show the test? There might be some flaws in it:
Anyway, did you try to adapt the wren benchmarks to run these? (see util/benchmark.py)
Oddly the orig seems stable, so I wonder what went wrong...
This is what I put in my test script:
var fibo
fibo = Fn.new {|n|
  if (n <= 1) {
    return 1
  } else {
    return fibo.call(n - 1) + fibo.call(n - 2)
  }
}
var start = System.clock
fibo.call(34)
System.print(System.clock - start)
For the benchmark system in wren, I haven't done anything about it yet, but I'll try to adapt to it.
The results also seem weird to me, because I just added the wren__obj_only* ones today without changing the others...
Just to be sure, (re)move the new tests to see if the numbers stay the same. Just a stupid idea...
This is very interesting data, thanks! I like the idea of using smaller numbers where it makes sense to do so. I was a little worried that that could harm performance since CPUs sometimes end up having to pad out to a full word when operating on them, but that seems like it might not be the case.
@mhermier I did the tests. I don't know what happened on the first day that made the gaps that big, but here are the results:
./wren__fpack-struct.exe : 3.9655 (total: 79.31)
./wren__manual_pack.exe : 4.0299 (total: 80.598)
./wren__orig.exe : 4.156 (total: 83.12)
And just in case you also want to see how it goes if I add the new ones back in, here are those:
./wren__fpack-struct.exe : 3.95765 (total: 79.153)
./wren__manual_pack.exe : 4.0327 (total: 80.654)
./wren__obj_only.exe : 4.09605 (total: 81.921)
./wren__obj_only_fps.exe : 3.9255 (total: 78.51)
./wren__orig.exe : 4.15815 (total: 83.163)
I don't know why they all run better than yesterday, but the gaps seem to be just as big (or as small). So I guess today's tests and yesterday's should be a decent representation of how well the optimization works.
@munificent I think it would be better if you can make the structures smaller without making the code that stores/accesses the data more complicated. And it is not just about CPU words (especially if you just pass the pointers around); it might eventually come down to memory (de)allocation and the GC.
Well, the situation is a little more complex than that. While modern CPUs allow unaligned access at almost zero cost, on older ones it is a huge bottleneck. So packing is one way to see if we can get some improvement, but it has to be thought through before massively moving fields around in structs.
This is obviously not an existing issue in Wren; I just want to discuss NaN tagging a bit. I had some thoughts about how Wren could better arrange its values, and I hope this helps.
NaN tagging is a great technique for stuffing different object types into one double value. I noticed that the comments here say that if the highest mantissa bit is not set, then that NaN is a "signaling" one and is used to halt execution. But I tried it out myself using the code below, and also checked Wikipedia. It turns out that a double with just all the exponent bits set is an Infinity, and it seems like, in C at least, it doesn't stop execution when performing double arithmetic. The code:
And the corresponding output:
And what do I think about that? Well, that means in NaN values we've got everything except the exponent bits to play with, which is 53 bits. Leaving the least significant 48 bits for pointers, you still have 5 bits, which is probably enough to stuff in all the different types you have in Wren. Even if you want consecutive ones, you still have 4 bits. Anyway, putting the types directly in the values can probably speed up your type checks, since you don't need to dereference the pointer.
This is just some of my thoughts when skimming through Wren's source code, and I hope that helps.