implement a fast general purpose allocator

andrewrk commented 2 years ago

Requirements:

faster than glibc malloc
faster than musl malloc
faster than mingw-w64 malloc
faster than jemalloc
faster than mimalloc

Once it has been verified for correctness, fuzz tested, and tested with various third party applications such as web browsers using LD_PRELOAD, it should become the default implementation used by std.heap.GeneralPurposeAllocator when optimization mode is ReleaseFast.

Good luck, have fun

Jarred-Sumner commented 2 years ago

I love this idea

WebKit's memory allocator is interesting https://github.com/WebKit/WebKit/blob/main/Source/bmalloc/libpas/Documentation.md

I think Zig could do something especially cool here since it knows the type being allocated at comptime

Techcable commented 2 years ago

faster than [all other mallocs]

This may be tough to do 😉

cipharius commented 2 years ago

I decided to take a look at recent publications on memory allocators and I stumbled upon this masters thesis: https://uwspace.uwaterloo.ca/handle/10012/18329

The author experimented with their own memory allocator, but besides that they did a pretty thorough study on performance of other popular and state of the art memory allocators. In the conclusion the author states that rpmalloc often landed as the fastest memory allocator.

The rpmalloc project seems pretty sensible, so it might be a good idea to bring it's design over to Zig.

https://github.com/mjansson/rpmalloc

andrewrk commented 2 years ago

I think starting with a port of rpmalloc would be an excellent strategy.

cipharius commented 1 year ago

I've been further investigating state of allocators and snmalloc caught my attention. It's up there with rpmalloc performance, but it's message passing design seems solid for multi-threaded allocation/deallocation pipelines and unlike rpmalloc it has a paper describing it's design details, so don't have to rely only on the source code.

Anyone got opinion on snmalloc?

somethingelseentirely commented 1 year ago

I feel like mesh would also be an interesting contender for this. It manages to reduce the memory footprint significantly, and feels like it would fit zig.

https://github.com/plasma-umass/Mesh https://www.youtube.com/watch?v=xb0mVfnvkp0&ab_channel=MicrosoftResearch

Edit: Probably disqualified because it heavily relies on virtual memory, which makes it unsuitable for use in wasm.

andrewrk commented 1 year ago

It's ok to use virtual memory. Wasm can use a different implementation

darowny commented 1 year ago

What is wrong with porting mimalloc? Why it can't be used?

andrewrk commented 1 year ago

mimalloc looks fine. Porting it would be a great start.

stephc-int13 commented 1 year ago

If it my belief that good engineering is finding the best tradeoff. In this spirit, we should always optimize for something. In most cases we should optimize for simplicity.

Optimizing for speed, as suggested here, is in my opinion contradictory with the nature of a general-purpose allocator.

Maybe the goal should be more clearly defined, like optimizing for reasonable speed in the worst case, or the average case, but given the wide range of possible workloads for a general-purpose allocator, even defining what could be the worst case might be difficult or impossible.

If I had to write a general-purpose allocator the design goals would be simple, slow, no surprises.

mlugg commented 1 year ago

The existing GeneralPurposeAllocator implementation basically checks those boxes: it's fairly simple, and has debugging advantages, but it's slow. But my understanding is that this implementation wouldn't be going anywhere - it'd still be used in safe release modes. The fast one, as the issue says, would specifically be for ReleaseFast builds. That means you get the best of both worlds - safety and simplicity while developing and debugging, and good runtime performance if and when you need it.

I agree the goal of speed could be better defined, but imo it's clear enough that we do need some faster GPA implementation, as the current one has performance issues which I myself have hit before (which have caused me to have to fallback to c_allocator sometimes). Having an allocator in the standard library with good performance characteristics for normal use cases seems like a pretty reasonable ask.

cryptocode commented 1 year ago

Keeping security in mind while developing the new one would be nice. There have been plenty of CVE's related to flawed heap allocators.

Some example attack vectors for inspiration:

how2heap: https://github.com/shellphish/how2heap

Owning Firefox's heap by exploiting jemalloc: https://speakerdeck.com/argp/exploiting-the-jemalloc-memory-allocator-owning-firefoxs-heap?slide=16

How to Exploit Dlmalloc Unlink: https://www.davidxia.com/2020/04/how-to-exploit-dlmalloc-unlink/

tcmalloc attacks: https://blog.infosectcbr.com.au/2019/12/attacks-on-tcmalloc-heap-allocator.html

mimalloc attacks: https://github.com/microsoft/mimalloc/issues/372

mjp41 commented 1 year ago

If you are interested in adding security related features, you might want to look at some of the research we have done on snmalloc: https://github.com/microsoft/snmalloc/blob/main/docs/security/README.md

Most of the ideas are portable, but certain design choices make that easier than others.

jedisct1 commented 1 year ago

OpenBSD's malloc may be a good starting point until we implement something more complicated.

A deep dive into OpenBSD malloc internals

It's clean and short, fits in a single file, and doesn't seem to do anything that would be difficult to port to Zig: https://github.com/openbsd/src/blob/master/lib/libc/stdlib/malloc.c

Of course, it comes with security features, separates data and metadata, and leverages randomization as much as possible.

Single-thread performance is good. Mutexes will hurt in highly concurrent scenarios, but Zig makes it easy to have per-thread allocators.

It can be replaced with something more complex later on.

/cc @omoerbeek

ghost commented 1 year ago

https://blog.chromium.org/2021/04/efficient-and-safe-allocations-everywhere.html

PartitionAlloc is Chromium’s memory allocator, designed for lower fragmentation, higher speed, and stronger security and has been used extensively within Blink (Chromium’s rendering engine). In Chrome 89 the entire Chromium codebase transitioned to using PartitionAlloc everywhere (by intercepting and replacing malloc() and new) on Windows 64-bit and Android. Data from the field demonstrates up to 22% memory savings, and up to 9% improvement in responsiveness and scroll latency of Chrome.

It does not have a lot of security features but it seems simple

rganesan commented 1 year ago

mimalloc looks fine. Porting it would be a great start.

I had a question on this. Do we absolutely need to port mimalloc to zig as opposed to a Zig wrapper? The reason I ask is mimalloc is still maintained and looks like well written C code. Any improvements and/or bug fixes will go into the upstream version and a Zig port will need to track and port over the changes.

On the flip side it doesn't look like a lot of changes are going in, so a port wouldn't be hard to maintain and has a better chance of landing as a GPA in the zig source base.

cryptocode commented 1 year ago

I had a question on this. Do we absolutely need to port mimalloc to zig as opposed to a Zig wrapper?

The new allocator is supposed to be a fast default allocator for ReleaseFast, and I assume wrapping mimalloc would require libc which sounds like a no-go.

rganesan commented 1 year ago

On Thu, Nov 24, 2022 at 3:35 PM cryptocode @.***> wrote:

I had a question on this. Do we absolutely need to port mimalloc to zig as opposed to a Zig wrapper?

The new allocator is supposed to be a fast default allocator for ReleaseFast, and I assume wrapping mimalloc would require libc which sounds like a no-go.

Oh, good point. I didn't think of that. I think mimalloc and rpmalloc seem to be the best candidates for an initial port for two reasons. They're written in C and the codebases are relatively small. I'd like to give this a shot but not duplicate the effort if anyone is already working on a port of either of these two.

Ganesan

-- Ganesan Rajagopal

mjp41 commented 1 year ago

Most allocators depend very little on libc. Even though snmalloc is written in C++ it doesn't need a C++ library at runtime. If you want to LD_PRELOAD, then you can't touch very much libc or libcxx and not run into trouble.

With snmalloc, almost all the libc dependencies are in confined to two places, a platform abstraction, and the thread local state:

This means we can run as the allocator inside a (Uni)kernel, inside SGX enclaves, and on other non-standard platforms where we can't depend on libc existing, or very much libc existing.

InKryption commented 1 year ago

I've come up with what is basically a working port of rpmalloc: https://github.com/InKryption/rpmalloc-zig-port

I still have yet to benchmark it - hopefully it should yield similar results to the source material.

The only unfortunate thing about it is that the design seems to rely pretty heavily on global state, and uses a threadlocal variable, so I'm not 100% certain on how this could integrate into GeneralPurposeAllocator (whose state exists independently per instantiation). Any feedback is welcome.

rganesan commented 1 year ago

On Fri, Dec 2, 2022 at 4:07 AM InKryption @.***> wrote:

I've come up with what is basically a working port of rpmalloc: https://github.com/InKryption/rpmalloc-zig-port

I still have yet to benchmark it - hopefully it should yield similar results to the source material.

This is nice. Thank you for doing this. I thought mimalloc would be a better choice for GPA compared to rpmalloc if only because it has additional debug capabilities (like guard pages) which would be nice for debug builds.

However mimalloc has a lot more moving parts and I thought it would be worth attempting an rpmalloc port myself. I think it would make an excellent choice for release builds.

The only unfortunate thing about it is that the design seems to rely pretty heavily on global state, and uses a threadlocal variable, so I'm not 100% certain on how this could integrate into GeneralPurposeAllocator (whose state exists independently per instantiation).

Wouldn't RPMALLOC_FIRST_CLASS_HEAPS in the original C implementation address this?

Any feedback is welcome.

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>

-- Ganesan Rajagopal

InKryption commented 1 year ago

Wouldn't RPMALLOC_FIRST_CLASS_HEAPS in the original C implementation address this?

After having spent a reasonable amount of time sleuthing the code structure, it unfortunately doesn't seem like it. First class heaps are still tied into the same general structure involving the thread local state, so ultimately I decided to omit porting that feature, as I don't see a viable way that it could be used in the Allocator interface, so it ends up just being extra overhead.

rganesan commented 1 year ago

On Fri, Dec 2, 2022 at 9:37 PM InKryption @.***> wrote:

Wouldn't RPMALLOC_FIRST_CLASS_HEAPS in the original C implementation address this?

After having spent a reasonable amount of time sleuthing the code structure, it unfortunately doesn't seem like it. First class heaps are still tied into the same general structure involving the thread local state, so ultimately I decided to omit porting that feature, as I don't see a viable way that it could be used in the Allocator interface, so it ends up just being extra overhead.

I see. Another reason first class heaps is a no go is because of the assumption that only one thread can access the heap functions at a time. We'll have to lock and that defeats the purpose of a lockfree implementation.

However, I didn't understand what's the concern with thread local state. If I understand your implementation correctly, you have already made thread_heap non-global and return it as part of the Allocator struct. I'm new to Zig, so please excuse me if I'm missing something obvious.

Ganesan

-- Ganesan Rajagopal

rganesan commented 1 year ago

On Fri, Dec 2, 2022 at 4:07 AM InKryption @.***> wrote:

I've come up with what is basically a working port of rpmalloc: https://github.com/InKryption/rpmalloc-zig-port

I still have yet to benchmark it - hopefully it should yield similar results to the source material.

The only unfortunate thing about it is that the design seems to rely pretty heavily on global state, and uses a threadlocal variable, so I'm not 100% certain on how this could integrate into GeneralPurposeAllocator (whose state exists independently per instantiation). Any feedback is welcome.

I want to revisit this question as I'm looking at wrapping mimalloc and it also relies pretty heavily on global slate. Both rpmalloc and mimalloc have dedicated page pools for different bucket sizes. They also rely on per-thread heaps for performance reasons. Maintaining a per-instance state would be quite expensive IMHO.

Is there a good argument to maintain state independently per instantiation? I can see that being useful and necessary for something like an arena allocator. But rpmalloc/mimalloc are meant to be general purpose allocators, can't they be treated as singletons?

Ganesan -- Ganesan Rajagopal

Validark commented 1 year ago

As far as I know, the best allocator is the one that best suits your particular application's needs. Hence why there are so many projects that have a custom allocator that defeats every alternative, for their project. Here's one that has yet to be mentioned: Roblox's custom Lua(u) allocator (I think the source files are here and here) that beats rpmalloc, mimalloc, jemalloc, ptmalloc and tcmalloc through domain-specific tuning (source). Thus, I think there should be an allocator which decides which strategies to use based on how it is actually used by the application.

I don't know much about allocators, but I imagine the Zig allocator should be able to determine at compile-time what the allocation workload is like by looking at allocation logs from previous runs and generating a tuned allocator specifically to meet those needs. That way, everyone gets the best possible allocator for their use-case?

cipharius commented 1 year ago

As far as I know, the best allocator is the one that best suits your particular application's needs. Hence why there are so many projects that have a custom allocator that defeats every alternative, for their project. Here's one that has yet to be mentioned: Roblox's custom Lua(u) allocator (I think the source files are here and here) that beats rpmalloc, mimalloc, jemalloc, ptmalloc and tcmalloc through domain-specific tuning (source). Thus, I think there should be an allocator which decides which strategies to use based on how it is actually used by the application.

I don't know much about allocators, but I imagine the Zig allocator should be able to determine at compile-time what the allocation workload is like by looking at allocation logs from previous runs and generating a tuned allocator specifically to meet those needs. That way, everyone gets the best possible allocator for their use-case?

This doesn't make too much sense since Zig is already designed around idea of customizable allocators. If you need a specialized allocator you can already write one and easily use it in some parts or in whole Zig project.

This thread is specifically dedicated to GeneralPurposeAllocator ReleaseFast mode, so it's supposed to be generic solution for maximum performance target, which may neglect safety features for sake of performance and work well in most cases.

kassane commented 1 year ago

Note: Not enough baggage to argue about.

I recently viewed a benchmark of memory allocations apparently performed in the Windows environment.

I chose five allocators to compare.

dlmalloc – previously popular

jemaloc – industrial strength, tuned for long running servers

mimalloc – general purpose allocator by Microsoft

rpmalloc – lockfree GPA by a now Epic Games employee

tlsf – pool allocator used in some games

https://www.forrestthewoods.com/blog/benchmarking-malloc-with-doom3/

silversquirl commented 1 year ago

I'd like to note that it's very easy to write an allocator that's much, much faster than the current GeneralPurposeAllocator. My BinnedAllocator (<200 lines, excluding tests) is an order of magnitude faster, and can trade blows with glibc malloc and even mimalloc on some benchmarks.

I'd like to do more benchmarking to get better real-world performance numbers but haven't gotten around to it. If anyone can recommend a nice allocator benchmark to translate into Zig that'd be great!

Given that it's very easy to outspeed GPA, I'd suggest adding something to the standard library that's at least somewhat usable for performance-conscious applications, even if it's not as fast as more complex options. Those more advanced allocators can be discussed and implemented later, and can replace the simpler implementation. I'm happy to PR my BinnedAllocator if that's wanted :)

silversquirl commented 1 year ago

I'd also like to suggest renaming the current GeneralPurposeAllocator to DebugAllocator and introducing a new FastAllocator (or maybe DebugGeneralAllocator and FastGeneralAllocator?). std.heap.GeneralPurposeAllocator can then refer to either DebugAllocator or FastAllocator depending on the build mode.

This allows people to choose which one to use if they want, and separates the code for the two allocators (as I doubt much of it will be shared).

mlugg commented 1 year ago

Andrew has previously mentioned that use of the term fast is to be discouraged (https://github.com/ziglang/zig/pull/13002#issuecomment-1264287533), and I agree with him; it's not a helpful term in this case, as it tells you nothing about the functionality or intended usage of the allocator (don't I ideally want all my allocations to be fast?). Renaming the current allocator to DebugAllocator probably makes sense (making it fast in release builds seems like it might be a lost cause without a lot of release-specific logic), but I think the alternative(s) should have more descriptive name(s), like your BinnedAllocator.

On that note, I think that allocator would probably be a good fit for std, and as a temporary replacement for GPA in release modes until we have a better alternative. The trivial pathological cases space-wise present in binned allocators mean that, in my opinion, they're not a great fit for std's general purpose allocator long-term, but release-mode GPA is borderline unusable in any application with a lot of allocation at the minute, so BinnedAllocator would certainly be a good temporary solution.

rganesan commented 1 year ago

This may be a good time to mention my crude zig port of mimalloc. This is a direct port of the C code that I worked over on several days last December. It's really ugly and not really meant for public consumption, but it's functional. I did this mainly as a learning exercise to do a meaningful project in Zig.

I intended to clean it up and then announce it but I just haven't gotten time to spend on it. If anyone wants to take it forward or collaborate with me to make an idiomatic zig port, you're most welcome!

https://github.com/rganesan/mimalloc-zig-port

andrewrk commented 1 year ago

As a stepping stone, the next thing needed here is a test bed, kind of like SMHasher but for memory allocation, that can tell us about the performance characteristics of a provided Allocator.

I see a lot of people getting hung up on porting an existing allocator implementation, and while I think such things have inspiration to offer, a better solution will come from writing zig code from scratch and first principles. The Allocator API that zig provides offers simplicity and perf opportunities that are not possible with libc malloc/free API, specifically with the fact that the resize function is always in place and is allowed (and encouraged) to refuse to perform the resize if it cannot be done without incurring a cost (perhaps a cost of storing metadata that otherwise might not need to be stored).

I have no problem doing this myself; I already made one for WebAssembly which I'm happy with. Of course, caring about threads and returning memory to the OS makes things more complicated, but that's manageable. Anyway I just thought it would be fun to give Zig contributors a chance to work on this, but if nobody steps up - that's OK; I'll do it myself.

But if you want to work on something that is a bit more forgiving, try to come up with a project that measures the perf of an Allocator implementation. That would be extremely valuable for the Zig ecosystem.

dweiller commented 1 year ago

As a stepping stone, the next thing needed here is a test bed, kind of like SMHasher but for memory allocation, that can tell us about the performance characteristics of a provided Allocator.

mimalloc-bench could be a good starting point - I've been using it to test/benchmark an allocator. The allocator I've been working on is here - it's already much faster than the GPA, but getting generally competitive with the allocators in the OP or working on non-linux systems will probably require some help from others.

mtoons commented 8 months ago

I wanted to start writing the testbed for allocators, but i fist want to clarify

What is a realisitic scenario for the gpa specifically? should we target all uses cases or just the ones where it makes sense to use a gpa? for example should i write test for situation where an arena allocator would be the best option?

should gpa be good in any situation or should we expect users to choose their allocators carefully?

xdBronch commented 8 months ago

i cant speak for andrew or anyone else here but i think the name general purpose allocator is self explanatory, it should be good (but not necessarily the absolute best) in every situation. no matter how good your allocator is something like an arena will be the better option in some cases but imo you shouldnt need to cherry pick allocators for regular use

mtoons commented 8 months ago

I don't think this is that simple the ability to choose your allocator is a really important feature of zig and that having an allocator that aim to complement more specialised allocator instead of a polyvalent allocator doing it's own thing have it's advantages

joadnacer commented 7 months ago

Still very much WIP but here's my implementation of a performant General Purpose Allocator conforming to Zig allocator design (no global/threadlocal vars in default configuration). https://github.com/joadnacer/jdz_allocator

It also contains some dumb benchmarks in src/bench.zig, with results in the README comparing to c_allocator, rpmalloc zig port, GPA and zimalloc.

With some more work should be a viable candidate for the std GPA.

joadnacer commented 4 months ago

mimalloc-bench results of current zig allocators (jdz, zimalloc, rpmalloc-zig + standard malloc): https://pastebin.com/QDA2UW67

SilasLock commented 4 months ago

mimalloc-bench results of current zig allocators (jdz, zimalloc, rpmalloc-zig + standard malloc): https://pastebin.com/QDA2UW67

The results from mimalloc-bench are a little bit difficult to read. No pressure or anything, but do you have any plans to make a graphical representation of the results, to match the format of the ops/ms, RSS, and VSZ graphs on the jdz_allocator readme? Those are really easy to parse, and (even if the benchmark itself isn't perfectly realistic) give the reader a great sense of how the various different allocators stack up across various metrics.

nevakrien commented 4 months ago

if we are doing a new alocator can we please fix the error handling? right now its way too involved which causes ub in release mode... like specifcly on windows if you allocate too much memory and it errors it would call kernel32.GetLastError() which would also crash... causing ub.

the allocator should really not be doing these switch statements it should just allocate memory and return a simple error. at least in release mode that what it should do

joadnacer commented 2 months ago

mimalloc-bench results of current zig allocators (jdz, zimalloc, rpmalloc-zig + standard malloc): https://pastebin.com/QDA2UW67

The results from mimalloc-bench are a little bit difficult to read. No pressure or anything, but do you have any plans to make a graphical representation of the results, to match the format of the ops/ms, RSS, and VSZ graphs on the jdz_allocator readme? Those are really easy to parse, and (even if the benchmark itself isn't perfectly realistic) give the reader a great sense of how the various different allocators stack up across various metrics.

late reply to this but have added a chart version in the README at https://github.com/joadnacer/jdz_allocator/

allocator is working fairly well and already in use in @Validark's Accelerated Zig Parser, I think it's pretty robust functionally and would recommend using JdzGlobalAllocator over GPA/c_allocator if happy with having to manually call deinitThread() for multi-threaded apps (although not necessary if threads aren't created/destroyed dynamically).

Performance-wise JdzGlobalAllocator works very well, starting to work on some more ambitious ideas to improve performance of the default JdzAllocator (which avoids requiring these calls).

ziglang / zig

implement a fast general purpose allocator #12484