yfractal / blog

10 stars 0 forks source link

JemMlloc #10

Closed yfractal closed 1 year ago

yfractal commented 1 year ago

Introduction

A memory allocator needs to suit hardware(multi-processor + cache line) and use memory efficiently(fast + less fragment).

By default, JeMalloc uses Thread-Specific-Data and mutiple arenas for every CPU for reducing lock contention.

For handling cache line related things, JeMalloc relies on multiple arenas and allows applications to pad the allocations.

JeMalloc allocates different size memory from different memory blocks(slab/extent) which can reduce fragmentation and locate memory quickly.

This article will introduce how JeMalloc 5.2.1[1] allocates and deallocates memory. And for simplicity, I will focus on small memory allocation and will not cover all details.

Big Picture -- The Hierarchy

We can image the memory managed in 3 levels: tcache, arena, and OS. It works like CPU cache -- memory will be allocated from a higher level, if the higher level doesn't have available memory, memory will be allocated from the next level and fill in the higher level, now the memory can be allocated.

Screen Shot 2022-09-25 at 11 31 51

tcache has many cache_bins, cache_bin's avail field will point to regions -- region is a unit of memory and will be returned when an application calls malloc.

arena has many extents/slabs, a slab will have many regions for small memory. When there is no available memory in a cache_bin, the cache_bin will try to get memory from extent/slab.bin through the current arena. If the arena doesn't have an available extent/slab too, JeMalloc will try to allocate memory from OS.

Screen Shot 2022-09-25 at 11 44 50

Now let's see how JeMalloc's every component works.

Small Memory Allocation

TCache

Screen Shot 2022-09-26 at 08 36 28

When an application allocates memory, JeMalloc will try to find available memory from tcache, tcache is thread-specific data, as the data is owned by a thread, we do not need to lock it when we use it. The thread-specific data is get by pthread_setspecific and set by pthread_getspecific.

Relate data structures are tsd, tcache and cache_bin. tsd is used for fast access tcache and others. tcache has many cache_bin, different cache_bin is used for allocating different size memory.

For example, tcache.bins_small[0] is used for allocating 0 ~ 8 bytes memory, tcache.bins_small[1] is for 9 ~ 16 bytes, the calculation is in sz_size2index_compute.

After finding the right cache_bin through size, we can find the available memory address through cache_bin.avail, it contains the same size memory(JeMalloc calls this region) addresses, the code is in cache_bin_alloc_easy.

If the cache_bin doesn't have available memory, JeMalloc needs to get memory from the next level -- arena, fill the cache_bin, and allocate the memory again. Simplified code as below:

JEMALLOC_ALWAYS_INLINE void *
tcache_alloc_small(tsd_t *tsd, arena_t *arena, tcache_t *tcache,
                   size_t size, szind_t binind, bool zero, bool slow_path) {
    // ...
    bin = tcache_small_bin_get(tcache, binind);
    ret = cache_bin_alloc_easy(bin, &tcache_success);
    assert(tcache_success == (ret != NULL));
    if (unlikely(!tcache_success)) {
        bool tcache_hard_success;
        arena = arena_choose(tsd, arena);

        ret = tcache_alloc_small_hard(tsd_tsdn(tsd), arena, tcache,
                                      bin, binind, &tcache_hard_success);
    }
    // ...

    return ret;
}

Arena

Screen Shot 2022-09-26 at 08 37 57

arena is used for managing extent, extent is a unit of memory, it may contain multiple OS pages. extent is allocated from or deallocated to OS through OS provides system call such as mmap, relates code is in src/pages.c. And small size extent is called slab in JeMalloc.

Likes tcache, arena has many bin which is used for managing different size memory for the application. bin has a slab pointer( bin.slabcur) and a collections of non-full slab/extent(bin.slabs_nonfull).

When JeMalloc trying to find available memory in a bin, JeMalloc will try bin.slabcur first, if fails it will try to get a slab in bin.slabs_nonfull and assign it to the bin.slabcur.

And if bin.slabs_nonfull has no available extent/slab too, JeMalloc will try to find the memory in arena's extents(the order is extents_dirty -> extents_muzzy -> extents_retained -> extent_avail, extents_dirty, extents_muzzy and extents_retained are used for purging, will cover in other sections). If all those do not have available extents, JeMalloc will allocate memory from the OS.

Purging

When an application frees memory, JeMalloc needs to determine how to return the memory. JeMalloc will return memory to tcache first. When tcache is full, JeMalloc will return free memory to extent, and arena will handle how and when returning the unused extents to the system.

Application -> tcache -> extent

When an application frees a small memory, the trace will be je_free -> ifree -> idalloctm -> tcache_dalloc_small.

In tcache_dalloc_small, the deallocation relates code is:

JEMALLOC_ALWAYS_INLINE void
tcache_dalloc_small(tsd_t *tsd, tcache_t *tcache, void *ptr, szind_t binind,
                    bool slow_path) {
    // ...
    if (unlikely(!cache_bin_dalloc_easy(bin, bin_info, ptr))) {
        tcache_bin_flush_small(tsd, tcache, bin, binind,
                               (bin_info->ncached_max >> 1));
        bool ret = cache_bin_dalloc_easy(bin, bin_info, ptr);
        assert(ret);
    }
    // ...
}

cache_bin_dalloc_easy will retun memory to current cache_bin. If cache_bin is full (bin->ncached == bin_info->ncached_max), JeMalloc will call tcache_bin_flush_small.

In tcache_bin_flush_small, it will return half(bin_info->ncached_max >> 1 in tcache_dalloc_small) to extent, arena_dalloc_bin_locked_impl will do this work:

static void
arena_dalloc_bin_locked_impl(tsdn_t *tsdn, arena_t *arena, bin_t *bin,
                             szind_t binind, extent_t *slab, void *ptr, bool junked) {
    // ...
    if (nfree == bin_info->nregs) {
        arena_dissociate_bin_slab(arena, slab, bin);
        arena_dalloc_bin_slab(tsdn, arena, slab, bin);
    } else if (nfree == 1 && slab != bin->slabcur) {
        arena_bin_slabs_full_remove(arena, bin, slab);
        arena_bin_lower_slab(tsdn, arena, slab, bin);
    }
    // ...
}

If nfree == bin_info->nregs, it means slab is empty, so need to call arena_dissociate_bin_slab and then arena_dalloc_bin_slab. arena_dalloc_bin_slab will append the free slab(extent) to current arena->extents_dirty by calling arena_extents_dirty_dalloc.

extent -> system

JeMalloc is a user level memory allocator, we need to consider other applications' memory usage in the current system, so JeMalloc needs to return free memory back when necessary. And after an arena returns free memory to the system, other arenas can use it too.

For returning memory to system, we need to call a system call and in the system call, the system needs to zero the memory and mark the memory as unused. If we return too much memory at one step, it may hurt performance.

And when an extent is free, the application may use it in a very short time. For such a case, we'd better not return to the system too quickly.

JeMalloc uses "decay" to handle those things. We already know free slab(extent) is appended to arena->extents_dirty by arena_extents_dirty_dalloc. Extents in arena->extents_dirty will be moved to arena->extents_muzzy and the extents in arena->extents_muzzy will be returned to the system.

Let's see how the "decay" works and then see how the extents are returned from arena->extents_dirty to arena->extents_muzzy, and then returned to the system.

Decay

JeMalloc uses a ticker to count down how many times an arena allocated/deallocated memory. And when the ticker reaches 0, it will trigger arena_decay which will calculate how long passed since the last time and call arena_decay to do purging work when needed.

The ticker is initialized by ticker_init(&arenas_tdata[i].decay_ticker, DECAY_NTICKS_PER_UPDATE); in arena_tdata_get_hard, the value of DECAY_NTICKS_PER_UPDATE is 1000.

arena_decay_ticks is used for handling tick related thing and it will be called when memory is allocated or deallocated, such as by arena_malloc_small.

For arena_decay_ticks, it will call ticker_ticks(decay_ticker, nticks) for updating the count. After subtract nticks (ticker->tick -= nticks), if the ticker->tick is less than 0, JeMalloc will reset it to 0 and return true. Then arena_decay will be called.

JEMALLOC_ALWAYS_INLINE void
arena_decay_ticks(tsdn_t *tsdn, arena_t *arena, unsigned nticks) {
    // ...
    if (unlikely(ticker_ticks(decay_ticker, nticks))) {
        arena_decay(tsdn, arena, false, false);
    }
}

After we know how the arena_decay is trigged, let's see how arena_decay works.

For high level, JeMalloc wants to return memory back smoothly, it means when JeMalloc find n unused pages, instead of returning the n pages in one time, it will return them in serval seconds(default dirty decay is 10 seconds and stored in DIRTY_DECAY_MS_DEFAULT) and in several steps(default is 200 and it is in SMOOTHSTEP_NSTEPS.)

In struct arena_decay, backlog is used for recording how many unused dirty pages were generated during each of the past SMOOTHSTEP_NSTEPS decay epochs. nunpurged is used for recording the number of unpurged pages at beginning of current epoch.

The current epoch backlog is calculated by current_npages - decay->nunpurged and backlog saves the number in reverse order, so the current epoch number is stored in the last index, see arena_decay_backlog_update_last.

And when epochs passed, JeMalloc will move the backlog forward by memmove(decay->backlog, &decay->backlog[nadvance_z], (SMOOTHSTEP_NSTEPS - nadvance_z) * sizeof(size_t));, and as time may pass more than one epoch, needs set uncalculated backlog to 0 by memset(&decay->backlog[SMOOTHSTEP_NSTEPS - nadvance_z], 0, (nadvance_z-1) * sizeof(size_t));, all those are in arena_decay_backlog_update.

JeMalloc uses Smoothstep[2] function to calculate how many pages need to remain. For example, when x = 1, y = 1, that means no pages need to be purged, and when x = 0, y = 0, that means all the pages need to be purged.

220px-Smoothstep_and_Smootherstep svg

For avoiding calculation, JeMalloc generates those values by smoothstep.sh, and the value is encoded by binary fixed point representation and then they are stored in h_steps.

arena_decay_backlog_npages_limit is used for calculating how many pages need to remain.

static size_t
arena_decay_backlog_npages_limit(const arena_decay_t *decay) {
    // ...
    /*
     * For each element of decay_backlog, multiply by the corresponding
     * fixed-point smoothstep decay factor.  Sum the products, then divide
     * to round down to the nearest whole number of pages.
     */
    sum = 0;
    for (i = 0; i < SMOOTHSTEP_NSTEPS; i++) {
        sum += decay->backlog[i] * h_steps[i];
    }
    npages_limit_backlog = (size_t)(sum >> SMOOTHSTEP_BFP);

    return npages_limit_backlog;
}

dirty -> muzzy and muzzy -> system

Now let's see how the extents are moved from dirt tomuzzy and how muzzy returns them to the system.

arena_decay will handle dirty and muzzy extents by calling arena_decay_dirty and arena_decay_muzzy. arena_maybe_decay will do the decay job for arena_decay_dirty and arena_decay_muzzy.

In arena_maybe_decay, if decay->time_ms is 0, will call arena_decay_to_limit and the limit params is 0, that means deallocate all free extents.

And if decay->time_ms is not zero, will handle epoch advance related things, calculate limit by arena_decay_backlog_npages_limit(decay) and use the limit to call arena_decay_to_limit.

arena_decay_stashed will do purging work for arena_decay_to_limit.

In arena_decay_stashed, it will check ‘extents' state, if the state is extent_state_dirty, it will add the extents to arena->extents_muzzy, and if the state is extent_state_muzzy, the extent will return to system by extent_dalloc_wrapper.

Default dirty decay time 10 seconds(DIRTY_DECAY_MS_DEFAULT) and default muzzy decay time is 0(MUZZY_DECAY_MS_DEFAULT).

Others

Memory allocation is important for application performance, there are many different implementation[3][4][5][6] or designed for special purpose[7][8].

JeMalloc provides per CPU arena but this feature has been disabled by default.

TCMalloc uses per-thread mode first and it doesn't work well when an application has a large thread count. Then TCMalloc migrated to the per-CPU approach[8].

Slab allocator[5] is used for kernel object caching which has specific consideration about cache line usage. JeMalloc relies on multiple arenas and allows applications to pad the memory[3].

The cache line will be helpfull in some cases but not all cases. CPHash[9] does some comparison about this, CPHash is a hash table which designed for reducing cache missing. In its benchamrk, when the hash table's memory usage is bigger than L3 cache, the performance wil drop and is limitted by DRAM.

Erlang VM creates one thread for every CPU which is used for scheduling Erlang processes, so it's easier to do per-CPU related things. And Erlang has different allocators for different usages, such as ets_alloc for in-memory key-value storage, temp_alloc for temporary allocations, and it can specify different fit algorithms for different allocators[6].

MICA[7] is in-memory key-value storage. In its cache mode, memory management is very interesting. It organizes its memory as a fix-sized circular log, and key-value items will be appended to the tail of the log. When the circular log is full, items will be evicted from the head. So no need for garbage collection and defragmentation.

LAMA is designed for solving slab calcification[11], this problem may occur in the default Memcached server, Memcached allocates different size memory from the different slabs. The memory allocator needs to consider how to evict item(s) when no free memory. LAMA is designed for distributing memory to different slabs dynamically based on the access pattern for reducing the miss rate.

Summary

In this article, instead of describing the detail such as each structure's fields or how its extent rtree works, I focus on introducing the relationship between different structures and how those structures work together for allocating and deallocating memory. Hope it is helpful for people who want to understand how JeMalloc works.

References

  1. JeMalloc https://github.com/jemalloc/jemalloc/releases/tag/5.2.0
  2. Smoothstep https://en.wikipedia.org/wiki/Smoothstep
  3. A Scable Concurrent malloc(3) Implementation for FreeBSD
  4. TCMalloc https://github.com/google/tcmalloc
  5. The Slab Allocator: An Object-Caching Kernel Memory Allocator
  6. Erlang memory allocator https://github.com/erlang/otp/blob//OTP-25.1.1/erts/emulator/beam/erl_alloc.c
  7. MICA: A Holistic Approach to Fast In-Memory Key-Value Storage
  8. TCMalloc : Thread-Caching Malloc https://google.github.io/tcmalloc/design.html
  9. CPHash: A Cache-Partitioned Hash Table
  10. LAMA: Optimized Locality-aware Memory Allocation for Key-value Cache
  11. Caching with twemcache. https://blog.twitter.com/2012/caching-with-twemcache
yfractal commented 1 year ago

Move to https://github.com/yfractal/blog/blob/master/blog/2022-10-05-jemalloc.md