xflouris / libpll-2

Phylogenetic Likelihood Library 2 - experimental
GNU Affero General Public License v3.0

CLV Management / Saving mode #19

Open pierrebarbera opened 3 years ago

pierrebarbera commented 3 years ago

This PR introduces a new optional feature that allows the number of concurrently allocated CLV buffers to be set as low as log_2( #leaves ) + 2, compared to the usual one per inner node of the tree (or three per inner node in the case of epa-ng).

The main idea behind how this is implemented is that of actively managing this set of limited CLV slots, translating the usual clv_index into an internal slot_index, and deciding which slot should be overwritten if there aren't enough free slots available. This behaviour can be customized through the use of callback functions, collectively called the replacement strategy.

If/when you have time, please give some thought to what would still need to change for you to be able to use this in your code. This PR is primarily about feedback, though if we feel we can merge it safely, that would be great of course.

New structures

Like the repeats, the clv_manager is intended to be a feature with a self-contained struct that holds everything it needs:

typedef struct pll_clv_manager
{
  /**
   * Some upfront terminology:
   * - A slot is a buffer for one CLV that is held in memory
   * - the clv_index works like always, though now it functions as an
   *     "addressable" CLV index
   * - a clv index that is slotted means the CLV resides in memory
   * - a clv index that is pinned means the CLV resides in memory and may
   *     not be overwritten
   */

  size_t slottable_size; // max number of CLVs to hold in partition
  size_t addressable_begin; // first clv_index that is addressable
  size_t addressable_end; // one past last clv index that is addressable
  unsigned int * clvid_of_slot;
    // <slottable_size> entries, translates from slot_id to clv_index of node
    //  whose CLV is currently slotted here
    //  special value: PLL_CLV_SLOT_UNUSED if this slot isn't in use
  unsigned int * slot_of_clvid;
    // the reverse: indexed by clv_index, returns slot_id of a node
    // special value: PLL_CLV_CLV_UNSLOTTED if the node's clv isn't slotted 
    // currently
  bool * is_pinned;
    // tells if a given clv_index is marked as pinned
  size_t num_pinned;
  pll_uint_stack_t * unused_slots;
    // holds slot_id of slots that are not yet used
  pll_clv_manager_replace_cb strat_replace;
    // replacement strategy: replace function
  pll_clv_manager_update_cb strat_update_slot;
    // replacement strategy: update a slot with a clv_id function
  pll_clv_manager_dealloc_cb strat_data_dealloc;
    // replacement strategy: dealloc the custom data
  void* repl_strat_data;
    // void pointer to whatever data your replacement strategy might need
} pll_clv_manager_t;

Notes:

The callbacks can be set to custom functions to implement different behaviour for choosing which unpinned slot should be overwritten next if there are no unused slots available. By default they are set to the minimum recomputation cost (MRC) strategy, which aims to overwrite first those CLVs that are cheapest to recompute if they are needed again.
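To make this concrete, here is a rough sketch of what a custom replacement callback could look like, using only the fields of pll_clv_manager_t shown above. The exact signature of pll_clv_manager_replace_cb is not reproduced in this description, so the shape assumed here (manager in, slot to overwrite out) and the assumption that repl_strat_data points to a per-clv_index cost array are illustrative only:

#include <limits.h>   /* UINT_MAX */
#include <stddef.h>   /* size_t */

/* SKETCH ONLY: the real pll_clv_manager_replace_cb signature may differ.
 * Assumed shape: given the manager, return the slot_id whose CLV should be
 * overwritten next. This picks the unpinned, slotted CLV with the lowest
 * recomputation cost, where repl_strat_data is assumed to point to a
 * per-clv_index cost array (e.g. subtree sizes, as in the default MRC). */
static unsigned int example_replace_lowest_cost(pll_clv_manager_t * clv_man)
{
  const unsigned int * cost = (const unsigned int *) clv_man->repl_strat_data;

  unsigned int best_slot = 0;
  unsigned int best_cost = UINT_MAX;

  for (size_t slot = 0; slot < clv_man->slottable_size; ++slot)
  {
    const unsigned int clv_id = clv_man->clvid_of_slot[slot];

    /* skip slots that are not in use and CLVs that are currently pinned */
    if (clv_id == PLL_CLV_SLOT_UNUSED || clv_man->is_pinned[clv_id])
      continue;

    if (cost[clv_id] < best_cost)
    {
      best_cost = cost[clv_id];
      best_slot = (unsigned int) slot;
    }
  }
  return best_slot;
}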

Lifetime management

The manager struct is deallocated with the partition (during pll_partition_destroy). The CLV manager dealloc function tries to call the replacement-strategy deallocation callback to deallocate any custom data used there.

Changes to existing code

I tried to keep these as limited and as "zero overhead" as possible.

clv access

The biggest change is that, with the memory saver enabled, accesses of this kind are wrong:

partition->clv[ node->clv_index ];

Instead, access to the clv buffer should now be done through these functions:
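(The accessor functions themselves are not reproduced here. As a purely illustrative sketch with made-up names, the intended pattern is roughly the following:)

/* ILLUSTRATION ONLY: "example_get_clv" is a made-up name; the PR's actual
 * accessor may be named and shaped differently. */
const double * example_get_clv(pll_partition_t * partition,
                               unsigned int clv_index);

static const double * access_clv(pll_partition_t * partition,
                                 const pll_unode_t * node)
{
  /* wrong with the memory saver: clv_index no longer indexes
   * partition->clv directly */
  /* return partition->clv[node->clv_index]; */

  /* instead: ask the manager, which translates the clv_index to the slot
   * currently holding it (slotting it first if necessary) */
  return example_get_clv(partition, node->clv_index);
}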

pinning

In the existing code the access functions come into play, for example, in the call to the partials functions. There we also encounter the second major (but small) change: we have to explicitly pin the parent clv of the to-be-updated partial, then do the normal update, and finally unpin the two child clvs that are no longer needed:

first pin

then unpin after
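Since the screenshots are not reproduced here, the following sketch shows the intended flow around a single update_partials operation; the pin/unpin names are placeholders, not necessarily the functions introduced by this PR:

/* stand-in declarations for the PR's pin/unpin functions (names assumed) */
void example_clv_pin  (pll_partition_t * partition, unsigned int clv_index);
void example_clv_unpin(pll_partition_t * partition, unsigned int clv_index);

static void update_one_partial(pll_partition_t * partition,
                               const pll_operation_t * op)
{
  /* 1) pin the parent CLV so the replacement strategy cannot evict the
   *    buffer we are about to write into */
  example_clv_pin(partition, op->parent_clv_index);

  /* 2) the usual partials update for this single operation */
  pll_update_partials(partition, op, 1);

  /* 3) the two child CLVs are no longer needed as inputs, so unpin them
   *    and their slots become candidates for replacement again */
  example_clv_unpin(partition, op->child1_clv_index);
  example_clv_unpin(partition, op->child2_clv_index);
}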

changes to clv allocation

Like with the repeats mode, the allocation of the clv buffer is dependent on information we only get during the later initialization, namely how many slots we want. Consequently, the clv buffer is allocated after the partition creation. As this is shared between repeats and memory saver, I refactored this part into its own function to reduce redundancy.

other minor changes and additions

Usage

// first major change: we need information about the sizes of each subtree, both for the traversal
// later and for the default replacement strategy
auto subtree_sizes = pll_utree_get_subtree_sizes(tree);

auto attributes = simd_autodetect();
// enable the new mode
attributes |= PLL_ATTRIB_LIMIT_MEMORY;
// create the partition (note how nothing changes in this call)
auto partition = pll_partition_create(
    tips, // number of tips
    inner_nodes, // number of extra CLV buffers (one per inner in this case)
    ... );
const size_t low_clv_num = ceil(log2(tree->tip_count)) + 2;
pll_clv_manager_init(
    partition,
    low_clv_num, // !!! number of SLOTS we want !!!
    NULL, // slot replacement callback (NULL = default = MRC)
    NULL, // slot update callback
    NULL  // strategy data deallocation function
);
// since we are using the MRC strategy, we need to also initialize it
pll_clv_manager_MRC_strategy_init(
    partition->clv_man, // the CLV manager struct
    tree, // the utree
    subtree_sizes // a per-node_index array of cost to recompute. Here we use the size of the subtree
                         // starting at that node_index, toward the virtual root set in the tree
);

...

// later, when creating the operations for update_partials, we have the biggest change to normal code flow:
// we need to traverse the tree in a largest-subtree-first traversal
pll_utree_traverse_lsf(tree,
                      subtree_sizes,
                      PLL_TREE_TRAVERSE_POSTORDER,
                      cb_full_traversal,
                      travbuffer,
                      &traversal_size);

...

// create the operations array, update partials, etc

Notes:

Completeness TODOs

Some things I haven't been able to dive into fleshing out/implementing yet, that I would consider necessary for the manager to be "complete":

Other open questions

BenoitMorel commented 3 years ago

Here is some quick feedback before leaving for vacation ;-)

I can't look at it in detail now, but it looks great to me at first glance ;-)

amkozlov commented 3 years ago

Looks very nice! After reading the description and without looking at the code yet, I have a couple of comments regarding potential raxml-ng integration:

pierrebarbera commented 3 years ago

Here is some quick feedback before leaving for vacation ;-)

Thanks for taking the time, and for the kind words! :)

  • I appreciate the inline descriptions of the data members and the detailed description on how to adapt existing code + the example (maybe it would be worth adding it to the code samples?)

Good point, will add a basic example. What's already there are the tests, but they can be hard to read.

  • I think it's fine to have this on a per-partition basis. Or do you think the overhead of repeating the operations on all partitions will be expensive?

I don't think it will be, honestly. A lot of the operations have to be per-clv_buffer anyway, and the work regarding the subtree sizes, for example, can just be done once per tree.

  • It would be really great to support repeats. But I guess it's also a bit more challenging. Maybe we can have a look together around mid January to do a proper encapsulation of both "modes".

I agree, I would really like to have it in. In epa-ng it currently induces quite some overhead, especially when having to recompute all possible branches, so anything that can speed that up would be great.

pierrebarbera commented 3 years ago
  • combination with site repeats would be really important, because they not only save memory but also improve performance, so right now it's a bit unclear how often pure CLV recomputation mode will "win"

Agreed, it's high on the list.

  • thread safety issues can be avoided by replicating CLV manager for every thread (and partition), this of course means extra overhead but probably negligible in most cases (still, might be a reason for making "unnecessary" datastructs like unused_slots optional)

That's a good quick way around it, but we would probably want to experiment with whether introducing a mutex for every slot would be worth it. Either way it should be optional, to minimize unneeded overhead for differing parallelization schemes or no parallelization at all.
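Just to sketch what the per-slot locking experiment could look like (purely illustrative, not part of this PR; names are made up):

#include <pthread.h>
#include <stdlib.h>

/* ILLUSTRATION ONLY: one mutex per slot, to be taken while a slot's CLV is
 * being read or (re)computed, so several threads could share one manager. */
typedef struct example_slot_locks
{
  pthread_mutex_t * locks; /* one mutex per slot */
  size_t count;
} example_slot_locks_t;

static example_slot_locks_t * example_slot_locks_create(size_t slot_count)
{
  example_slot_locks_t * sl = malloc(sizeof(*sl));
  if (!sl) return NULL;
  sl->locks = malloc(slot_count * sizeof(pthread_mutex_t));
  if (!sl->locks) { free(sl); return NULL; }
  sl->count = slot_count;
  for (size_t i = 0; i < slot_count; ++i)
    pthread_mutex_init(&sl->locks[i], NULL);
  return sl;
}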

  • obviously, tree topology and thus subtree_sizes will change after every move, so there should be a way to reset it after CLV manager creation (maybe add subtree_sizes_valid flag to keep track?)
  • and since subtree_sizes is always needed for LSF traversal, maybe it should be encapsulated in CLV manager? (would probably simplify the interface a bit)

This is a good point; I had it a bit too separate in my mind: it's needed for the MRC strategy, but there it's just one possible form of "cost". But more importantly, since you absolutely need it in the lsf-traversal, I guess there's no reason not to have it. This way the MRC_strategy_init may even become totally unnecessary in the full-default case, if we force the user to always supply a {u|r}tree. The one slightly annoying thing is that the lsf-traversal function would still have to take the subtree sizes array as an argument, just in case someone wants to use it without the memsaver. Small price to pay, I'd say.

computations commented 3 years ago

I have a bunch of bikeshedding comments that I will leave off. I think this is very well done, but I do have some technical questions:

pierrebarbera commented 3 years ago

I have a bunch of bikeshedding comments that I will leave off. I think this is very well done, but I do have some technical questions:

  • Does the stack structure need the bool empty? I think you can change the logic such that you don't need this.

True, and also safer.

  • Do you need recompacting functions? Or maybe other functions which report some statistics for the state of the managed clvs?

I don't understand what you mean by recompacting functions, elaborate please? As for statistics: I haven't needed anything like that yet, any specific suggestions? Like, number of pinned slots, total recomputation cost?

  • Does locality matter/how does this impact the locality of libpll? I know in the past that I have done some experiments that showed even in the worst case, the impact wasn't that bad, but I think that it is worth checking again.

Interesting point, perhaps some more advanced replacement strategy could prefer to put child and parent clvs in slots that are close to each other in terms of actual buffer address, since we are abstracting here already anyway. With normal allocation in the partition, probably the memory is somewhat fragmented, as we allocate one CLV at a time, right? As opposed to one large malloc.

  • I think that this shares a lot of challenges and design choices with a garbage collector. No question there, just pointing out that we might be able to learn something from them.

In a sense, yes, though the goal isn't to deallocate the buffers of the slots, but rather to have the program never exceed some amount of memory. By the way, in the epa-ng codebase I have a bunch of functions to gauge the memory footprint of a partition etc.; maybe these could be useful in the library at some point, implemented in a more "plugged in" way (hand in hand with a refactor of pll_partition_create).

Again, thanks for the comments!

computations commented 3 years ago

I don't understand what you mean by recompacting functions, elaborate please? As for statistics: I haven't needed anything like that yet, any specific suggestions? Like, number of pinned slots, total recomputation cost?

I mean some function that just swaps the allocations such that we obtain a "better"[1] layout. To be honest, I was kind of just thinking out loud in writing there, and that thought was the proto-thought to the locality question. Basically, the motivation is that if you find that locality matters, then we might want to occasionally reallocate in a way that improves the locality.

With normal allocation in the partition, probably the memory is somewhat fragmented, as we allocate one CLV at a time, right? As opposed to one large malloc.

This is only half true, because the CLV is actually as long as the alignment (except with site repeats), so the length usually saves us from any locality problems. But I think EPA-NG often has really short alignments, so maybe this is a bigger factor? I did the experiment with 1000 sites and didn't try different numbers. I think that if you have time, this should be looked into.

[1]: better is left as an exercise to the reader