xflouris / libpll-2

Phylogenetic Likelihood Library 2 - experimental
GNU Affero General Public License v3.0

CLV Management / Saving mode #19

Open pierrebarbera opened 3 years ago

pierrebarbera commented 3 years ago

This PR introduces a new optional feature that allows the number of concurrently allocated CLV buffers to be set as low as log_2( #leaves ) + 2, compared to the usual one per inner node of the tree (or three per inner node in the case of epa-ng).

The main idea behind how this is implemented is that of actively managing this set of limited CLV slots, translating the usual clv_index into an internal slot_index, and deciding which slot should be overwritten if there aren't enough free slots available. This behaviour can be customized through the use of callback functions, collectively called the replacement strategy.

If/when you have time, please give some thought to what would still need to change for you to be able to use this in your code. This PR is primarily about feedback, though if we feel we can merge it safely, that would be great of course.

New structures

Like the repeats, the clv_manager is intended to be a feature with a self-contained struct that holds everything it needs:

typedef struct pll_clv_manager
{
  /**
   * Some upfront terminology:
   * - A slot is a buffer for one CLV that is held in memory
   * - the clv_index works like always, though now it functions as an
   *     "addressable" CLV index
   * - a clv index that is slotted means the CLV resides in memory
   * - a clv index that is pinned means the CLV resides in memory and may
   *     not be overwritten
   */

  size_t slottable_size; // max number of CLVs to hold in partition
  size_t addressable_begin; // first clv_index that is addressable
  size_t addressable_end; // one past last clv index that is addressable
  unsigned int * clvid_of_slot;
    // <slottable_size> entries, translates from slot_id to clv_index of node
    //  whose CLV is currently slotted here
    //  special value: PLL_CLV_SLOT_UNUSED if this slot isn't in use
  unsigned int * slot_of_clvid;
    // the reverse: indexed by clv_index, returns slot_id of a node
    // special value: PLL_CLV_CLV_UNSLOTTED if the node's clv isn't slotted 
    // currently
  bool * is_pinned;
    // tells if a given clv_index is marked as pinned
  size_t num_pinned;
  pll_uint_stack_t * unused_slots;
    // holds slot_id of slots that are not yet used
  pll_clv_manager_replace_cb strat_replace;
    // replacement strategy: replace function
  pll_clv_manager_update_cb strat_update_slot;
    // replacement strategy: update a slot with a clv_id function
  pll_clv_manager_dealloc_cb strat_data_dealloc;
    // replacement strategy: dealloc the custom data
  void* repl_strat_data;
    // void pointer to whatever data your replacement strategy might need
} pll_clv_manager_t;

Notes:

The callbacks can be set to custom functions to implement different behaviour for choosing which unpinned slot should be overwritten next if there are no unused slots available. By default they are set to the minimum recomputation cost (MRC) strategy, which aims to overwrite first those CLVs that are cheapest to recompute if they are needed again.
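To make this concrete, here is a rough sketch of what a custom replacement callback could look like, using only the fields of pll_clv_manager_t shown above. The exact signature of pll_clv_manager_replace_cb is not reproduced in this description, so the shape assumed here (manager in, slot to overwrite out) and the assumption that repl_strat_data points to a per-clv_index cost array are illustrative only:

#include <limits.h>   /* UINT_MAX */
#include <stddef.h>   /* size_t */

/* SKETCH ONLY: the real pll_clv_manager_replace_cb signature may differ.
 * Assumed shape: given the manager, return the slot_id whose CLV should be
 * overwritten next. This picks the unpinned, slotted CLV with the lowest
 * recomputation cost, where repl_strat_data is assumed to point to a
 * per-clv_index cost array (e.g. subtree sizes, as in the default MRC). */
static unsigned int example_replace_lowest_cost(pll_clv_manager_t * clv_man)
{
  const unsigned int * cost = (const unsigned int *) clv_man->repl_strat_data;

  unsigned int best_slot = 0;
  unsigned int best_cost = UINT_MAX;

  for (size_t slot = 0; slot < clv_man->slottable_size; ++slot)
  {
    const unsigned int clv_id = clv_man->clvid_of_slot[slot];

    /* skip slots that are not in use and CLVs that are currently pinned */
    if (clv_id == PLL_CLV_SLOT_UNUSED || clv_man->is_pinned[clv_id])
      continue;

    if (cost[clv_id] < best_cost)
    {
      best_cost = cost[clv_id];
      best_slot = (unsigned int) slot;
    }
  }
  return best_slot;
}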

Lifetime management

The manager struct is deallocated with the partition (during pll_partition_destroy). The CLV manager dealloc function tries to call the replacement-strategy deallocation callback to deallocate any custom data used there.

Changes to existing code

I tried to keep these as limited and as "zero overhead" as possible.

clv access

The biggest change is that, with the memory saver enabled, accesses of this kind are wrong:

partition->clv[ node->clv_index ];

Instead, access to the clv buffer should now be done through these functions:
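(The accessor functions themselves are not reproduced here. As a purely illustrative sketch with made-up names, the intended pattern is roughly the following:)

/* ILLUSTRATION ONLY: "example_get_clv" is a made-up name; the PR's actual
 * accessor may be named and shaped differently. */
const double * example_get_clv(pll_partition_t * partition,
                               unsigned int clv_index);

static const double * access_clv(pll_partition_t * partition,
                                 const pll_unode_t * node)
{
  /* wrong with the memory saver: clv_index no longer indexes
   * partition->clv directly */
  /* return partition->clv[node->clv_index]; */

  /* instead: ask the manager, which translates the clv_index to the slot
   * currently holding it (slotting it first if necessary) */
  return example_get_clv(partition, node->clv_index);
}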

pinning

In the existing code the access functions come into play, for example, in the call to the partials functions. There we also encounter the second major (but small) change: we have to explicitly pin the parent clv of the to-be-updated partial, then do the normal update, and finally unpin the two child clvs that are no longer needed:

first pin

then unpin after
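Since the screenshots are not reproduced here, the following sketch shows the intended flow around a single update_partials operation; the pin/unpin names are placeholders, not necessarily the functions introduced by this PR:

/* stand-in declarations for the PR's pin/unpin functions (names assumed) */
void example_clv_pin  (pll_partition_t * partition, unsigned int clv_index);
void example_clv_unpin(pll_partition_t * partition, unsigned int clv_index);

static void update_one_partial(pll_partition_t * partition,
                               const pll_operation_t * op)
{
  /* 1) pin the parent CLV so the replacement strategy cannot evict the
   *    buffer we are about to write into */
  example_clv_pin(partition, op->parent_clv_index);

  /* 2) the usual partials update for this single operation */
  pll_update_partials(partition, op, 1);

  /* 3) the two child CLVs are no longer needed as inputs, so unpin them
   *    and their slots become candidates for replacement again */
  example_clv_unpin(partition, op->child1_clv_index);
  example_clv_unpin(partition, op->child2_clv_index);
}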

changes to clv allocation

Like with the repeats mode, the allocation of the clv buffer is dependent on information we only get during the later initialization, namely how many slots we want. Consequently, the clv buffer is allocated after the partition creation. As this is shared between repeats and memory saver, I refactored this part into its own function to reduce redundancy.

other minor changes and additions

Usage

// first major change: we need information about the sizes of each subtree, both for the traversal
// later and for the default replacement strategy
auto subtree_sizes = pll_utree_get_subtree_sizes(tree);

auto attributes = simd_autodetect();
// enable the new mode
attributes |= PLL_ATTRIB_LIMIT_MEMORY;
// create the partition (note how nothing changes in this call)
auto partition = pll_partition_create(
    tips, // number of tips
    inner_nodes, // number of extra CLV buffers (one per inner in this case)
    ... );
const size_t low_clv_num = ceil(log2(tree->tip_count)) + 2;
pll_clv_manager_init(
    partition,
    low_clv_num, // !!! number of SLOTS we want !!!
    NULL, // slot replacement callback (NULL = default = MRC)
    NULL, // slot update callback
    NULL  // strategy data deallocation function
);
// since we are using the MRC strategy, we need to also initialize it
pll_clv_manager_MRC_strategy_init(
    partition->clv_man, // the CLV manager struct
    tree, // the utree
    subtree_sizes // a per-node_index array of cost to recompute. Here we use the size of the subtree
                         // starting at that node_index, toward the virtual root set in the tree
);

...

// later, when creating the operations for update_partials, we have the biggest change to normal code flow:
// we need to traverse the tree in a largest-subtree-first traversal
pll_utree_traverse_lsf(tree,
                      subtree_sizes,
                      PLL_TREE_TRAVERSE_POSTORDER,
                      cb_full_traversal,
                      travbuffer,
                      &traversal_size);

...

// create the operations array, update partials, etc

Notes:

Completeness TODOs

Some things I haven't been able to dive into fleshing out/implementing yet, that I would consider necessary for the manager to be "complete":

Other open questions

BenoitMorel commented 3 years ago

Here is some quick feedback before leaving for vacation ;-)

I can't look at it in detail now, but it looks great to me at first glance ;-)

amkozlov commented 3 years ago

Looks very nice! After reading the description and without looking at the code yet, I have a couple of comments regarding potential raxml-ng integration:

pierrebarbera commented 3 years ago

Here is some quick feedback before leaving for vacation ;-)

Thanks for taking the time, and for the kind words! :)

  • I appreciate the inline descriptions of the data members and the detailed description on how to adapt existing code + the example (maybe it would be worth adding it to the code samples?)

Good point, will add a basic example. What's already there are the tests, but they can be hard to read.

  • I think it's fine to have this on a per-partition basis. Or do you think the overhead of repeating the operations on all partitions will be expensive?

I don't think it will be, honestly. A lot of the operations have to be per-clv_buffer anyway, and the work regarding the subtree sizes, for example, can just be done once per tree.

  • It would be really great to support repeats. But I guess it's also a bit more challenging. Maybe we can have a look together around mid January to do a proper encapsulation of both "modes".

I agree, I would really like to have it in. In epa-ng it currently induces quite some overhead, especially when having to recompute all possible branches, so anything that can speed that up would be great.

pierrebarbera commented 3 years ago
  • combination with site repeats would be really important, because they not only save memory but also improve performance, so right now it's a bit unclear how often pure CLV recomputation mode will "win"

Agreed, it's high on the list.

  • thread safety issues can be avoided by replicating CLV manager for every thread (and partition), this of course means extra overhead but probably negligible in most cases (still, might be a reason for making "unnecessary" datastructs like unused_slots optional)

That's a good quick way around it, but we would probably want to experiment with whether introducing a mutex for every slot would be worth it. Either way it should be optional, to minimize unneeded overhead for differing parallelization schemes or no parallelization at all.
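Just to sketch what the per-slot locking experiment could look like (purely illustrative, not part of this PR; names are made up):

#include <pthread.h>
#include <stdlib.h>

/* ILLUSTRATION ONLY: one mutex per slot, to be taken while a slot's CLV is
 * being read or (re)computed, so several threads could share one manager. */
typedef struct example_slot_locks
{
  pthread_mutex_t * locks; /* one mutex per slot */
  size_t count;
} example_slot_locks_t;

static example_slot_locks_t * example_slot_locks_create(size_t slot_count)
{
  example_slot_locks_t * sl = malloc(sizeof(*sl));
  if (!sl) return NULL;
  sl->locks = malloc(slot_count * sizeof(pthread_mutex_t));
  if (!sl->locks) { free(sl); return NULL; }
  sl->count = slot_count;
  for (size_t i = 0; i < slot_count; ++i)
    pthread_mutex_init(&sl->locks[i], NULL);
  return sl;
}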

  • obviously, tree topology and thus subtree_sizes will change after every move, so there should be a way to reset it after CLV manager creation (maybe add subtree_sizes_valid flag to keep track?)
  • and since subtree_sizes is always needed for LSF traversal, maybe it should be encapsulated in CLV manager? (would probably simplify the interface a bit)

This is a good point; I had it a bit too separate in my mind: it's needed for the MRC strategy, but there it's just one possible form of "cost". But more importantly, since you absolutely need it in the lsf-traversal, I guess there's no reason not to have it. This way the MRC_strategy_init may even become totally unnecessary in the full-default case, if we force the user to always supply a {u|r}tree. The one slightly annoying thing is that the lsf-traversal function would still have to take the subtree sizes array as an argument, just in case someone wants to use it without the memsaver. Small price to pay, I'd say.

computations commented 3 years ago

I have a bunch of bikeshedding comments that I will leave off. I think this is very well done, but I do have some technical questions:

pierrebarbera commented 3 years ago

I have a bunch of bikeshedding comments that I will leave off. I think this is very well done, but I do have some technical questions:

  • Does the stack structure need the bool empty? I think you can change the logic such that you don't need this.

True, and also safer.

  • Do you need recompacting functions? Or maybe other functions which report some statistics for the state of the managed clvs?

I don't understand what you mean by recompacting functions, elaborate please? As for statistics: I haven't needed anything like that yet, any specific suggestions? Like, number of pinned slots, total recomputation cost?

  • Does locality matter/how does this impact the locality of libpll? I know in the past that I have done some experiments that showed even in the worst case, the impact wasn't that bad, but I think that it is worth checking again.

Interesting point, perhaps some more advanced replacement strategy could prefer to put child and parent clvs in slots that are close to each other in terms of actual buffer address, since we are abstracting here already anyway. With normal allocation in the partition, probably the memory is somewhat fragmented, as we allocate one CLV at a time, right? As opposed to one large malloc.

  • I think that this shares a lot of challenges and design choices with a garbage collector. No question there, just pointing out that we might be able to learn something from them.

In a sense, yes, though the goal isn't to deallocate the buffers of the slots, but rather to have the program never exceed some amount of memory. By the way, in the epa-ng codebase I have a bunch of functions to gauge the memory footprint of a partition etc.; maybe these could be useful in the library at some point, implemented in a more "plugged in" way (hand in hand with a refactor of pll_partition_create).

Again, thanks for the comments!

computations commented 3 years ago

I don't understand what you mean by recompacting functions, elaborate please? As for statistics: I haven't needed anything like that yet, any specific suggestions? Like, number of pinned slots, total recomputation cost?

I mean some function that just swaps the allocations such that we obtain a "better"[1] layout. To be honest, I was kind of just thinking out loud in writing there, and that thought was the proto-thought to the locality question. Basically, the motivation is that if you find that locality matters, then we might want to occasionally reallocate in a way that improves the locality.

With normal allocation in the partition, probably the memory is somewhat fragmented, as we allocate one CLV at a time, right? As opposed to one large malloc.

This is only half true, because the CLV is actually as long as the alignment (except with site repeats), so the length usually saves us from any locality problems. But I think EPA-NG often has really short alignments, so maybe this is a bigger factor? I did the experiment with 1000 sites and didn't try different numbers. I think that if you have time, this should be looked into.

[1]: better is left as an exercise to the reader