Enferlain opened this issue 5 months ago
Thank you for your information!
These routines are tensor-level, which means they could be applied to the model mixer easily. But we also need to check each algorithm, its meaning, and its speed.
There is a more up-to-date set of merge method implementations: https://github.com/ljleb/sd-mecha/blob/main/sd_mecha/merge_methods.py

Everything marked with `@convert_to_recipe` is a merge method.
I can explain any or all of them if you want, as I came up with most of them. You can decide whether any of them are worth implementing. `rotate` is the slowest one: it takes ~1h on SDXL and ~9 minutes on SD1.5.
All merge methods are key-level. I omitted weighted sum and add difference as they are trivial.
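For reference, the two "trivial" methods mentioned above can be sketched roughly like this (my own minimal illustration, not sd-mecha's actual code; the functions work element-wise on any values supporting arithmetic operators, e.g. PyTorch tensors or plain floats):

```python
def weighted_sum(a, b, alpha=0.5):
    # Linear interpolation between corresponding parameters of A and B.
    return (1 - alpha) * a + alpha * b

def add_difference(a, b, c, alpha=1.0):
    # Add the delta (B - C) onto A, scaled by alpha.
    return a + alpha * (b - c)
```

Most of the methods below can be read as variations on, or refinements of, these two.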
- `slerp(a, b)`: spherical linear interpolation. Normalize A and B, slerp them, then recover a proper norm by interpolating the norm of A and the norm of B.
- `perpendicular_component(a, b)`: intended to be used in delta space: `c + perpendicular_component(a - c, b - c)`. It finds the component of A perpendicular to B. This makes it possible to add-difference orthogonal information.
- `geometric_sum(a, b)`: AND gate. It works either in delta space or directly on model weights. It's equivalent to weighted sum in log space. For any corresponding parameters in A and B, if A or B is 0, then the result is 0; and if A and B have the same value, then that value is returned.
- `add_cosine_a(a, b)`, `add_cosine_b(a, b)`: brought over from supermerger; not exactly sure whether they are good or not.
- `ties_sum(a, b, c, ...)`: implementation of TIES: https://arxiv.org/abs/2306.01708
- `tensor_sum(a, b)`: copy parameters from A and from B, using a window over dimension 0 to decide which model to pick the weights from. Brought over from supermerger.
- `top_k_tensor_sum(a, b)`: reorder the parameters of A into the order of the parameters of B (call this reordered weight C). Then determine a mask that picks the top-k weights of A to give up on, and replace them with the values of C at the corresponding indices in the weight.
- `train_difference(a, b, c)`: the original supermerger train difference, except that I found a better filter metric: https://github.com/hako-mikan/sd-webui-supermerger/discussions/264#discussioncomment-8649950
- `multiply_quotient(a, b, c)`: train difference in log space. It tries to make the equation $\frac{AB}{C}$ work. Without the dissimilarity filter, it completely breaks down. Otherwise the resulting model gives very similar outputs to train difference, although it is still different. Alpha can be brought up to 4 before it starts breaking down with NaNs at generation time. @John-WL came up with the idea and I found a way to implement it.
- `distribution_crossover(a, b, c)`: reorder the weights of A and B into the order of C. Apply a crossover filter between A and B: A contributes the low end of the model, B contributes the high end. Then reorder the merged weights back into the order of C.
- `crossover(a, b)`: n-dimensional crossover between A and B (when A and B are conv layers, the spatial dimensions stay on their own axes). A contributes the low end, B contributes the high end. The weights are not reordered or reshaped. The filter should be isotropic, but honestly I'm not an expert in filter modelling, so this might need to be verified.
- `rotate(a, b)`: find an orthogonal transform $Q$ that minimizes the Frobenius norm between $AQ$ and $B$, then return $A'Q^{\alpha}$ with $\alpha \in [0, 1]$ the alignment factor. $A'$ is the weighted sum of A and $BQ^T$, which effectively interpolates the relationship between the neurons of A and the neurons of B oriented towards A. Contrary to the other methods, this one works on the "neurons" of A and B. A "neuron" here is just shorthand for "all parameters that contribute to a single weighted-sum operation during inference" (matrix multiplication can be seen as one weighted sum per output value). This is highly inspired by OPT: https://opt-training.github.io/
- `clip(a, b)`: weight clipping, but allows softening the clip bounds using multiple models.
- `dropout(a, b, c, ...)`: implementation of DARE, but with many models as input. I went a bit experimental with it by adding parameters that control how the Bernoulli mask is created.

I apologize if this is too much text to read. Let me know if I can clarify anything.
Thank you, this is a great addition to the merging space, but I just wanted to ask: have you thought about adding various merge methods?
https://github.com/s1dlx/meh/blob/81515bda3cba52edd263964c5517f4713faad86e/sd_meh/merge_methods.py
The most up-to-date version: https://github.com/ljleb/sd-mecha/blob/main/sd_mecha/merge_methods.py