whyNLP / LCKV

Layer-Condensed KV Cache: up to 10x larger batch size, fewer parameters, and less computation, with dramatic speedups and better task performance. Accepted to ACL 2024.
https://arxiv.org/abs/2405.10637

Merge code for a better way of initializing weights #8

Closed by why-in-Shanghaitech 1 month ago

why-in-Shanghaitech commented 1 month ago

MLKV proposed a more efficient method for initializing the model weights from a pre-trained model. Our experiments show that it is also effective for LCKV; a rough sketch of the idea is below.
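For illustration only, a minimal sketch of an MLKV-style initialization as I understand it (mean-pooling the K/V projection weights of the pretrained layers that will share a cache, analogous to GQA uptraining). The names `lckv_model`, `layer_groups`, and the attribute paths are assumptions for this sketch and do not reflect this repository's actual code.

```python
import torch
from transformers import AutoModelForCausalLM

def init_from_pretrained(lckv_model, pretrained_name, layer_groups):
    """Sketch: initialize each target layer's K/V projections by averaging
    the corresponding projections of the pretrained layers in its group.

    `layer_groups` maps each target layer index to the list of source layer
    indices that will share its KV cache (an assumed convention here).
    """
    src = AutoModelForCausalLM.from_pretrained(pretrained_name)
    src_layers = src.model.layers

    for tgt_idx, group in enumerate(layer_groups):
        tgt_attn = lckv_model.model.layers[tgt_idx].self_attn
        for proj in ("k_proj", "v_proj"):
            # Stack the source projection weights and mean-pool across layers.
            weights = torch.stack(
                [getattr(src_layers[i].self_attn, proj).weight for i in group]
            )
            getattr(tgt_attn, proj).weight.data.copy_(weights.mean(dim=0))
    return lckv_model
```

The remaining (non-KV) weights would simply be copied over from the pretrained checkpoint; only the projections that feed the shared cache need the averaging step.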