zju-pi / Knowledge-Distillation-Paper

This repository maintains a collection of important papers on knowledge distillation (awesome-knowledge-distillation).

Missing paper #1

Closed · wutaiqiang closed this 6 months ago

wutaiqiang commented 6 months ago

Great work.

I would like to introduce two papers:

Name: Weight-Inherited Distillation for Task-Agnostic BERT Compression
paper:
code: https://github.com/wutaiqiang/WID-NAACL2024
Blog: https://zhuanlan.zhihu.com/p/687294843
TL;DR: Compresses the model via weight inheritance: a mapping is learned directly that maps the teacher model's weights onto the student model's weights.
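
A toy illustration of the general idea (the dimensions and the MSE probe loss below are arbitrary; the actual implementation is in the code repo above):

```python
import torch
import torch.nn as nn

# Toy illustration only: the student weight is parameterised as a learned
# linear mapping of the frozen teacher weight, and only that mapping is trained.
d_in, d_t, d_s = 512, 768, 384               # input dim, teacher / student output dims (arbitrary)
W_teacher = torch.randn(d_t, d_in)           # a frozen teacher layer weight

M = nn.Parameter(torch.empty(d_s, d_t))      # learnable weight-inheritance mapping
nn.init.orthogonal_(M)

opt = torch.optim.Adam([M], lr=1e-2)
for step in range(200):
    x = torch.randn(64, d_in)                # random probe inputs
    W_student = M @ W_teacher                # student weight inherited from the teacher
    teacher_out = x @ W_teacher.T            # (64, d_t)
    student_out = x @ W_student.T            # (64, d_s)
    # ask the student output, lifted back through M, to reproduce the teacher output
    loss = nn.functional.mse_loss(student_out @ M, teacher_out)
    opt.zero_grad(); loss.backward(); opt.step()
```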

Name: Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models
paper: https://arxiv.org/abs/2404.02657
Blog: https://zhuanlan.zhihu.com/p/690748958
TL;DR: Previous work argued that RKL is better for distilling LLMs because RKL is mode-seeking while FKL is mean-seeking; this paper shows that this claim does not hold in that setting and proposes an improvement.
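
For reference, the two objectives under discussion here, writing $p$ for the teacher distribution and $q_\theta$ for the student, are the standard forward and reverse KL divergences:

$$\mathrm{FKL} = \mathrm{KL}(p \,\|\, q_\theta) = \sum_x p(x)\log\frac{p(x)}{q_\theta(x)}, \qquad \mathrm{RKL} = \mathrm{KL}(q_\theta \,\|\, p) = \sum_x q_\theta(x)\log\frac{q_\theta(x)}{p(x)}$$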

Thanks~

DefangChen commented 6 months ago

Thanks for your contribution! I will check these two papers later (perhaps next week).

wutaiqiang commented 6 months ago

Thanks for your time~ Really appreciate it

DefangChen commented 6 months ago

I have skimmed through these two papers. Here are my brief comments:

  1. I like the concept of weight-inherited distillation. The weights in a neural network should also embody another form of knowledge (besides the input-output function mapping). This claim has been mentioned in Hinton’s original KD paper and empirically validated in https://arxiv.org/abs/2203.14001. BTW, a simple technique to address the dimensional mismatch issue is matrix factorization (e.g., projecting the student feature onto the teacher feature using a linear mapping, then combining the parameters of the projector with the teacher matrix); see the sketch after this list.
  2. The discussion and analysis of FKL and RKL are widely known and easily accessible in standard machine learning textbooks or on the internet. I believe that the title of Section 3 should be changed to "Background" and some proper references should be added (even if it may have been derived by yourself). I have not read the LLMs+KD papers you mentioned, and I am somewhat surprised that, according to your comments, they contain such obvious mistakes.
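
A minimal sketch of the projector trick from point 1 (my own toy version with arbitrary dimensions, not code from any of the papers): a linear projector lifts the student feature to the teacher dimension for the feature-matching loss, and since both the projector and the teacher matrix are linear, they can be folded into a single matrix afterwards.

```python
import torch
import torch.nn as nn

d_s, d_t, d_out = 384, 768, 768
h_s = torch.randn(16, d_s)                   # student features
h_t = torch.randn(16, d_t)                   # teacher features for the same inputs

P = nn.Linear(d_s, d_t, bias=False)          # learnable projector (student -> teacher space)
loss = nn.functional.mse_loss(P(h_s), h_t)   # feature-matching loss computed in teacher space

# Folding: a (frozen) teacher matrix W_t applied after the projector collapses
# into one matrix W_t @ P of shape (d_out, d_s), so the projector adds no
# extra cost at inference time once merged.
W_t = torch.randn(d_out, d_t)
W_merged = W_t @ P.weight                    # (d_out, d_s)
assert torch.allclose(h_s @ W_merged.T, P(h_s) @ W_t.T, atol=1e-3)
```
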
wutaiqiang commented 6 months ago

Thanks for your kind reply.

For the first paper, there is a follow-up paper that uses a similar idea: https://openreview.net/forum?id=cNajYNQcj4.

For the second one, we argue that the mean-seeking and mode-seeking behaviors do not hold in KD for LLMs, while previous papers claim that these two behaviors exist (please see the Zhihu blog for more details). In short, the mean-seeking and mode-seeking behaviors require the student distribution to be Gaussian.

Do you mean that the claim that, in KD, FKL treats the head part as the priority while RKL treats the tail part as the priority is widely known and easily accessible in standard machine learning textbooks or on the internet? Is there any material or paper on this? Please let me know if there is, and I would be glad to cite it. Nevertheless, I agree that the proof is not difficult; the issue is that previous papers contain such obvious mistakes.
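
To make the point concrete, here is a tiny numeric illustration (toy numbers of my own, not from the paper): with an unconstrained categorical student, both FKL and RKL are uniquely minimised when the student equals the teacher, so the usual "RKL collapses onto a single mode" picture, which assumes a unimodal Gaussian student, does not carry over; the two losses differ in which regions they emphasise during optimisation, not in their optimum.

```python
import torch

def fkl(p, q):  # forward KL: KL(p || q), teacher p, student q
    return torch.sum(p * (p / q).log())

def rkl(p, q):  # reverse KL: KL(q || p)
    return torch.sum(q * (q / p).log())

p = torch.tensor([0.45, 0.05, 0.05, 0.45])       # bimodal "teacher" over 4 tokens
q_match = p.clone()                               # student that matches the teacher
q_mode = torch.tensor([0.88, 0.04, 0.04, 0.04])  # student collapsed onto one mode

print(fkl(p, q_match).item(), rkl(p, q_match).item())  # both 0: q = p is optimal for both
print(fkl(p, q_mode).item(), rkl(p, q_mode).item())    # both strictly positive
```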

wutaiqiang commented 6 months ago

Let me add a few more words about the second point. Existing materials basically all say that forward KL tries to fit multiple modes at the same time, while reverse KL tends to fit a single mode. That is not wrong, it is indeed the case, but there is a precondition: the student distribution is a unimodal Gaussian and the teacher is a multimodal Gaussian (mixture).

Some previous papers simply borrowed this viewpoint, used the conclusion directly, and then claimed that RKL is better than FKL. In fact, that assumption does not hold. The Rethinking paper is essentially a response to this viewpoint, pointing out that the conclusion cannot be carried over to KD for LLMs: the optimization target of both FKL and RKL is for the two distributions to coincide (which is also intuitive); the difference between the two lies in which regions they emphasize during the fitting process. Admittedly, the proof is not complicated, and plenty of material on f-divergences can be found on Wikipedia. However, this is the first time I have read something that points this out directly. If there is a blog or paper making the same point, please share it; I would be happy to add a citation in a later version. The proof is not hard, but a paper does not have to be full of formulas nobody can understand to be good; being able to rethink a misconception that everyone takes for granted is, I think, not a bad thing.

In any case, thank you again for your comments~~

DefangChen commented 6 months ago

“Do you mean that the claim that, in KD, FKL treats the head part as the priority while RKL treats the tail part as the priority is widely known and easily accessible in standard machine learning textbooks or on the internet?”

No. I mean all the derivations in Sec 3 are standard and well-known.

DefangChen commented 6 months ago

I did not mean to challenge the value of your paper. I was just pointing out some facts.

wutaiqiang commented 6 months ago

No offense taken. I sincerely thank you for your advice.

The "Deeper Insights" part can be found on the wiki, but the "Difference between FKL and RKL" part is novel. Since the goal is to show that the mean-seeking and mode-seeking behaviors do not hold in KD for LLMs, some "easy" derivations are needed to help readers understand. BTW, the title of that section is "Preliminary and Rethinking"; I think "Preliminary" is similar to "Background".

Thanks again~~