weiren1998 / weiren1998.github.io

This is my blog.

Transformer Experiments | J球星的博客 #13

Open weiren1998 opened 2 years ago

weiren1998 commented 2 years ago

https://weiren1998.github.io/archives/697052e2.html#more

While exploring network architectures, one has to try many things and think them through; it also pays to write down the experimental data and the reasoning about the results, so as to build up intuition bit by bit.

I am currently working on my undergraduate thesis; the topic is designing an algorithm for detecting action keyframes in soccer videos. During the experiments I keep having small ideas that flash by. Many of them are minor tweaks for which I rarely run proper comparison experiments, so I am recording them here to make later improvements easier.

weiren1998 commented 2 years ago

Take a close look at BN's intermediate output values and visualize them

import cv2
import numpy as np
img = x.detach().cpu().numpy()  # move the torch tensor to numpy
img = (img - np.min(img)) / (np.max(img) - np.min(img)) * 255.0  # min-max normalize to [0, 255]
img = img.astype(np.uint8)  # cv2.imwrite expects uint8
cv2.imwrite('./100_msa.jpg', img[100])  # write out one feature map (index 100)
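
For context, `x` above is an intermediate activation pulled out of the model. A minimal way to grab it is a forward hook, roughly as below; the module path `model.blocks[0].norm1` is only an illustrative placeholder, and `model` / `images` are assumed to already exist:

import torch

feats = {}

def save_output(name):
    # store the module output so it can be visualized later
    def hook(module, inputs, output):
        feats[name] = output.detach()
    return hook

# register the hook on the layer of interest (placeholder name for a BN/LN layer)
handle = model.blocks[0].norm1.register_forward_hook(save_output('norm1'))
with torch.no_grad():
    model(images)            # one forward pass fills `feats`
handle.remove()
x = feats['norm1']           # the tensor visualized above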
weiren1998 commented 2 years ago

1. Load an intermediate checkpoint and visualize the intermediate-layer outputs

Conclusions:

  1. The feature maps output by LN have a small mean and a large variance; the variance is large because a few channels contain extreme maxima and minima
  2. The feature maps output by BN also have a small mean and a large variance; here the variance is large because the values of all features differ widely, with no obvious pattern
  3. For LN, apart from the channels with extreme values, the remaining channels are fairly close to one another, and this trend becomes more pronounced with depth
  4. The MLP changes the features more than the MSA layer does, which is one reason adding BN to the MLP can stabilize training

2. Test the effect of extreme values on BN and LN (a quick statistics / clipping sketch follows below)
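
To make observations like 1-3 concrete, and as a first probe for point 2, here is a rough sketch that dumps per-channel statistics and checks how much the extreme values drive the variance. `x` is assumed to be a captured feature map of shape (batch, tokens, channels), and the 5x-median threshold is an arbitrary choice:

import torch

per_channel_mean = x.mean(dim=(0, 1))     # one mean per channel
per_channel_std = x.std(dim=(0, 1))       # one std per channel
print('overall mean / std:', x.mean().item(), x.std().item())
outliers = torch.nonzero(per_channel_std > 5 * per_channel_std.median()).flatten()
print('channels with extreme values:', outliers.tolist())

# crude check of how much the extremes drive the variance:
# clamp to mean +/- 3 std and see how much the overall std drops
mu, sigma = x.mean().item(), x.std().item()
x_clipped = x.clamp(min=mu - 3 * sigma, max=mu + 3 * sigma)
print('std after clipping at 3 sigma:', x_clipped.std().item())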

weiren1998 commented 2 years ago

3. Try to train the Deit_bn model so that training is stable

3.1 Replace the activation function with ReLU6
  1. Motivation: gradients explode during normal training, so the activation in the FFN is changed to ReLU6 to shrink the activation outputs, preventing the features from blowing up in the forward pass; it also zeroes the gradient for large x
3.2 Add the Weight Norm mechanism
  1. Motivation: weight norm apparently stabilizes training in CNNs (see the paper for details); a rough sketch of the resulting FFN block is given after the results table below
3.3 The batch size also seems to be related to BN's stability
  1. For the Deit-L_bn_wn model, with a batch size of 128 the model went NaN at epoch 12, whereas after reducing the batch size to 32 the model became trainable
  2. Reducing the batch size seems to make BN training more stable; is this direction worth digging into further?
| No. | Model | BN | ImNet top-1 (%) | Para (M) | Throughput (img/s) | Embed dim | #heads | #depth | Note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | Deit-Ti | × | 72.56 (72.2) | 5 | 2536.5 | 192 | 3 | 12 |  |
|  | Deit-S | × | 80.47 (79.8) | 22 | 940.4 | 384 | 6 | 12 |  |
|  | Deit-B | × | 78.74 (81.8) | 86.6 | 292.3 | 768 | 12 | 12 |  |
|  | Deit-L | × | 77.83 | 153.2 |  | 1024 | 16 | 12 |  |
| 1 | Deit-B-bn_relu6 |  | 79.26 | 86.6 |  | 768 | 12 | 12 |  |
| 2 | Deit-L-bn_relu6 |  | NaN (35 epo) | 153.2 |  |  |  |  | see whether it collapses |
| 3 | Deit-B-bn_wn |  | 78.77 priv | 86.7 |  | 768 | 12 | 12 | check whether the WN module helps |
| 4 | Deit-L-bn_wn |  | BJ ing | 153.3 |  |  |  |  | see whether WN makes it collapse sooner |
| 5 | Deit-B-bn_wn_relu6 |  | 80.25 SH BJ ing | 86.7 |  |  |  |  | see whether relu6 and wn together help |
| 6 | Deit-B_ln_bn_wn |  | NaN (80.7, 234 epo) | 86.7 |  |  |  |  | see whether LN+BN+WN can be trained |
| 7 | Deit-B_ln_bn_relu6 |  | NaN (8 epo) | 86.7 |  |  |  |  | find the cause of the NaN |
| 8 | Deit-B_ln_bn_wn_relu6 |  | NaN (77.65, 172 epo) | 86.7 |  |  |  |  | see whether LN+BN+WN+relu6 can be trained |
| 9 | Swin-S_bn_wn_relu6 |  | ipt ing | 49.7 |  |  |  |  | see whether this works for Swin; batch 256 |
| 10 | Swin-T_bn_wn_relu6 |  | 80.75 ipt (81.3) | 28.3 |  |  |  |  | plan to check its effect on detection; batch 256 |
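
For reference, here is a minimal sketch of the FFN variant described in 3.1/3.2: BN inside the MLP, GeLU swapped for ReLU6, and weight norm on the linear layers. This is only my reading of the notes above, not the actual model code; the class name and the exact placement of BN are assumptions.

import torch.nn as nn
from torch.nn.utils import weight_norm

class FFNBnRelu6(nn.Module):
    """FFN variant: Linear -> BN -> ReLU6 -> Linear, with weight norm on both linears."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = weight_norm(nn.Linear(dim, hidden_dim))   # 3.2: weight norm for stability
        self.bn = nn.BatchNorm1d(hidden_dim)                  # normalize over the channel dim
        self.act = nn.ReLU6()                                 # 3.1: cap activations at 6
        self.fc2 = weight_norm(nn.Linear(hidden_dim, dim))

    def forward(self, x):
        # x: (batch, tokens, dim); BatchNorm1d expects (N, C, L), hence the transposes
        x = self.fc1(x)
        x = self.bn(x.transpose(1, 2)).transpose(1, 2)
        x = self.act(x)
        return self.fc2(x)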

4. Try combining LN and BN ×

At the moment the ln+bn+wn scheme can be trained; it is just unclear whether it improves accuracy.

ln+bn+wn+relu6 helps speed up training, but again it is unclear whether it improves accuracy. (One possible arrangement of the LN+BN block is sketched after the table below.)

| No. | Model | T_train | ImNet top-1 (%) | Para (M) | Throughput (img/s) | Note |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Deit-T_ln_bn_wn |  | 69.87 BJ ing | 5.7 |  |  |
| 2 | Deit-T_ln_bn_wn_relu6 |  | 69.16 BJ ing | 5.7 |  |  |
| 3 | Deit-S_ln_bn_wn |  | 78.76 BJ ing | 22.1 |  |  |
| 4 | Deit-S_ln_bn_wn_relu6 |  | 78.24 BJ ing |  |  |  |
| 5 | Deit-B_ln_bn_wn |  | NaN (80.70, 234 epo) |  |  | see No. 6 in the table above |
| 6 | Deit-B_ln_bn_wn_relu6 |  | NaN (77.65, 172 epo) |  |  | see No. 8 in the table above |
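
The notes do not spell out exactly how LN and BN are combined in these ln_bn models. One plausible arrangement, purely an assumption on my part and reusing the FFNBnRelu6 sketch above, is to keep the usual pre-LN in front of the MLP while BN (with WN / ReLU6) stays inside it:

import torch.nn as nn

class LnBnFfnBlock(nn.Module):
    """Hypothetical 'ln+bn' block: pre-LN before the MLP, BN inside it."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.ln = nn.LayerNorm(dim)                 # the usual pre-norm
        self.ffn = FFNBnRelu6(dim, hidden_dim)      # sketched after the table in section 3

    def forward(self, x):
        return x + self.ffn(self.ln(x))             # residual connection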

5. Basic experiments needed for the paper

| No. | Model | T_train | ImNet top-1 (%) | Para (M) | Throughput (img/s) |
| --- | --- | --- | --- | --- | --- |
| 1 | Deit-T_bn_relu6 | BJ ing |  | 5.7 |  |
| 2 | Deit-T_bn_wn | BJ ing |  | 5.7 |  |
| 3 | Deit-T_bn_wn_relu6 | priv BJ ing |  | 5.7 |  |
| 4 | Deit-S_bn_relu6 | ipt BJ ing |  | 22.1 |  |
| 5 | Deit-S_bn_wn | ipt BJ ing |  | 22.1 |  |
| 6 | Deit-S_bn_wn_relu6 | priv BJ ing |  | 22.1 |  |
| 7 | Swin-T_bn_relu6 | BJ ing |  | 28.3 |  |
| 8 | Swin-T_bn_wn | BJ ing |  | 28.3 |  |
| 9 | Swin-T_bn_wn_relu6 |  | 80.75 ipt (81.3) | 28.3 |  |
| 11 | Swin-S_bn_relu6 | ipt BJ ing |  | 49.6 |  |
| 12 | Swin-S_bn_wn | ipt BJ ing |  | 49.7 |  |
| 13 | Swin-S_bn_wn_relu6 | priv SH ing |  | 49.7 |  |

6. Test inference speed

| No. | Model | ImNet top-1 (%) | batchsize | Throughput (img/s) | GFLOPs |
| --- | --- | --- | --- | --- | --- |
| 1 | Deit-T |  | 256? |  |  |
| 2 | Deit-T_bn_wn_relu6 |  | 256? |  |  |
| 3 | Deit-B |  | 256? |  |  |
| 4 | Deit-B_bn_wn_relu6 |  | 256? |  |  |
| 5 | Swin-T |  | 256? |  |  |
| 6 | Swin-T_bn_wn_relu6 |  | 256? |  |  |
| 7 | Swin-B |  | 256? |  |  |
| 8 | Swin-B_bn_wn_relu6 |  | 256? |  |  |
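
A rough sketch of how the throughput column could be measured; the batch size, image size, and warm-up count are placeholders, a CUDA device is assumed, and the explicit synchronization is what makes the GPU timing meaningful:

import time
import torch

@torch.no_grad()
def measure_throughput(model, batch_size=256, img_size=224, n_iters=30, device='cuda'):
    model = model.to(device).eval()
    images = torch.randn(batch_size, 3, img_size, img_size, device=device)
    for _ in range(10):                      # warm-up iterations
        model(images)
    torch.cuda.synchronize()                 # assumes a CUDA device
    start = time.time()
    for _ in range(n_iters):
        model(images)
    torch.cuda.synchronize()
    elapsed = time.time() - start
    return batch_size * n_iters / elapsed    # images per second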

7. Performance of Swin models on detection and segmentation

| No. | Backbone | ImgNet top1 | ImgNet top5 | COCO APbox | COCO APmask | ADE20k mIoU |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Swin-T | 81.3 | 95.6 | 50.5 | 43.5 | 46.1 |
| 2 | Swin-T_bn_wn_relu6 |  |  |  |  |  |

8. Run the new models with the original Deit training pipeline and check the results

9. TODO

  1. Add WN (± relu6) to the LN-BN models and run the experiments √

  2. Measure inference speed and throughput at test time; compare Deit-T with Deit-T_bn_wn_relu6

  3. Evaluate Swin-T_bn_relu6 and Swin-T_bn_wn on downstream tasks (detection & segmentation)

  4. Work out why the ln-bn-relu6 model collapses and what exactly relu6 contributes to training stability

  5. Deit-B's result is much lower than the paper's; train all models with the original Deit pipeline and check the results

weiren1998 commented 2 years ago

Transformer-based architectures have achieved great success in Natural Language Processing (NLP) and Computer Vision (CV). The Transformer was first proposed for NLP tasks. In Convolutional Neural Networks (CNNs), Batch Normalization (BN) is a widely adopted method for reducing internal covariate shift and improving the generalization of neural networks, while Layer Normalization (LN) is commonly used in Recurrent Neural Networks (RNNs) and Transformers. When the Transformer was applied to CV tasks, it simply followed the original NLP setting and used LN, because training with BN is unstable. However, BN may be better suited to image data, and it can accelerate inference by using the moving average and variance. In this paper, we show that BN makes the variance between channels very large, so we replace the original GeLU with the truncated ReLU6 and also use Weight Normalization (WN) to make training more stable. Our experiments show that the BN-based Transformer model can achieve results comparable to the LN-based model while speeding up inference by 12%.
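
On the inference-speed point: at test time BN reduces to a fixed per-channel affine transform built from the running mean and variance, so it can be folded into the preceding linear layer. A minimal sketch of that folding, assuming a plain nn.Linear followed by nn.BatchNorm1d over the same features (not necessarily the exact fusion used in the paper):

import torch
import torch.nn as nn

@torch.no_grad()
def fold_bn_into_linear(fc: nn.Linear, bn: nn.BatchNorm1d) -> nn.Linear:
    """Return a single Linear equivalent to bn(fc(x)) at inference time."""
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # per-channel scale
    fused = nn.Linear(fc.in_features, fc.out_features)
    fused.weight.copy_(fc.weight * scale[:, None])
    fused.bias.copy_((fc.bias - bn.running_mean) * scale + bn.bias)
    return fused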