weiren1998 / weiren1998.github.io

This is my blog.

Transformer Experiments | J球星的博客 #13

Open weiren1998 opened 2 years ago

weiren1998 commented 2 years ago

https://weiren1998.github.io/archives/697052e2.html#more

While exploring network architectures, one has to try many things and think them through; it also pays to write down the experimental data and the reasoning about the results, so as to build up intuition bit by bit.

I am currently working on my undergraduate thesis; the topic is designing an algorithm for detecting action keyframes in soccer videos. During the experiments I keep having small ideas that flash by. Many of them are minor tweaks for which I rarely run proper comparison experiments, so I am recording them here to make later improvements easier.

weiren1998 commented 2 years ago

Take a close look at BN's intermediate output values and visualize them

import cv2
import numpy as np
img = x.detach().cpu().numpy()  # move the torch tensor to numpy
img = (img - np.min(img)) / (np.max(img) - np.min(img)) * 255.0  # min-max normalize to [0, 255]
img = img.astype(np.uint8)  # cv2.imwrite expects uint8
cv2.imwrite('./100_msa.jpg', img[100])  # write out one feature map (index 100)
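
For context, `x` above is an intermediate activation pulled out of the model. A minimal way to grab it is a forward hook, roughly as below; the module path `model.blocks[0].norm1` is only an illustrative placeholder, and `model` / `images` are assumed to already exist:

import torch

feats = {}

def save_output(name):
    # store the module output so it can be visualized later
    def hook(module, inputs, output):
        feats[name] = output.detach()
    return hook

# register the hook on the layer of interest (placeholder name for a BN/LN layer)
handle = model.blocks[0].norm1.register_forward_hook(save_output('norm1'))
with torch.no_grad():
    model(images)            # one forward pass fills `feats`
handle.remove()
x = feats['norm1']           # the tensor visualized above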
weiren1998 commented 2 years ago

1. Load an intermediate checkpoint and visualize the intermediate-layer outputs

Conclusions:

  1. The feature maps output by LN have a small mean and a large variance; the variance is large because a few channels contain extreme maxima and minima
  2. The feature maps output by BN also have a small mean and a large variance; here the variance is large because the values of all features differ widely, with no obvious pattern
  3. For LN, apart from the channels with extreme values, the remaining channels are fairly close to one another, and this trend becomes more pronounced with depth
  4. The MLP changes the features more than the MSA layer does, which is one reason adding BN to the MLP can stabilize training

2. Test the effect of extreme values on BN and LN (a quick statistics / clipping sketch follows below)
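
To make observations like 1-3 concrete, and as a first probe for point 2, here is a rough sketch that dumps per-channel statistics and checks how much the extreme values drive the variance. `x` is assumed to be a captured feature map of shape (batch, tokens, channels), and the 5x-median threshold is an arbitrary choice:

import torch

per_channel_mean = x.mean(dim=(0, 1))     # one mean per channel
per_channel_std = x.std(dim=(0, 1))       # one std per channel
print('overall mean / std:', x.mean().item(), x.std().item())
outliers = torch.nonzero(per_channel_std > 5 * per_channel_std.median()).flatten()
print('channels with extreme values:', outliers.tolist())

# crude check of how much the extremes drive the variance:
# clamp to mean +/- 3 std and see how much the overall std drops
mu, sigma = x.mean().item(), x.std().item()
x_clipped = x.clamp(min=mu - 3 * sigma, max=mu + 3 * sigma)
print('std after clipping at 3 sigma:', x_clipped.std().item())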

weiren1998 commented 2 years ago

3. Try to train the Deit_bn model so that training is stable

3.1 Replace the activation function with ReLU6
  1. Motivation: gradients explode during normal training, so the activation in the FFN is changed to ReLU6 to shrink the activation outputs, preventing the features from blowing up in the forward pass; it also zeroes the gradient for large x
3.2 Add the Weight Norm mechanism
  1. Motivation: weight norm apparently stabilizes training in CNNs (see the paper for details); a rough sketch of the resulting FFN block is given after the results table below
3.3 The batch size also seems to be related to BN's stability
  1. For the Deit-L_bn_wn model, with a batch size of 128 the model went NaN at epoch 12, whereas after reducing the batch size to 32 the model became trainable
  2. Reducing the batch size seems to make BN training more stable; is this direction worth digging into further?
| No. | Model | BN | ImNet top-1 (%) | Para (M) | Throughput (img/s) | Embed dim | #heads | #depth | Note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | Deit-Ti | × | 72.56 (72.2) | 5 | 2536.5 | 192 | 3 | 12 |  |
|  | Deit-S | × | 80.47 (79.8) | 22 | 940.4 | 384 | 6 | 12 |  |
|  | Deit-B | × | 78.74 (81.8) | 86.6 | 292.3 | 768 | 12 | 12 |  |
|  | Deit-L | × | 77.83 | 153.2 |  | 1024 | 16 | 12 |  |
| 1 | Deit-B-bn_relu6 |  | 79.26 | 86.6 |  | 768 | 12 | 12 |  |
| 2 | Deit-L-bn_relu6 |  | NaN (35 epo) | 153.2 |  |  |  |  | see whether it collapses |
| 3 | Deit-B-bn_wn |  | 78.77 priv | 86.7 |  | 768 | 12 | 12 | check whether the WN module helps |
| 4 | Deit-L-bn_wn |  | BJ ing | 153.3 |  |  |  |  | see whether WN makes it collapse sooner |
| 5 | Deit-B-bn_wn_relu6 |  | 80.25 SH BJ ing | 86.7 |  |  |  |  | see whether relu6 and wn together help |
| 6 | Deit-B_ln_bn_wn |  | NaN (80.7, 234 epo) | 86.7 |  |  |  |  | see whether LN+BN+WN can be trained |
| 7 | Deit-B_ln_bn_relu6 |  | NaN (8 epo) | 86.7 |  |  |  |  | find the cause of the NaN |
| 8 | Deit-B_ln_bn_wn_relu6 |  | NaN (77.65, 172 epo) | 86.7 |  |  |  |  | see whether LN+BN+WN+relu6 can be trained |
| 9 | Swin-S_bn_wn_relu6 |  | ipt ing | 49.7 |  |  |  |  | see whether this works for Swin; batch 256 |
| 10 | Swin-T_bn_wn_relu6 |  | 80.75 ipt (81.3) | 28.3 |  |  |  |  | plan to check its effect on detection; batch 256 |
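
For reference, here is a minimal sketch of the FFN variant described in 3.1/3.2: BN inside the MLP, GeLU swapped for ReLU6, and weight norm on the linear layers. This is only my reading of the notes above, not the actual model code; the class name and the exact placement of BN are assumptions.

import torch.nn as nn
from torch.nn.utils import weight_norm

class FFNBnRelu6(nn.Module):
    """FFN variant: Linear -> BN -> ReLU6 -> Linear, with weight norm on both linears."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = weight_norm(nn.Linear(dim, hidden_dim))   # 3.2: weight norm for stability
        self.bn = nn.BatchNorm1d(hidden_dim)                  # normalize over the channel dim
        self.act = nn.ReLU6()                                 # 3.1: cap activations at 6
        self.fc2 = weight_norm(nn.Linear(hidden_dim, dim))

    def forward(self, x):
        # x: (batch, tokens, dim); BatchNorm1d expects (N, C, L), hence the transposes
        x = self.fc1(x)
        x = self.bn(x.transpose(1, 2)).transpose(1, 2)
        x = self.act(x)
        return self.fc2(x)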

4. Try combining LN and BN ×

At the moment the ln+bn+wn scheme can be trained; it is just unclear whether it improves accuracy.

ln+bn+wn+relu6 helps speed up training, but again it is unclear whether it improves accuracy. (One possible arrangement of the LN+BN block is sketched after the table below.)

| No. | Model | T_train | ImNet top-1 (%) | Para (M) | Throughput (img/s) | Note |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Deit-T_ln_bn_wn |  | 69.87 BJ ing | 5.7 |  |  |
| 2 | Deit-T_ln_bn_wn_relu6 |  | 69.16 BJ ing | 5.7 |  |  |
| 3 | Deit-S_ln_bn_wn |  | 78.76 BJ ing | 22.1 |  |  |
| 4 | Deit-S_ln_bn_wn_relu6 |  | 78.24 BJ ing |  |  |  |
| 5 | Deit-B_ln_bn_wn |  | NaN (80.70, 234 epo) |  |  | see No. 6 in the table above |
| 6 | Deit-B_ln_bn_wn_relu6 |  | NaN (77.65, 172 epo) |  |  | see No. 8 in the table above |
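
The notes do not spell out exactly how LN and BN are combined in these ln_bn models. One plausible arrangement, purely an assumption on my part and reusing the FFNBnRelu6 sketch above, is to keep the usual pre-LN in front of the MLP while BN (with WN / ReLU6) stays inside it:

import torch.nn as nn

class LnBnFfnBlock(nn.Module):
    """Hypothetical 'ln+bn' block: pre-LN before the MLP, BN inside it."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.ln = nn.LayerNorm(dim)                 # the usual pre-norm
        self.ffn = FFNBnRelu6(dim, hidden_dim)      # sketched after the table in section 3

    def forward(self, x):
        return x + self.ffn(self.ln(x))             # residual connection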

5. Basic experiments needed for the paper

| No. | Model | T_train | ImNet top-1 (%) | Para (M) | Throughput (img/s) |
| --- | --- | --- | --- | --- | --- |
| 1 | Deit-T_bn_relu6 | BJ ing |  | 5.7 |  |
| 2 | Deit-T_bn_wn | BJ ing |  | 5.7 |  |
| 3 | Deit-T_bn_wn_relu6 | priv BJ ing |  | 5.7 |  |
| 4 | Deit-S_bn_relu6 | ipt BJ ing |  | 22.1 |  |
| 5 | Deit-S_bn_wn | ipt BJ ing |  | 22.1 |  |
| 6 | Deit-S_bn_wn_relu6 | priv BJ ing |  | 22.1 |  |
| 7 | Swin-T_bn_relu6 | BJ ing |  | 28.3 |  |
| 8 | Swin-T_bn_wn | BJ ing |  | 28.3 |  |
| 9 | Swin-T_bn_wn_relu6 |  | 80.75 ipt (81.3) | 28.3 |  |
| 11 | Swin-S_bn_relu6 | ipt BJ ing |  | 49.6 |  |
| 12 | Swin-S_bn_wn | ipt BJ ing |  | 49.7 |  |
| 13 | Swin-S_bn_wn_relu6 | priv SH ing |  | 49.7 |  |

6. Test inference speed

| No. | Model | ImNet top-1 (%) | batchsize | Throughput (img/s) | GFLOPs |
| --- | --- | --- | --- | --- | --- |
| 1 | Deit-T |  | 256? |  |  |
| 2 | Deit-T_bn_wn_relu6 |  | 256? |  |  |
| 3 | Deit-B |  | 256? |  |  |
| 4 | Deit-B_bn_wn_relu6 |  | 256? |  |  |
| 5 | Swin-T |  | 256? |  |  |
| 6 | Swin-T_bn_wn_relu6 |  | 256? |  |  |
| 7 | Swin-B |  | 256? |  |  |
| 8 | Swin-B_bn_wn_relu6 |  | 256? |  |  |
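
A rough sketch of how the throughput column could be measured; the batch size, image size, and warm-up count are placeholders, a CUDA device is assumed, and the explicit synchronization is what makes the GPU timing meaningful:

import time
import torch

@torch.no_grad()
def measure_throughput(model, batch_size=256, img_size=224, n_iters=30, device='cuda'):
    model = model.to(device).eval()
    images = torch.randn(batch_size, 3, img_size, img_size, device=device)
    for _ in range(10):                      # warm-up iterations
        model(images)
    torch.cuda.synchronize()                 # assumes a CUDA device
    start = time.time()
    for _ in range(n_iters):
        model(images)
    torch.cuda.synchronize()
    elapsed = time.time() - start
    return batch_size * n_iters / elapsed    # images per second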

7. Performance of Swin models on detection and segmentation

| No. | Backbone | ImgNet top1 | ImgNet top5 | COCO APbox | COCO APmask | ADE20k mIoU |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Swin-T | 81.3 | 95.6 | 50.5 | 43.5 | 46.1 |
| 2 | Swin-T_bn_wn_relu6 |  |  |  |  |  |

8. Run the new models with the original Deit training pipeline and check the results

9. TODO

  1. Add WN (± relu6) to the LN-BN models and run the experiments √

  2. Measure inference speed and throughput at test time; compare Deit-T with Deit-T_bn_wn_relu6

  3. Evaluate Swin-T_bn_relu6 and Swin-T_bn_wn on downstream tasks (detection & segmentation)

  4. Work out why the ln-bn-relu6 model collapses and what exactly relu6 contributes to training stability

  5. Deit-B's result is much lower than the paper's; train all models with the original Deit pipeline and check the results

weiren1998 commented 2 years ago

Transformer-based architectures have achieved great success in Natural Language Processing (NLP) and Computer Vision (CV). The Transformer was first proposed for NLP tasks. In Convolutional Neural Networks (CNNs), Batch Normalization (BN) is a widely adopted method for reducing internal covariate shift and improving the generalization of neural networks, while Layer Normalization (LN) is commonly used in Recurrent Neural Networks (RNNs) and Transformers. When the Transformer was applied to CV tasks, it simply followed the original NLP setting and used LN, because training with BN is unstable. However, BN may be better suited to image data, and it can accelerate inference by using the moving average and variance. In this paper, we show that BN makes the variance between channels very large, so we replace the original GeLU with the truncated ReLU6 and also use Weight Normalization (WN) to make training more stable. Our experiments show that the BN-based Transformer model can achieve results comparable to the LN-based model while speeding up inference by 12%.
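
On the inference-speed point: at test time BN reduces to a fixed per-channel affine transform built from the running mean and variance, so it can be folded into the preceding linear layer. A minimal sketch of that folding, assuming a plain nn.Linear followed by nn.BatchNorm1d over the same features (not necessarily the exact fusion used in the paper):

import torch
import torch.nn as nn

@torch.no_grad()
def fold_bn_into_linear(fc: nn.Linear, bn: nn.BatchNorm1d) -> nn.Linear:
    """Return a single Linear equivalent to bn(fc(x)) at inference time."""
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # per-channel scale
    fused = nn.Linear(fc.in_features, fc.out_features)
    fused.weight.copy_(fc.weight * scale[:, None])
    fused.bias.copy_((fc.bias - bn.running_mean) * scale + bn.bias)
    return fused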