import cv2
import numpy as np

img = x.detach().cpu().numpy()  # convert the torch feature tensor to a numpy array
img = (img - np.min(img)) / (np.max(img) - np.min(img)) * 255.0  # min-max normalize to [0, 255]
img = img.astype(np.uint8)  # cv2.imwrite expects 8-bit integers
cv2.imwrite('./100_msa.jpg', img[100])  # write out one feature channel as an image
Conclusions:
- The feature maps after LN have a small mean but a large variance; the large variance comes from a few channels that contain extreme maxima and minima (the sketch after this list shows how these per-channel statistics can be checked).
- The feature maps after BN also have a small mean and a large variance, but here the variance is large because the values of all features differ widely, with no clear pattern.
- For LN, once the channels with extreme maxima/minima are excluded, the remaining channels are fairly close to each other, and this trend becomes more pronounced with depth.
- The MLP changes more than the MSA layer, which is one reason why adding BN to the MLP can stabilize training.
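As a way to check these observations, the per-channel statistics of a captured feature map can be computed directly. This is a minimal sketch under my own assumptions: `feat` is a feature tensor with the channel dimension last, e.g. of shape (B, N, C), captured via a hook or in pdb; the function name is hypothetical.

```python
import torch

def channel_stats(feat: torch.Tensor, topk: int = 5):
    """Print overall and per-channel statistics of a channel-last feature map."""
    flat = feat.detach().float().reshape(-1, feat.shape[-1])  # collapse all non-channel dims
    print(f"overall mean {flat.mean().item():.4f}, overall var {flat.var().item():.4f}")
    # channels with the most extreme values are the ones driving the large variance
    per_ch_max = flat.max(dim=0).values
    per_ch_min = flat.min(dim=0).values
    print("largest per-channel maxima:", per_ch_max.topk(topk).values.tolist())
    print("smallest per-channel minima:", per_ch_min.topk(topk, largest=False).values.tolist())
    print("std of per-channel means:", flat.mean(dim=0).std().item())
```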
First, look at the feature maps of an LN model that trains to completion: certain channels along C in the LN feature maps contain extreme maxima and minima, because some LN weights are very large or very small.
# base ln epoch 50 layer1
(Pdb) self.norm1.weight.min()
tensor(-0.0013, device='cuda:0')
(Pdb) self.norm1.weight.max()
tensor(0.3647, device='cuda:0')
(Pdb) self.norm2.weight.max()
tensor(1.5141, device='cuda:0')
(Pdb) self.norm2.weight.min()
tensor(0.0197, device='cuda:0')
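The same check can be run over every block instead of inside pdb. A rough sketch, assuming a DeiT/timm-style ViT where `model.blocks[i].norm1` and `.norm2` are the LayerNorm modules (the attribute names match the dump above; the function name is my own):

```python
import torch

@torch.no_grad()
def dump_norm_weight_range(model):
    """Print the min/max of the LayerNorm affine weights in every transformer block."""
    for i, blk in enumerate(model.blocks):
        w1, w2 = blk.norm1.weight, blk.norm2.weight
        print(f"layer {i}: norm1 [{w1.min().item():.4f}, {w1.max().item():.4f}], "
              f"norm2 [{w2.min().item():.4f}, {w2.max().item():.4f}]")
```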
Model | BN | ImNet top-1 (%) | Para (M) | Throughput (img/s) / Note | Embed dim | #heads | #depth
---|---|---|---|---|---|---|---
Deit-Ti | × | 72.56 (72.2) | 5 | 2536.5 | 192 | 3 | 12
Deit-S | × | 80.47 (79.8) | 22 | 940.4 | 384 | 6 | 12
Deit-B | × | 78.74 (81.8) | 86 | 292.3 | 768 | 12 | 12
Deit-L | × | 77.83 | 153.2 | | 1024 | 16 | 12
Deit-Ti-bn | √ | 70.26 | 5.7 | | 192 | 3 | 12
Deit-S-bn | √ | 77.14 (236 epo) | 22.1 | acc drops sharply around epoch 237, look into why | 384 | 6 | 12
Deit-B-bn | √ | 48.14 (13 epo) | 86.6 | | 768 | 12 | 12
Deit-L-bn | √ | 41.88 (9 epo) | 153.2 | | 1024 | 16 | 12
No. | Model | BN | ImNet top-1 (%) | Para (M) | Throughput (img/s) / Note | Embed dim | #heads | #depth
---|---|---|---|---|---|---|---|---
- | Deit-Ti | × | 72.56 (72.2) | 5 | 2536.5 | 192 | 3 | 12
- | Deit-S | × | 80.47 (79.8) | 22 | 940.4 | 384 | 6 | 12
- | Deit-B | × | 78.74 (81.8) | 86.6 | 292.3 | 768 | 12 | 12
- | Deit-L | × | 77.83 | 153.2 | | 1024 | 16 | 12
1 | Deit-B-bn_relu6 | √ | 79.26 | 86.6 | | 768 | 12 | 12
2 | Deit-L-bn_relu6 | √ | NaN (35 epo) | 153.2 | see whether it collapses | | |
3 | Deit-B-bn_wn | √ | 78.77 priv | 86.7 | check whether the WN module helps | 768 | 12 | 12
4 | Deit-L-bn_wn | √ | BJ ing | 153.3 | check whether WN makes it collapse sooner | | |
5 | Deit-B-bn_wn_relu6 | √ | 80.25 SH BJ ing | 86.7 | check whether relu6 and WN together give a gain | | |
6 | Deit-B_ln_bn_wn | √ | NaN (80.7 234 epo) | 86.7 | check whether LN+BN+WN can be trained | | |
7 | Deit-B_ln_bn_relu6 | √ | NaN (8 epo) | 86.7 | find the cause of the NaN | | |
8 | Deit-B_ln_bn_wn_relu6 | √ | NaN (77.65 172 epo) | 86.7 | check whether LN+BN+WN+relu6 can be trained | | |
9 | Swin-S_bn_wn_relu6 | √ | ipt ing | 49.7 | check whether this works for Swin (batch 256) | | |
10 | Swin-T_bn_wn_relu6 | √ | 80.75 ipt (81.3) | 28.3 | next: evaluate on detection (batch 256) | | |
At the moment the LN+BN+WN scheme can be trained; it is just unclear whether it actually improves accuracy.
LN+BN+WN+relu6 helps speed up training, but again it is unclear whether it brings an accuracy gain.
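For reference, the combination being tested here (BN in place of LN, weight-normalized linear layers, ReLU6 in place of GeLU) can be sketched as an MLP block like the one below. This is only an illustration of the idea with my own module layout and names, not the exact module used in these runs:

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

class MlpBnWnRelu6(nn.Module):
    """Pre-norm MLP block variant: BN instead of LN, WN on the linears, ReLU6 instead of GeLU."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.norm = nn.BatchNorm1d(dim)                      # BN replaces the usual pre-MLP LN
        self.fc1 = weight_norm(nn.Linear(dim, hidden_dim))   # WN on the expansion layer
        self.act = nn.ReLU6()                                # truncated activation caps extreme values
        self.fc2 = weight_norm(nn.Linear(hidden_dim, dim))   # WN on the projection layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, N, C) token sequence
        # BatchNorm1d expects (B, C, N), so transpose around the normalization
        y = self.norm(x.transpose(1, 2)).transpose(1, 2)
        y = self.fc2(self.act(self.fc1(y)))
        return x + y                                         # residual connection
```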
No. | Model | T_train | ImNet top-1 (%) | Para (M) | Throughput (img/s) |
---|---|---|---|---|---|
1 | Deit-T_ln_bn_wn | 69.87 | BJ ing | 5.7 | |
2 | Deit-T_ln_bn_wn_relu6 | 69.16 | BJ ing | 5.7 | |
3 | Deit-S_ln_bn_wn | 78.76 | BJ ing | 22.1 | |
4 | Deit-S_ln_bn_wn_relu6 | 78.24 | BJ ing | ||
5 | Deit-B_ln_bn_wn | NaN (80.70 234 epo) | see setting 6 in the table above | |
6 | Deit-B_ln_bn_wn_relu6 | NaN (77.65 172 epo) | see setting 8 in the table above | |
No. | Model | T_train | ImNet top-1 (%) | Para (M) | Throughput (img/s) |
---|---|---|---|---|---|
1 | Deit-T_bn_relu6 | √ | BJ ing | 5.7 | |
2 | Deit-T_bn_wn | √ | BJ ing | 5.7 | |
3 | Deit-T_bn_wn_relu6 | √ | priv BJ ing | 5.7 | |
4 | Deit-S_bn_relu6 | √ | ipt BJ ing | 22.1 | |
5 | Deit-S_bn_wn | √ | ipt BJ ing | 22.1 | |
6 | Deit-S_bn_wn_relu6 | √ | priv BJ ing | 22.1 | |
7 | Swin-T_bn_relu6 | √ | BJ ing | 28.3 | |
8 | Swin-T_bn_wn | √ | BJ ing | 28.3 | |
9 | Swin-T_bn_wn_relu6 | √ | 80.75 ipt (81.3) | 28.3 | |
11 | Swin-S_bn_relu6 | √ | ipt BJ ing | 49.6 | |
12 | Swin-S_bn_wn | √ | ipt BJ ing | 49.7 | |
13 | Swin-S_bn_wn_relu6 | √ | priv SH ing | 49.7 |
No. | Model | ImNet top-1 (%) | batchsize | Throughput (img/s) | GFLOPs |
---|---|---|---|---|---|
1 | Deit-T | | 256? | |
2 | Deit-T_bn_wn_relu6 | | 256? | |
3 | Deit-B | | 256? | |
4 | Deit-B_bn_wn_relu6 | | 256? | |
5 | Swin-T | | 256? | |
6 | Swin-T_bn_wn_relu6 | | 256? | |
7 | Swin-B | | 256? | |
8 | Swin-B_bn_wn_relu6 | | 256? | |
No. | Backbone | ImgNet top1 | ImgNet top5 | COCO APbox | COCO APmask | ADE20k mIoU |
---|---|---|---|---|---|---|
1 | Swin-T | 81.3 | 95.6 | 50.5 | 43.5 | 46.1 |
2 | Swin-T_bn_wn_relu6 | | | | |
- Add WN (± relu6) on top of the LN-BN model and run the experiments √
- Measure inference speed and inference-time throughput, comparing Deit-T with Deit-T_bn_wn_relu6 (a measurement sketch follows this list)
- Evaluate Swin-T_bn_relu6 and Swin-T_bn_wn on downstream tasks (detection & segmentation)
- Investigate why the ln-bn-relu6 model collapses, and what exactly relu6 contributes to training stability
- The Deit-B result is much lower than in the paper; retrain all models with the original DeiT training code and check the results
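For the throughput comparison mentioned in the list above (Deit-T vs Deit-T_bn_wn_relu6), a rough single-GPU measurement could look like the sketch below; the batch size, warm-up count, and iteration count are my own choices, not the settings used in the tables:

```python
import time
import torch

@torch.no_grad()
def throughput(model: torch.nn.Module, batch_size: int = 256, img_size: int = 224,
               iters: int = 30, warmup: int = 10) -> float:
    """Return a rough images/second figure for inference on a single GPU."""
    model = model.eval().cuda()
    x = torch.randn(batch_size, 3, img_size, img_size, device='cuda')
    for _ in range(warmup):          # warm-up to exclude CUDA init / cuDNN autotune cost
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()         # wait for all kernels before stopping the clock
    return iters * batch_size / (time.time() - start)
```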
Transformer-based architectures have achieved great success in Natural Language Processing (NLP) and Computer Vision (CV). The Transformer was first proposed for NLP tasks. In Convolutional Neural Networks (CNNs), Batch Normalization (BN) is a widely adopted method to reduce internal covariate shift and improve generalization, whereas Layer Normalization (LN) is commonly used in Recurrent Neural Networks (RNNs) and Transformers. When the Transformer was applied to CV tasks, it simply followed the original NLP setting and kept LN, because training with BN is unstable. However, BN may be better suited to image data, and it can accelerate inference by folding in the running mean and variance. In this paper, we find that BN makes the variance across channels very large, so we replace the original GeLU with a truncated ReLU6 and additionally use Weight Normalization (WN) to stabilize training. Our experiments show that the BN-based Transformer achieves results comparable to the LN-based model while speeding up inference by 12%.
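The inference speed-up claimed above comes from the fact that, at test time, BN is a fixed affine transform built from the running mean and variance, so it can be folded into the preceding linear layer (LN cannot, since its statistics depend on each input). A minimal sketch of that folding, assuming a `nn.Linear` directly followed by a `nn.BatchNorm1d` over its output features:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_bn_into_linear(fc: nn.Linear, bn: nn.BatchNorm1d) -> nn.Linear:
    """Fold an inference-mode BatchNorm1d into the Linear layer that feeds it."""
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)        # gamma / sqrt(var + eps)
    fused = nn.Linear(fc.in_features, fc.out_features)
    fused.weight.copy_(fc.weight * scale[:, None])                 # rescale each output row
    bias = fc.bias if fc.bias is not None else torch.zeros(fc.out_features)
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)   # absorb mean shift and beta
    return fused
```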
https://weiren1998.github.io/archives/697052e2.html#more
While exploring network architectures, a lot of experimentation and thinking is needed, and it also helps to write down the experimental data and the reasoning behind the results, so that intuition accumulates bit by bit.
Recently I have been working on my undergraduate thesis, whose topic is designing a key-frame detection algorithm for actions in soccer videos. During the experiments many small ideas come up and sometimes slip away; they are mostly minor tweaks that rarely get proper ablation experiments, so I am recording them here to make later improvements easier.