Closed JieTian-SALEN closed 8 months ago
Hi,
Actually, you are right.
Since this work mainly focused on the theoretical mechanism of weight decay and its large-gradient-norm pitfalls, we did not evaluate SWD on modern neural networks larger than ResNet50. That said, it does work for ResNet50 on ImageNet.
AdamS has NO universal advantage over AdamW, especially for Transformers. I think the performance bottleneck of training Transformers is quite different from that of training CNNs, both theoretically and empirically.
We may need to design novel weight decay strategies for Transformers.
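For readers comparing the two decay schemes discussed above, here is a minimal NumPy sketch contrasting AdamW's decoupled weight decay with a stable-weight-decay (SWD) variant in the spirit of AdamS. This is my reading, not the authors' reference implementation: the key difference sketched here is that SWD rescales the decay term by the square root of the mean second-moment estimate, so the effective decay strength tracks the adaptive step size. All hyperparameter names and defaults are illustrative assumptions.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=1e-2):
    """One AdamW step: weight decay is decoupled from the adaptive update."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

def adams_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=1e-2):
    """One AdamS-style step (my reading of stable weight decay):
    the decay term is divided by sqrt(mean(v_hat)), a single global
    scalar, so decay strength follows the overall adaptive step size."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    denom = np.sqrt(np.mean(v_hat)) + eps  # global scalar, not per-coordinate
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps)
                          + wd * theta / denom)
    return theta, m, v
```

The intuition for the design choice: in AdamW the decay term `wd * theta` is applied at a fixed scale regardless of the gradient statistics, while the SWD-style variant lets the decay shrink or grow with the typical adaptive denominator.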
Got it. I've been working on some low-level restoration tasks using 2D/3D conv networks recently. Since the data size is not that big, I think trying AdamS is a promising idea. Thanks again for contributing such solid work.
Solid and excellent work! As a practitioner in engineering applications, I wonder: have you tested the advantages of AdamS at larger experimental scales, for example ImageNet classification, COCO object detection, or even Transformer models with billions of parameters? Most of the conclusions in the paper seem to be based on fairly small-scale experiments and have not been validated on today's modern architectures.