zeke-xie / stable-weight-decay-regularization

[NeurIPS 2023] The PyTorch Implementation of Scheduled (Stable) Weight Decay.

Some questions about the scale of the experiments #2

Closed: JieTian-SALEN closed this issue 8 months ago

JieTian-SALEN commented 8 months ago

Solid and excellent work. As someone working on engineering applications, I wonder whether you have tested the advantages of AdamS at larger experimental scales, for example ImageNet classification, COCO object detection, or even Transformer models with billions of parameters. Most of the conclusions in the paper seem to be based on fairly small-scale experiments and have not been validated on today's modern architectures.

zeke-xie commented 8 months ago

Hi,

Actually, you are right.

Since this work mainly focused on the theoretical mechanism of weight decay and its large-gradient-norm pitfalls, we did not evaluate SWD on modern neural networks larger than ResNet50. It does, however, work at least for ResNet50 on ImageNet.
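For readers who want to see the mechanism concretely, below is a minimal hand-rolled sketch (not this repo's implementation) contrasting AdamW-style decoupled weight decay with a "stable" variant whose decay strength is rescaled by the square root of the mean second-moment estimate, which is roughly the idea behind AdamS/SWD. The per-tensor mean, function name, and all hyperparameter values are illustrative assumptions.

```python
import torch

def adam_step(p, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
              weight_decay=5e-4, stable=True):
    """One hand-rolled Adam step with either decoupled (AdamW-style) or
    'stable' weight decay normalized by the mean second-moment estimate."""
    state["t"] += 1
    t = state["t"]
    m, v = state["m"], state["v"]
    m.mul_(betas[0]).add_(grad, alpha=1 - betas[0])            # first moment
    v.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])  # second moment
    m_hat = m / (1 - betas[0] ** t)                            # bias correction
    v_hat = v / (1 - betas[1] ** t)
    denom = v_hat.sqrt().add_(eps)
    p.addcdiv_(m_hat, denom, value=-lr)                        # Adam update
    if stable:
        # Stable/scheduled decay: lambda is divided by sqrt(mean(v_hat)), so the
        # penalty adapts to the typical gradient magnitude instead of being a
        # fixed rate. (The paper uses the mean over all parameters; a per-tensor
        # mean is used here only for brevity.)
        p.add_(p, alpha=-lr * weight_decay / (v_hat.mean().sqrt().item() + eps))
    else:
        # Decoupled weight decay, as in AdamW.
        p.add_(p, alpha=-lr * weight_decay)

# Toy usage on a single parameter tensor.
p = torch.randn(8)
state = {"m": torch.zeros_like(p), "v": torch.zeros_like(p), "t": 0}
for _ in range(3):
    grad = torch.randn_like(p)
    adam_step(p, grad, state, stable=True)
```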

AdamS has NO universal advantage over AdamW, especially for Transformers. I think the performance bottleneck of training Transformers is quite different from that of training CNNs, both theoretically and empirically.

We may need to design novel weight decay strategies for Transformers.

JieTian-SALEN commented 8 months ago

Got it. I've been working on some low-level recovery tasks with 2D/3D conv networks recently, and since the data scale is not that big, I think trying AdamS is a promising idea. Thanks again for contributing such solid work.
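If it helps, here is a hypothetical drop-in sketch for such a conv network. The `swd_optim` import path and the constructor signature are assumptions based on typical PyTorch optimizer APIs, so check this repo's README for the actual usage.

```python
import torch
import torch.nn as nn
from swd_optim import AdamS  # assumed module name; see the repo README

# A tiny restoration-style conv network as a stand-in model.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 3, 3, padding=1))

# Assuming an AdamW-like hyperparameter interface, swapping optimizers is one line.
optimizer = AdamS(model.parameters(), lr=1e-3, betas=(0.9, 0.999),
                  eps=1e-8, weight_decay=5e-4)

x = torch.randn(4, 3, 32, 32)
target = torch.randn(4, 3, 32, 32)
loss = nn.functional.mse_loss(model(x), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```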