Closed · luke-mcdermott-mi closed this issue 1 year ago
Hi, apologies for the late response. Though we did not run a proper experiment with this method on larger networks, we found in practice that it struggles to scale to ResNets and much deeper architectures. I believe there are a few reasons for this. First, the bias and variance of the NNGP approximation we use scale with the ratio D/W, where D = depth and W = width, so deeper networks need disproportionately more width for the approximation to remain accurate. Second, the NTK/NNGP model of neural network training tends to be a poor approximation for these deeper models, due to higher learning rates, components such as batchnorm, and the increased variance associated with the finite-width NTK/NNGP.
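For anyone who wants to see the D/W effect concretely, here is a minimal sketch (not the paper's code) using the `neural_tangents` library: it compares the analytic infinite-width NNGP kernel of a plain ReLU MLP against a Monte Carlo estimate of the finite-width kernel, at a few depth/width settings. The toy architecture, batch size, and sample count are illustrative assumptions.

```python
# Sketch: finite-width NNGP kernel vs. its infinite-width limit as D/W grows.
# Assumes a toy fully-connected ReLU network; not the repo's actual model.
import jax
import jax.numpy as jnp
import neural_tangents as nt
from neural_tangents import stax

def mlp(depth, width):
    # depth hidden Dense+ReLU blocks followed by a scalar readout layer
    layers = []
    for _ in range(depth):
        layers += [stax.Dense(width), stax.Relu()]
    layers += [stax.Dense(1)]
    return stax.serial(*layers)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (8, 16))  # small batch of toy inputs

for depth, width in [(2, 512), (8, 512), (8, 64)]:  # increasing D/W
    init_fn, apply_fn, kernel_fn = mlp(depth, width)
    # Exact infinite-width NNGP kernel (independent of the finite width).
    k_inf = kernel_fn(x, None, 'nngp')
    # Monte Carlo estimate of the finite-width kernel, averaged over
    # random initializations of the width-`width` network.
    mc_kernel_fn = nt.monte_carlo_kernel_fn(init_fn, apply_fn, key,
                                            n_samples=64)
    k_fin = mc_kernel_fn(x, None, 'nngp')
    rel_err = jnp.linalg.norm(k_fin - k_inf) / jnp.linalg.norm(k_inf)
    print(f'depth={depth} width={width} D/W={depth / width:.3f} '
          f'relative kernel error={rel_err:.3f}')
```

Under this setup you would expect the relative kernel error to grow with D/W, which is one way to see why the approximation degrades for much deeper architectures at fixed width.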
How well does this method generalize to larger architectures? What would be the best next step in tweaking the algorithm so that it generalizes to ResNets?