Closed · luke-mcdermott-mi closed this issue 1 year ago
Hi, apologies for the late response. Though we did not run a proper experiment with this method on larger networks, we found in practice that it struggles to scale to ResNets and much deeper architectures. I believe there are a few reasons for this. First, the bias and variance of the NNGP approximation we use scale with the ratio D/W, where D = depth and W = width, so deeper networks need disproportionately more width for the approximation to remain accurate. Second, the NTK/NNGP model of neural network training tends to be a poor approximation for these deeper models, due to higher learning rates, components such as batchnorm, and the increased variance associated with the finite-width NTK/NNGP.
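For anyone who wants to see the D/W effect concretely, here is a minimal sketch (not the paper's code) using the `neural_tangents` library: it compares the analytic infinite-width NNGP kernel of a plain ReLU MLP against a Monte Carlo estimate of the finite-width kernel, at a few depth/width settings. The toy architecture, batch size, and sample count are illustrative assumptions.

```python
# Sketch: finite-width NNGP kernel vs. its infinite-width limit as D/W grows.
# Assumes a toy fully-connected ReLU network; not the repo's actual model.
import jax
import jax.numpy as jnp
import neural_tangents as nt
from neural_tangents import stax

def mlp(depth, width):
    # depth hidden Dense+ReLU blocks followed by a scalar readout layer
    layers = []
    for _ in range(depth):
        layers += [stax.Dense(width), stax.Relu()]
    layers += [stax.Dense(1)]
    return stax.serial(*layers)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (8, 16))  # small batch of toy inputs

for depth, width in [(2, 512), (8, 512), (8, 64)]:  # increasing D/W
    init_fn, apply_fn, kernel_fn = mlp(depth, width)
    # Exact infinite-width NNGP kernel (independent of the finite width).
    k_inf = kernel_fn(x, None, 'nngp')
    # Monte Carlo estimate of the finite-width kernel, averaged over
    # random initializations of the width-`width` network.
    mc_kernel_fn = nt.monte_carlo_kernel_fn(init_fn, apply_fn, key,
                                            n_samples=64)
    k_fin = mc_kernel_fn(x, None, 'nngp')
    rel_err = jnp.linalg.norm(k_fin - k_inf) / jnp.linalg.norm(k_inf)
    print(f'depth={depth} width={width} D/W={depth / width:.3f} '
          f'relative kernel error={rel_err:.3f}')
```

Under this setup you would expect the relative kernel error to grow with D/W, which is one way to see why the approximation degrades for much deeper architectures at fixed width.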
How well does this method generalize to larger architectures? What would be the best next step in tweaking the algorithm so that it generalizes to ResNets?