yolky / RFAD

Code for the paper "Efficient Dataset Distillation using Random Feature Approximation"

Cross Architecture Generalization #3

Closed luke-mcdermott-mi closed 1 year ago

luke-mcdermott-mi commented 1 year ago

How well does this method generalize to larger architectures? What would be the best next step for tweaking the algorithm so that it generalizes to ResNets?

yolky commented 1 year ago

Hi, apologies for the late response. Though we did not run a proper experiment on larger networks, in practice we found that this method struggles to scale to ResNets and much deeper architectures. I believe there are a few reasons for this. First, the NNGP approximation we use has bias/variance scaling with D/W, where D = depth and W = width, so at a fixed width a deeper network needs proportionally more width (or more random models) to keep the kernel estimate accurate. Second, the NTK/NNGP model of neural network training tends to be a poor approximation for these deeper models, due to higher learning rates, components such as batchnorm, and the increased variance associated with the finite NTK/NNGP.
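To make the D/W point concrete, here is a minimal sketch of the idea behind a random-feature NNGP estimate (this is not the code from this repo; `make_mlp` and `empirical_nngp` are hypothetical helpers, and the MLP stands in for whatever architecture defines the kernel). It estimates the kernel by averaging feature inner products over several randomly initialized networks and reports the Monte Carlo spread of that estimate at two depths:

```python
import torch
import torch.nn as nn

def make_mlp(depth, width, in_dim=32):
    # Random ReLU MLP with He-initialized weights; the final hidden
    # activations serve as random features for the kernel estimate.
    layers, d = [], in_dim
    for _ in range(depth):
        lin = nn.Linear(d, width)
        nn.init.normal_(lin.weight, std=(2.0 / d) ** 0.5)  # He init keeps the signal scale stable across depth
        nn.init.zeros_(lin.bias)
        layers += [lin, nn.ReLU()]
        d = width
    return nn.Sequential(*layers)

@torch.no_grad()
def empirical_nngp(x, depth, width=256, n_models=8):
    # Monte Carlo NNGP estimate: average feature inner products over
    # several independently sampled random networks.
    ks = []
    for _ in range(n_models):
        phi = make_mlp(depth, width)(x)
        ks.append(phi @ phi.T / width)  # normalize by width
    ks = torch.stack(ks)
    return ks.mean(0), ks.std(0)  # kernel estimate and its Monte Carlo spread

x = torch.randn(16, 32)
for depth in (2, 8):
    k_mean, k_std = empirical_nngp(x, depth)
    print(f"depth={depth}: mean |K| = {k_mean.abs().mean():.3f}, "
          f"relative std = {(k_std / k_mean.abs().clamp_min(1e-6)).mean():.3f}")
```

At a fixed width, the deeper network's kernel estimate is noticeably noisier, which is the D/W scaling described above; averaging over more random models reduces the variance, but the cost of doing so grows with depth.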