Closed · sentialx closed this issue 2 years ago
In the paper, under the limitations section, it is stated that "Second, our current G.pt models struggle to extrapolate to losses and errors not present in the pre-training data." What does "struggle" mean here? Is the model completely unable to optimize further, or does it just not reach the desired loss exactly?
Hi @sentialx, thanks for the question. The models are usually unable to generate parameters for losses outside the training set's range. The specific behavior depends on how "extreme" the prompted loss is relative to those seen during training. For example, if you ask for a loss slightly lower than the best in the training set (e.g., asking for 0 test loss when the best in MNIST is ~0.2), the model will typically output a ~0.2-loss network (Figure 5 in the paper). On the other hand, if you ask for a drastically lower value (e.g., asking for zero loss on CIFAR-10 when the lowest in the training set is ~1.1), the model will give you a worse-performing network than if you had asked for a slightly higher loss closer to the bounds of the training set. Hope this helps!
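For anyone experimenting with prompted losses, here is a minimal sketch of the idea in Python. The `g_pt.generate(...)` call and `TRAIN_SET_MIN_LOSS` constant are hypothetical placeholders, not the repository's actual API; the point is only that clamping the prompted loss to the training set's range avoids the extrapolation failure described above.

```python
# Hypothetical sketch -- not the actual G.pt API. `g_pt.generate(...)` and
# TRAIN_SET_MIN_LOSS are placeholders used to illustrate the behavior above.
import torch

TRAIN_SET_MIN_LOSS = 0.2  # best test loss among the pre-training checkpoints (MNIST example)

def generate_params(g_pt, current_params: torch.Tensor, target_loss: float) -> torch.Tensor:
    """Prompt the model for parameters that achieve `target_loss`.

    Prompting for a loss far below TRAIN_SET_MIN_LOSS (e.g., 0.0 on CIFAR-10
    when the training-set minimum is ~1.1) tends to produce networks that are
    no better, and sometimes worse, than prompting near the training-set bound.
    Clamping keeps the prompt inside the range the model was trained on.
    """
    safe_target = max(target_loss, TRAIN_SET_MIN_LOSS)
    prompt = torch.tensor([safe_target], dtype=torch.float32)
    return g_pt.generate(current_params, prompt)  # hypothetical interface
```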
Let's say I generate millions of checkpoints of a model using the Adam optimizer, and it reaches a minimum loss of, e.g., 0.8. Can G.pt generate weights that achieve a loss lower than 0.8?