wgrathwohl / JEM

Project site for "Your Classifier is Secretly an Energy-Based Model and You Should Treat it Like One"
Apache License 2.0

Model training terminated #8

Open divymurli opened 3 years ago

divymurli commented 3 years ago

Hi, one other issue I wanted to point out was that the training process seemed to terminate about 27 epochs in, due to a diverging loss.

Thanks!

[Screenshot 2020-11-23 at 08 46 08: training log showing the diverging loss]
wgrathwohl commented 3 years ago

As I say in the paper, the best thing to do when the model diverges is to increase the number of MCMC steps or decrease the learning rate. EBMs are very finicky creatures! Thankfully, there has been a lot of work on improving and stabilizing their training. One thing I read recently found that smooth nonlinearities make training considerably more stable, so you could try a Swish activation and see if that helps.
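Concretely, those knobs live in the SGLD inner loop. Here is a minimal sketch, assuming a classifier `f(x)` that returns logits; the parameter names are illustrative and may not match the repo's flags exactly. `n_steps` is the MCMC-step count to increase, and the learning rate to decrease is the optimizer's during training.

```python
import torch

def sgld_sample(f, x_init, n_steps=20, sgld_lr=1.0, sgld_std=0.01):
    # log p(x) is defined (up to a constant) as logsumexp_y f(x)[y],
    # so the energy is its negative. Running more n_steps gives the
    # sampler more time to reach low-energy regions before the
    # contrastive-divergence gradient is computed.
    x = x_init.clone().detach()
    for _ in range(n_steps):
        x.requires_grad_(True)
        energy = -f(x).logsumexp(dim=1).sum()
        grad_x = torch.autograd.grad(energy, x)[0]
        x = (x - sgld_lr * grad_x + sgld_std * torch.randn_like(x)).detach()
    return x

# Swapping ReLU for a smooth nonlinearity is a one-line change in the network,
# e.g. activation = torch.nn.SiLU()  # SiLU is Swish with beta = 1
```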

Cheers

divymurli commented 3 years ago

Ah, and by MCMC steps do you mean the SGLD steps? (Sorry, I'm not super familiar with MCMC.)

mwcvitkovic commented 3 years ago

Related question: so just to be clear, the code in the repo isn't the code used to create the results in the paper?

wgrathwohl commented 3 years ago

It is, but as we write in Appendix H.3:

"We find that when using PCD occasionally throughout training a sample will be drawn from the replay buffer that has a considerably higher-than average energy (higher than the energy of a random initialization). This causes the gradients w.r.t this example to be orders of magnitude larger than gradients w.r.t the rest of the examples and causes the model to diverge. We tried a number of heuristic approaches such as gradient clipping, energy clipping, ignoring examples with atypical energy values, and many others but could not find an approach that stabilized training and did not hurt generative and discriminative performance."

I will be the first to admit that EBM training in this way is a nightmare and requires pretty consistent babysitting. At the moment these models are basically where GANs were in 2014: not easy to train and requiring a lot of hand-tuning. The main point of this paper was to demonstrate the utility of these models if they can be trained. There have been a number of improvements since then which can stabilize EBM training.

You should be able to train these models with some combination of restarts, learning-rate decreases, and MCMC step increases. I hope that helps.

mwcvitkovic commented 3 years ago

Definitely helpful, and much appreciated. I'm just curious whether the training command in the README worked for you but isn't working for @divymurli.

That would be surprising considering that random draws from the buffer should be deterministic under the random seeds you set in the training scripts. I can't see what the source of randomness would be.
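One candidate, though: even with the Python/NumPy/PyTorch seeds fixed, cuDNN may select nondeterministic GPU kernels, so two runs can still drift apart. A rough sketch of fully deterministic seeding, assuming plain PyTorch (this is not the repo's code):

```python
import random
import numpy as np
import torch

def seed_everything(seed: int = 1):
    # Fix the Python, NumPy, and PyTorch RNGs (buffer draws, SGLD noise, etc.).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Without these, cuDNN can pick nondeterministic kernels, so two runs
    # may still diverge from each other even with identical seeds.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```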

andiac commented 10 months ago

Thanks, increasing MCMC steps helps a lot.