kayuksel opened this issue 3 years ago
FYI, I have just opened a feature request in the PyTorch repository as well: https://github.com/pytorch/pytorch/issues/52626
Thank you for your interest in our work, and for opening the feature request! It is a good idea to make the code applicable to any network without having to add functions to the network class for switching between the states.
I just looked at the documentation and believe this is achievable without adding new features to PyTorch. We can use register_forward_pre_hook
to swap in the updated parameters, or to switch the BN layers to training behavior without affecting dropout, and use the hook's remove method
(https://github.com/pytorch/pytorch/issues/5037) to switch back between the states. I will look into the details later and hopefully release a version soon.
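To make the hook idea concrete, here is a minimal sketch (not the released GradInit code; the toy model and helper names are made up for illustration, and it assumes the model is otherwise kept in eval mode):

```python
import torch
import torch.nn as nn

# Toy model with both BN and dropout, purely for illustration.
model = nn.Sequential(nn.Linear(10, 10), nn.BatchNorm1d(10), nn.Dropout(0.5))

BN_TYPES = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)

def _set_bn_training(module, mode):
    for m in module.modules():
        if isinstance(m, BN_TYPES):
            m.train(mode)

# While these hooks are registered, BN uses batch statistics during the
# forward pass; dropout keeps whatever mode the model is in.
pre_handle = model.register_forward_pre_hook(lambda mod, inp: _set_bn_training(mod, True))
post_handle = model.register_forward_hook(lambda mod, inp, out: _set_bn_training(mod, False))

model.eval()                      # dropout stays disabled
out = model(torch.randn(4, 10))   # BN behaves as in training thanks to the pre-hook

pre_handle.remove()               # switch back to the plain eval behavior
post_handle.remove()
```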
Sounds great! If we could also use it with any given optimizer, that would be perfect. Nobody uses plain Adam anymore; most people use e.g. optimizers from the torch_optimizer package.
Hi, I'm also really interested in a general PyTorch implementation. I'll try it out as soon as some basic example code is available, and will be sure to cite the work in our upcoming publication. I'm not even sure a package installation is the best route here, since it sounds like it could be sufficient to add a few functions from gradinit_modules at certain points in the model code. Our group works with custom models (a mix of RNNs, CNNs, DNNs, attention, and Transformers), so an explanation of where gradinit_modules should be implemented would help us immensely (instead of premade code for specific models).
In the end we will probably try to implement it ourselves, but we would rather follow the official instructions to ensure it works properly.
Great work! Thanks for the publication, Artem.
@kayuksel I have just pushed a new version that supports any CNN whose only parameterized layers are nn.Conv2d, nn.Linear and nn.BatchNorm2d. Please refer to the note in the README for more details, and feel free to ask if you have any further questions.
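For anyone unsure whether their model falls under this restriction, a quick unofficial check along these lines (the helper name is my own) could verify that every module owning parameters is one of the supported types:

```python
import torch.nn as nn

SUPPORTED = (nn.Conv2d, nn.Linear, nn.BatchNorm2d)

def has_only_supported_layers(model: nn.Module) -> bool:
    """Return True if every module that owns parameters is a supported type."""
    for m in model.modules():
        owns_params = any(True for _ in m.parameters(recurse=False))
        if owns_params and not isinstance(m, SUPPORTED):
            return False
    return True
```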
@danarte Thanks for the interest in our work! Basically, we use GradInit on all parameters of the network: we learn a scale factor for each weight and each bias (if any, and non-zero at initialization). Please refer to the notes in the updated README.md to see how to extend it to other models like Transformers. Essentially, we just need to iterate over all trainable modules in a fixed order and take gradient steps for all their parameters (to compute the objective of Eq. 1), as sketched below. I will release the code for fairseq ASAP. Feel free to open a new issue if you have any questions.
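As a rough illustration of the scale-factor idea (not the official implementation; the `scales` dict, the placeholder loss, and the use of torch.func.functional_call are my own assumptions), one could rescale each parameter by a learnable scalar and optimize only those scalars:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 10))

# One learnable scalar per weight/bias that is non-zero at initialization.
scales = {
    name: torch.ones(1, requires_grad=True)
    for name, p in model.named_parameters()
    if p.requires_grad and p.abs().sum() > 0
}

def scaled_forward(x):
    # Run the model with rescaled parameters; functional_call is available in
    # recent PyTorch (older versions: torch.nn.utils.stateless.functional_call).
    params = {
        name: scales[name] * p if name in scales else p
        for name, p in model.named_parameters()
    }
    return torch.func.functional_call(model, params, (x,))

# The scale factors (not the weights themselves) would then be tuned with a few
# gradient steps on the GradInit objective before regular training starts.
opt = torch.optim.SGD(scales.values(), lr=1e-2)
loss = scaled_forward(torch.randn(4, 10)).pow(2).mean()  # placeholder loss
loss.backward()
opt.step()
```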
@zhuchen03 I see that it requires a dataloader and seems to be specific to classification. This is unfortunately too specific for me to use in some novel problems that I am working on (where initialization is crucial). But thank you for helping and for letting me know about the update.
@kayuksel I'm curious, what is your problem like? I think you can try it out as long as your model can be optimized with SGD; you just need to replace the loss function with yours. I agree the current version is restricted to image classification, but it shouldn't be too difficult to adapt to other tasks. Happy to assist, or maybe improve the API, if you could provide more details.
@zhuchen03 In my case, it is a generative model trained with QHAdam (with an adaptive gradient clipping wrapper), which learns to continuously generate populations of solutions to e.g. a mathematical function.
In this type of reinforcement learning problem, the network initialization can be an important factor, as it affects how the agent starts taking actions and hence how experiences are acquired to update the policy
(leading to the severe reproducibility issues and random seed sensitivity of RL).
@kayuksel I see. I do not have much background in the problem you are trying to solve, but from your description it looks like you are using some Adam-like optimizer, and GradInit should be applicable as long as we can write down its update rule for the first step.
I can check whether there are other issues hindering an implementation of GradInit for your problem if you could share some simple sample code.
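For instance, a plain Adam update has a simple closed form at the very first step, since both moment estimates are built from a single gradient (a sketch under that assumption; the QHAdam-plus-clipping variant would need its own first-step formula):

```python
import torch

def adam_first_step(param, grad, lr=1e-3, eps=1e-8):
    # At step 1, the bias-corrected moments reduce to m_hat = g and v_hat = g**2,
    # so the update is roughly a sign-of-gradient step of size lr.
    return param - lr * grad / (grad.abs() + eps)

p, g = torch.randn(5), torch.randn(5)
print(adam_first_step(p, g))
```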
Thanks @zhuchen03, how can I send you some sample code? Can I use the (cs.umd.edu) e-mail mentioned in your resume?
Yes that works. Thank you!
Hi @kayuksel, @danarte. Just in case, I wanted to point out our recent work https://github.com/facebookresearch/ppuda. You should be able to initialize almost any neural net in a single function call.
Hello,
Thanks for such great work. Auto-initializing a DNN in a proper way definitely sounds amazing.
Yet, the usability needs to be significantly improved so that I can plug this into my existing networks.
It would be great if that could be as easy as installing and then importing an additional package.
We should maybe open a feature request in PyTorch so that they integrate this into the framework.