Open BadeiAlrahel opened 5 months ago
Sorry, I tried to train the model without an autoencoder; it worked well but required a very long training time. We don't have enough resources to train such a model. @BadeiAlrahel
Thank you for the fast response! What exactly do you mean by a lot of time and resources? Didn't you train on an A100? If so, how long do you think the full 500k iterations would have taken on the A100 without the VQGAN?
I can't remember the exact time for training 500k iterations. The backbone contains some blocks of Swin Transformer. These Swin transformer blocks are very slow if we directly train on the image space. I suggest you first fine-tune the autoencoder and then train ResShift.
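To make the cost difference concrete, here is a small back-of-the-envelope sketch (illustrative numbers only, not ResShift's exact configuration): a Swin-style backbone attends over one token per spatial position, so moving from pixel space to a latent space downsampled by a factor f cuts the token count by f².

```python
def token_counts(height, width, f=4):
    """Compare token counts for a windowed-attention backbone operating
    in pixel space vs. a latent space downsampled by factor f.

    Illustrative only: the downsampling factor f=4 is an assumption,
    not ResShift's documented setting.
    """
    pixel_tokens = height * width          # one token per pixel
    latent_tokens = (height // f) * (width // f)  # f^2 times fewer tokens
    return pixel_tokens, latent_tokens

# For a 256x256 image with f=4: 65536 pixel-space tokens vs. 4096
# latent-space tokens, i.e. a 16x reduction per attention layer.
```

This is why fine-tuning the autoencoder first and then training ResShift in latent space is so much cheaper than training directly on images.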
Thank you for the rapid response! I was actually trying to fine-tune the VQGAN included in your released weights, but I could not find the code that computes its loss function in your repository. So I went to the LDM GitHub just to take their loss function, LPIPSWithDiscriminator, but their config file for the VQGAN with embed_dim=3 does not use the same parameters as yours, and I hit many errors when I used their training code and loss function. Can you maybe give me some insight on how to proceed to fine-tune your VQGAN on my data?
The VQGAN model was trained with this repo. You should modify the config file yourself.
Yeah, I know that; I did look at that repository, but the problem is the config files: none of them match the one used by ResShift (yours has embed_dim != 3, z_channels != 3, and attn_resolutions: []). I know that some changes in your code to make ResShift handle other z_channels and embed_dim values would do the job, but I am not sure what results that would produce... How did you actually train your VQGAN? Did you use the taming-transformers GitHub?
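When the config of a released checkpoint is uncertain, one sanity check is to read the dimensions off the weights themselves. In the taming-transformers/LDM VQModel, `quant_conv` is a 1x1 Conv2d mapping z_channels to embed_dim, so its weight tensor has shape (embed_dim, z_channels, 1, 1). A minimal sketch (the checkpoint layout assumed here is the standard Lightning one):

```python
def infer_vqgan_dims(ckpt):
    """Infer (embed_dim, z_channels) from a VQGAN checkpoint dict.

    In the taming/LDM VQModel, quant_conv is a 1x1 Conv2d mapping
    z_channels -> embed_dim, so its weight is shaped
    (embed_dim, z_channels, 1, 1).
    """
    # Lightning checkpoints nest the weights under "state_dict".
    sd = ckpt.get("state_dict", ckpt)
    w = sd["quant_conv.weight"]
    return w.shape[0], w.shape[1]

# Usage (hypothetical path; requires torch):
#   ckpt = torch.load("vqgan.ckpt", map_location="cpu")
#   embed_dim, z_channels = infer_vqgan_dims(ckpt)
```

With the inferred values in hand, the YAML config can be edited to match the checkpoint rather than guessed from the LDM defaults.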
No, the employed VQGAN model in ResShift is directly borrowed from LDM. I have not trained or fine-tuned it.
Can a 512-size autoencoder be used only on faces, or can it be applied to other types of data as well?
For face restoration, the VQGAN was trained by myself. The checkpoint is accessible here. @UserYuXuan
Thank you very much for your prompt reply! I tried to replace the autoencoder in this model with the f8 VQGAN and KL models from LDM, but even after changing the configuration in the YML file to LDM's parameters, I still encounter an error: RuntimeError: Error(s) in loading state_dict for VQModelTorch: Missing key(s) in state_dict: "encoder.conv_in.weight", "encoder.conv_in.bias", "encoder.down.0.block.0.norm1.weight",... Unexpected key(s) in state_dict: "epoch", "global_step", "pytorch-lightning_version", "state_dict", "callbacks", "optimizer_states", "lr_schedulers".
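The unexpected keys listed in that error ("epoch", "global_step", "optimizer_states", ...) are the telltale signature of a raw PyTorch Lightning checkpoint: Lightning stores the weights under a nested "state_dict" key alongside training metadata, and passing the outer dict straight to load_state_dict produces exactly this failure. A minimal sketch of the fix:

```python
def unwrap_lightning_ckpt(ckpt):
    """Return the bare weight dict from a PyTorch Lightning checkpoint.

    Lightning saves the model weights under the "state_dict" key,
    next to training metadata such as "epoch", "global_step", and
    "optimizer_states"; plain checkpoints are returned unchanged.
    """
    return ckpt["state_dict"] if "state_dict" in ckpt else ckpt

# Usage (hypothetical path; requires torch):
#   ckpt = torch.load("vq-f8.ckpt", map_location="cpu")
#   model.load_state_dict(unwrap_lightning_ckpt(ckpt))
```

Note that unwrapping only fixes the nesting; any remaining missing keys would mean the model architecture in the YML still disagrees with the checkpoint.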
Thanks for the good work! I assessed the quality of the VQGAN on my data and it was really poor, which in turn degraded the quality of your model's outputs. So I want to stop using any autoencoder, and I was wondering whether you have released model weights trained without an autoencoder, since Figure 2 of the official paper says the autoencoder is optional. I would really appreciate it; otherwise I would need to train the model from scratch...