zsyzzsoft / co-mod-gan

[ICLR 2021, Spotlight] Large Scale Image Completion via Co-Modulated Generative Adversarial Networks

Errors in data preprocessing and training #13

Open athena913 opened 3 years ago

athena913 commented 3 years ago

Hi,

Thank you for making your code public. I used your code and instructions to train on my own data but ran into the following issues. (Previously, I was able to generate tfrecords with the original StyleGAN2 data processing code and train with StyleGAN2's training code, both unmodified. So I am not sure why I cannot get your code to work with my data, since it uses the StyleGAN2 codebase.)

1) I first tried to generate tfrecords using the following command line:

python dataset_tools/create_from_images.py --tfrecord-dir --train-image-dir --resolution 1024 --shuffle True --compressed True

The images are loaded, the number of images is reported correctly, and the tfrecord directory is created, but the tfrecords are not saved (only a single 0-byte tfrecord file is written).

2) So I generated the tfrecords using the original StyleGAN2 data processing code and tried to train using the following command line:

python run_training.py --data-dir= --result-dir= --dataset="" --num-gpus=4 --total-kimg=10000 --mirror-augment=True

But it runs into an OOM error even though I reduced the batch size from 32 to 16. The error is: unable to allocate memory for [1,128,1024,1024] tensor.

I was able to train with StyleGAN2's code at batch size 32 and 1024^2 resolution, so I would appreciate your help.

thanks
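
For scale, the size of the tensor named in the OOM message can be worked out directly from its shape. The sketch below assumes 4 bytes per element (float32), which the issue does not state; `tensor_bytes` is a hypothetical helper written for this illustration, not part of either codebase.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Return the size in bytes of a dense tensor with the given shape.

    bytes_per_element=4 assumes float32; the dtype is not given in the
    error message, so this is an assumption.
    """
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


# Shape taken from the error message: [1, 128, 1024, 1024]
size = tensor_bytes((1, 128, 1024, 1024))
print(f"{size} bytes = {size / 2**20:.0f} MiB")  # 536870912 bytes = 512 MiB
```

Under that assumption, this single activation takes 512 MiB for a batch dimension of 1, so it is only one of many allocations competing for GPU memory at 1024^2 resolution.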

zsyzzsoft commented 3 years ago
  1. Try the script without --compressed.
  2. Try reducing the batch size further. If training runs at a small batch size, you can use that run to check whether the dataset is correct.
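
Before re-running training, a quick sanity check is to confirm that the generated files are not empty, since the symptom reported above was a single 0-byte file. This is a minimal sketch using only the standard library. It assumes the output files end in `.tfrecords` (the StyleGAN2 naming convention, not confirmed in this thread), and the helper names are hypothetical.

```python
import os


def list_tfrecord_sizes(tfrecord_dir, suffix=".tfrecords"):
    """Return {filename: size_in_bytes} for files in tfrecord_dir ending in suffix.

    The ".tfrecords" suffix is an assumption; adjust it if the files are
    named differently.
    """
    sizes = {}
    for name in sorted(os.listdir(tfrecord_dir)):
        if name.endswith(suffix):
            sizes[name] = os.path.getsize(os.path.join(tfrecord_dir, name))
    return sizes


def empty_tfrecords(tfrecord_dir, suffix=".tfrecords"):
    """Return the names of matching files that are 0 bytes long."""
    sizes = list_tfrecord_sizes(tfrecord_dir, suffix)
    return [name for name, size in sizes.items() if size == 0]
```

Calling `empty_tfrecords` on the directory passed to `--tfrecord-dir` returns a non-empty list if the conversion silently wrote nothing. It only checks file sizes, so it does not verify that the records themselves decode correctly.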