Problems with training on Google Colab - FIXED

Particle1904 commented 4 months ago

Running the command: !torchrun --nproc_per_node=1 --master_port=8765 /content/DualStyleGAN/finetune_stylegan.py --iter 600 --batch 4 --ckpt /content/DualStyleGAN/checkpoint/stylegan2-ffhq-config-f.pt --style mydataset /content/DualStyleGAN/data/mydataset/lmdb/

I'm not sure what the problem is. I searched for the error and have found nothing so far.


Load options
ada_every: 256
ada_length: 500000
ada_target: 0.6
augment: False
augment_p: 0
batch: 4
channel_multiplier: 2
ckpt: /content/DualStyleGAN/checkpoint/stylegan2-ffhq-config-f.pt
d_reg_every: 16
g_reg_every: 4
iter: 600
local_rank: 0
lr: 0.002
mixing: 0.9
model_path: ./checkpoint/
n_sample: 9
path: /content/DualStyleGAN/data/mydataset/lmdb/
path_batch_shrink: 2
path_regularize: 2
r1: 10
save_every: 10000
size: 1024
style: mydataset
wandb: False
**************************************************************************************************
load model: /content/DualStyleGAN/checkpoint/stylegan2-ffhq-config-f.pt
  0%|                                                                                                               | 0/600 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/content/DualStyleGAN/finetune_stylegan.py", line 391, in <module>
    train(args, loader, generator, discriminator, g_optim, d_optim, g_ema, device)
  File "/content/DualStyleGAN/finetune_stylegan.py", line 115, in train
    real_img = next(loader)
  File "/content/DualStyleGAN/util.py", line 58, in sample_data
    for batch in loader:
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 675, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/content/DualStyleGAN/model/stylegan/dataset.py", line 37, in __getitem__
    img = Image.open(buffer)
  File "/usr/local/lib/python3.10/dist-packages/PIL/Image.py", line 3283, in open
    raise UnidentifiedImageError(msg)
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7b6ee9593510>
[2024-04-26 06:59:00,662] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 14123) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/content/DualStyleGAN/finetune_stylegan.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-26_06:59:00
  host      : 22dc9dd488ca
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 14123)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Particle1904 commented 4 months ago

After three days of fighting Google Colab, Python, and conda, I finally figured out how to run the training in Google Colab. If anyone is interested, I'll share the Colab notebook.

kojuwonresearch commented 1 month ago

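If you see the same error, check the size you set in the prepare_data step. For example, if you prepared the dataset at size 512, pass the --size 512 flag when running finetune_stylegan.py. The options dump in the original post shows size: 1024, which would not match a dataset prepared at a different size.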
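
A rough sketch of why a size mismatch could produce this exact error. The thread does not show dataset.py, so this assumes the dataset keys each image as "<size>-<zero-padded index>", the layout used by the StyleGAN2 PyTorch code this kind of repository is usually built on. A plain dict stands in for the LMDB database, and `make_key`, `stored_resolutions`, and `fetch` are hypothetical helper names, not functions from DualStyleGAN.

```python
from io import BytesIO


def make_key(resolution: int, index: int) -> bytes:
    # ASSUMPTION: keys look like b"512-00000" (size, dash, zero-padded index).
    return f"{resolution}-{str(index).zfill(5)}".encode("utf-8")


def stored_resolutions(keys) -> set:
    # Collect the numeric prefix of every image key; keys without a
    # numeric "<size>-" prefix (such as b"length") are skipped.
    found = set()
    for key in keys:
        prefix, sep, _ = key.decode("utf-8").partition("-")
        if sep and prefix.isdigit():
            found.add(int(prefix))
    return found


def fetch(store: dict, resolution: int, index: int) -> BytesIO:
    # dict.get returns None for a missing key, as an LMDB lookup does by
    # default. BytesIO(None) is an empty buffer, so an image library has
    # no bytes to identify and cannot open it.
    return BytesIO(store.get(make_key(resolution, index)))


# Stand-in for a database prepared at size 512 (placeholder bytes, not real images).
store = {make_key(512, i): b"fake image bytes" for i in range(3)}
store[b"length"] = b"3"

print(stored_resolutions(store))              # sizes actually present in the store
print(len(fetch(store, 1024, 0).getvalue()))  # lookup with the wrong size: empty buffer
print(len(fetch(store, 512, 0).getvalue()))   # lookup with the matching size: has data
```

If the assumption holds, listing the sizes present in your own database in the same way shows which value to pass as --size.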

williamyang1991 / DualStyleGAN, issue #102