williamyang1991 / DualStyleGAN

[CVPR 2022] Pastiche Master: Exemplar-Based High-Resolution Portrait Style Transfer

Training fails #54

Open wbfwonderful opened 1 year ago

wbfwonderful commented 1 year ago

Hi, I am training on my own dataset on Colab, following the steps in the README, but training fails in the second step of Facial destylization: "Step 2: Fine-tune StyleGAN". The error output is as follows:

load model: ./checkpoint/stylegan2-ffhq-config-f.pt
0%| | 0/600 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "finetune_stylegan.py", line 391, in <module>
    train(args, loader, generator, discriminator, g_optim, d_optim, g_ema, device)
  File "finetune_stylegan.py", line 115, in train
    real_img = next(loader)
  File "/content/DualStyleGAN/util.py", line 58, in sample_data
    for batch in loader:
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 721, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/content/DualStyleGAN/model/stylegan/dataset.py", line 37, in __getitem__
    img = Image.open(buffer)
  File "/usr/local/lib/python3.7/dist-packages/PIL/Image.py", line 2657, in open
    % (filename if filename else fp))
OSError: cannot identify image file <_io.BytesIO object at 0x7f070291c410>
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2003) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 755, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

finetune_stylegan.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-11-23_07:27:01
  host      : a3b13d7b3fb3
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2003)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

williamyang1991 commented 1 year ago

This error does not seem to be related to my code. It is raised while reading your training images. I haven't encountered it before and don't know how to solve it.

Maybe you can search for "OSError: cannot identify image file" and find a solution on the Internet.
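
Since the traceback ends in PIL's Image.open, one possible way to narrow this down is to scan the source images for files Pillow cannot read before rebuilding the dataset. The following is only a minimal sketch under that assumption: the function name find_unreadable_images and the folder passed to it are hypothetical and not part of DualStyleGAN.

```python
from pathlib import Path

from PIL import Image


def find_unreadable_images(root):
    """Return the files under `root` that Pillow cannot identify or verify.

    Hypothetical helper for illustration; not part of the DualStyleGAN repo.
    """
    bad = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            # Image.open raises if the format cannot be identified;
            # verify() additionally checks the file for corruption.
            with Image.open(path) as img:
                img.verify()
        except Exception:
            bad.append(path)
    return bad
```

Calling find_unreadable_images on the folder that holds the training images returns every file that would trigger the same kind of error, including non-image files such as hidden folders' contents or stray text files. Removing or re-exporting those files and then regenerating the dataset may resolve the failure, although the thread does not confirm this.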

exceedzhang commented 1 year ago

I guess your training pictures should be placed in the images/train/ directory @wbfwonderful
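
A quick way to check the layout suggested above is to count the image files in each immediate subdirectory of the dataset's images folder. This is a rough sketch only: the helper name count_images_by_subdir, the extension list, and any path passed to it are assumptions for illustration and do not come from the repository.

```python
from collections import Counter
from pathlib import Path

# Assumed set of image extensions; adjust to match the dataset.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}


def count_images_by_subdir(root):
    """Count image files under `root`, grouped by first-level subdirectory.

    Files that sit directly in `root` are counted under the key ".".
    Hypothetical helper for illustration; not part of the DualStyleGAN repo.
    """
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTS:
            rel = path.relative_to(root)
            key = rel.parts[0] if len(rel.parts) > 1 else "."
            counts[key] += 1
    return dict(counts)
```

If the result shows all images under the "." key rather than under "train", the pictures sit directly in the images folder and not in images/train/, which is the situation this comment guesses at.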