Open DAVEISHAN opened 1 month ago
@DAVEISHAN try setting mixed precision in the train_xl.sh file, it solved the issue for me.
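For reference, a minimal sketch of what "setting mixed precision" could look like if the script launches training through Hugging Face `accelerate` (the script name and other flags here are assumptions, not taken from the repo's actual `train_xl.sh`):

```shell
# Hypothetical launcher line inside train_xl.sh:
# --mixed_precision is a real `accelerate launch` flag; "bf16" enables bfloat16 autocast.
accelerate launch --mixed_precision "bf16" train_xl.py
```

The same setting can also be made persistent via `accelerate config`.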
Thank you @Alexsumt, however I am now getting random NaN values as the step losses, and I'm not sure what the issue is. Any leads?
@Alexsumt which type of mixed precision did you set?
@aaaqianqian BF16.
I am facing a `RuntimeError` related to dtype mismatches during the forward pass of the training code:

```
down, reference_features = unet_encoder(cloth_values, timesteps, text_embeds_cloth, return_dict=False)
  File "/miniforge/envs/idm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/task_runtime/src/unet_hacked_garmnet.py", line 1052, in forward
    emb = self.time_embedding(t_emb, timestep_cond)
  File "/miniforge/envs/idm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniforge/envs/idm/lib/python3.10/site-packages/diffusers/models/embeddings.py", line 228, in forward
    sample = self.linear_1(sample)
  File "/miniforge/envs/idm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniforge/envs/idm/lib/python3.10/site-packages/diffusers/models/lora.py", line 430, in forward
    out = super().forward(hidden_states)
  File "/miniforge/envs/idm/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 must have the same dtype
```
I've found that loading the VAE in regular `torch.float32` resolves this issue, like the following:

```python
vae = AutoencoderKL.from_pretrained(args.pretrained_model_name_or_path, subfolder="vae")
```

instead of your code:

```python
vae = AutoencoderKL.from_pretrained(args.pretrained_model_name_or_path, subfolder="vae", torch_dtype=torch.float16)
```
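To illustrate why the error occurs, here is a minimal self-contained sketch (the tensor names are illustrative, not IDM-VTON code): a float16 activation reaching a float32 `Linear`/`F.linear` triggers exactly this kind of dtype-mismatch `RuntimeError`, and either keeping both sides in float32 or casting the activation fixes it.

```python
import torch
import torch.nn.functional as F

# Illustrative stand-ins: fp16 activations (e.g. from a fp16 VAE)
# meeting fp32 weights (e.g. an fp32 time-embedding Linear).
x = torch.randn(2, 4, dtype=torch.float16)
w = torch.randn(3, 4, dtype=torch.float32)

try:
    F.linear(x, w)  # mixed dtypes -> RuntimeError
except RuntimeError as e:
    print(f"mismatch: {e}")

# Fix: make the dtypes agree, e.g. cast the activation up to the
# weight's dtype (equivalent in spirit to loading the VAE in fp32).
out = F.linear(x.to(w.dtype), w)
print(out.dtype)  # torch.float32
```

Casting the VAE output to the UNet's dtype at the call site would be the other option if you want to keep the memory savings of a fp16 VAE.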
However, before standardizing this change across the codebase, I would like to confirm if it is advisable to do so.
Thank you!