Open DAVEISHAN opened 1 month ago
@DAVEISHAN try setting mixed precision in the train_xl.sh file, it solved the issue for me.
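For reference, a minimal sketch of what "setting mixed precision" could look like if the script launches training through Hugging Face `accelerate` (the script name and other flags here are assumptions, not taken from the repo's actual `train_xl.sh`):

```shell
# Hypothetical launcher line inside train_xl.sh:
# --mixed_precision is a real `accelerate launch` flag; "bf16" enables bfloat16 autocast.
accelerate launch --mixed_precision "bf16" train_xl.py
```

The same setting can also be made persistent via `accelerate config`.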
Thank you @Alexsumt, however I am now getting random NaN values as the step losses, and I'm not sure what the issue is. Any leads?
@Alexsumt which type of mixed precision did you set?
@aaaqianqian BF16.
I am facing a `RuntimeError` related to dtype mismatches during the forward pass of the training code:

```
down, reference_features = unet_encoder(cloth_values, timesteps, text_embeds_cloth, return_dict=False)
  File "/miniforge/envs/idm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/task_runtime/src/unet_hacked_garmnet.py", line 1052, in forward
    emb = self.time_embedding(t_emb, timestep_cond)
  File "/miniforge/envs/idm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniforge/envs/idm/lib/python3.10/site-packages/diffusers/models/embeddings.py", line 228, in forward
    sample = self.linear_1(sample)
  File "/miniforge/envs/idm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniforge/envs/idm/lib/python3.10/site-packages/diffusers/models/lora.py", line 430, in forward
    out = super().forward(hidden_states)
  File "/miniforge/envs/idm/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 must have the same dtype
```
I've found that loading the VAE in regular `torch.float32` resolves this issue, like the following:

```python
vae = AutoencoderKL.from_pretrained(args.pretrained_model_name_or_path, subfolder="vae")
```

instead of your code:

```python
vae = AutoencoderKL.from_pretrained(args.pretrained_model_name_or_path, subfolder="vae", torch_dtype=torch.float16)
```
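To illustrate why the error occurs, here is a minimal self-contained sketch (the tensor names are illustrative, not IDM-VTON code): a float16 activation reaching a float32 `Linear`/`F.linear` triggers exactly this kind of dtype-mismatch `RuntimeError`, and either keeping both sides in float32 or casting the activation fixes it.

```python
import torch
import torch.nn.functional as F

# Illustrative stand-ins: fp16 activations (e.g. from a fp16 VAE)
# meeting fp32 weights (e.g. an fp32 time-embedding Linear).
x = torch.randn(2, 4, dtype=torch.float16)
w = torch.randn(3, 4, dtype=torch.float32)

try:
    F.linear(x, w)  # mixed dtypes -> RuntimeError
except RuntimeError as e:
    print(f"mismatch: {e}")

# Fix: make the dtypes agree, e.g. cast the activation up to the
# weight's dtype (equivalent in spirit to loading the VAE in fp32).
out = F.linear(x.to(w.dtype), w)
print(out.dtype)  # torch.float32
```

Casting the VAE output to the UNet's dtype at the call site would be the other option if you want to keep the memory savings of a fp16 VAE.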
However, before standardizing this change across the codebase, I would like to confirm if it is advisable to do so.
Thank you!