xdit-project / xDiT

xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters
Apache License 2.0

Constraints on Parallel Settings for CogVideoX and Performance Issue #265

Open feifeibear opened 5 days ago

feifeibear commented 5 days ago

xDiT currently implements a sequence parallel (SP) version of CogVideoX. However, it comes with the following restrictions:

  1. head_num (30 here) % ulysses_degree == 0
  2. height % sp_degree == 0
  3. With --height 640 --width 720 and sp_degree = 8 (ulysses_degree = 2, ring_degree = 4), both checks above pass, but the VAE decoder still throws the following error:
[rank7]:   File "/cfs/fjr2/xDiT/xfuser/model_executor/pipelines/pipeline_cogvideox.py", line 372, in __call__
[rank7]:     video = self.decode_latents(latents)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox.py", line 360, in decode_latents
[rank7]:     frames = self.vae.decode(latents).sample
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
[rank7]:     return method(self, *args, **kwargs)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1153, in decode
[rank7]:     decoded = self._decode(z).sample
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1123, in _decode
[rank7]:     z_intermediate = self.decoder(z_intermediate)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank7]:     return self._call_impl(*args, **kwargs)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank7]:     return forward_call(*args, **kwargs)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 851, in forward
[rank7]:     hidden_states = self.conv_in(sample)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank7]:     return self._call_impl(*args, **kwargs)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank7]:     return forward_call(*args, **kwargs)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 144, in forward
[rank7]:     output = self.conv(inputs)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank7]:     return self._call_impl(*args, **kwargs)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank7]:     return forward_call(*args, **kwargs)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 64, in forward
[rank7]:     return super().forward(input)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 608, in forward
[rank7]:     return self._conv_forward(input, self.weight, self.bias)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 603, in _conv_forward
[rank7]:     return F.conv3d(
[rank7]: RuntimeError: Given groups=1, weight of size [512, 16, 3, 3, 3], expected input[1, 10, 5, 82, 92] to have 16 channels, but got 10 channels instead
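
The two divisibility restrictions listed above can be checked before launching a run. Below is a minimal sketch, assuming sp_degree = ulysses_degree * ring_degree (inferred from the sp_degree = 8, uly=2, ring=4 example). `check_sp_config` is a hypothetical helper written for illustration, not part of xDiT.

```python
def check_sp_config(head_num, height, ulysses_degree, ring_degree):
    """Return a list of violated restrictions (empty list if none).

    Hypothetical helper; assumes sp_degree = ulysses_degree * ring_degree.
    """
    problems = []
    sp_degree = ulysses_degree * ring_degree

    # Restriction 1: attention heads must split evenly across Ulysses ranks.
    if head_num % ulysses_degree != 0:
        problems.append(
            f"head_num ({head_num}) % ulysses_degree ({ulysses_degree}) != 0"
        )

    # Restriction 2: height must split evenly across all SP ranks.
    if height % sp_degree != 0:
        problems.append(
            f"height ({height}) % sp_degree ({sp_degree}) != 0"
        )

    return problems


# head_num = 30 as reported in this issue.
print(check_sp_config(head_num=30, height=640, ulysses_degree=2, ring_degree=4))
print(check_sp_config(head_num=30, height=640, ulysses_degree=4, ring_degree=2))
```

Note that the failing configuration (height 640, uly=2, ring=4) passes both checks, so these two restrictions alone do not rule out the VAE decoder error shown above.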
feifeibear commented 5 days ago

In addition to the constraints on parallel degree settings, the SP implementation has a performance issue. On an L40 machine, using 2 GPUs is slower than using 1 GPU.

1 GPU: output saved to results/cogvideox_dp1_cfg1_ulysses1_ring1_tp1_pp1_patchNone_720x640.mp4 epoch time: 2.42 sec, memory: 28.733202944 GB

2 GPUs: output saved to results/cogvideox_dp1_cfg1_ulysses2_ring1_tp1_pp1_patchNone_720x640.mp4 epoch time: 2.58 sec, memory: 29.172894208 GB
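
To put the two timings in perspective, here is a small sketch that computes the speedup and parallel efficiency from the epoch times reported above. The function name `speedup_and_efficiency` is made up for illustration. Speedup is taken as single-GPU time over multi-GPU time, and efficiency as speedup divided by GPU count.

```python
def speedup_and_efficiency(t_single, t_multi, n_gpus):
    """Speedup relative to one GPU, and efficiency per GPU used."""
    speedup = t_single / t_multi
    efficiency = speedup / n_gpus
    return speedup, efficiency


# Epoch times (seconds) and memory (GB) from the two runs reported above.
speedup, efficiency = speedup_and_efficiency(t_single=2.42, t_multi=2.58, n_gpus=2)
extra_memory_gb = 29.172894208 - 28.733202944

print(f"speedup: {speedup:.3f}x")
print(f"parallel efficiency: {efficiency:.1%}")
print(f"extra memory with 2 GPUs: {extra_memory_gb:.3f} GB")
```

A speedup below 1.0 means the 2-GPU run is slower than the single-GPU run, which is the case for these numbers.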