Open feifeibear opened 5 days ago
In addition to the constraints on parallel degree settings, the SP (sequence parallel) implementation has a performance issue: on an L40 machine, using two GPUs is slower than using one.
1 GPU output saved to results/cogvideox_dp1_cfg1_ulysses1_ring1_tp1_pp1_patchNone_720x640.mp4 epoch time: 2.42 sec, memory: 28.733202944 GB
2 GPU output saved to results/cogvideox_dp1_cfg1_ulysses2_ring1_tp1_pp1_patchNone_720x640.mp4 epoch time: 2.58 sec, memory: 29.172894208 GB
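To put numbers on the slowdown, here is a quick calculation from the two runs above. It only uses the reported epoch times and memory figures; the variable names are mine, and "efficiency" is just speedup divided by the GPU count.

```python
# Figures copied from the two runs reported above.
t_1gpu, t_2gpu = 2.42, 2.58                      # epoch time in seconds
mem_1gpu, mem_2gpu = 28.733202944, 29.172894208  # memory in GB

num_gpus = 2
speedup = t_1gpu / t_2gpu            # > 1 would mean 2 GPUs are faster
efficiency = speedup / num_gpus      # 1.0 would be perfect linear scaling
mem_overhead = mem_2gpu - mem_1gpu   # extra memory reported in the 2-GPU run

print(f"speedup:      {speedup:.3f}x")
print(f"efficiency:   {efficiency:.1%}")
print(f"mem overhead: {mem_overhead:.3f} GB")
```

A speedup below 1 means that, for this configuration, adding the second GPU costs more than the parallelism gains.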
xDiT currently implements the sequence parallel version of CogVideoX. However, it comes with the following restrictions:
- head_num (30 here) % ulysses_degree == 0
- height % sp_degree == 0
- With --height 640 --width 720 and sp_degree = 8 (uly=2, ring=4), the VAE decoder throws an error.
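The two divisibility restrictions can be checked before launching a run. The sketch below is not part of xDiT: `check_sp_config` is a hypothetical helper, and it assumes sp_degree = ulysses_degree * ring_degree, which matches the uly=2, ring=4, sp_degree = 8 case above.

```python
def check_sp_config(head_num, height, ulysses_degree, ring_degree):
    """Return the violated divisibility restrictions (empty list if none).

    Hypothetical helper, not an xDiT API. Assumes
    sp_degree = ulysses_degree * ring_degree.
    """
    sp_degree = ulysses_degree * ring_degree
    problems = []
    if head_num % ulysses_degree != 0:
        problems.append(
            f"head_num ({head_num}) is not divisible by "
            f"ulysses_degree ({ulysses_degree})"
        )
    if height % sp_degree != 0:
        problems.append(
            f"height ({height}) is not divisible by sp_degree ({sp_degree})"
        )
    return problems


# CogVideoX has 30 attention heads, so ulysses_degree=4 is rejected.
print(check_sp_config(head_num=30, height=640, ulysses_degree=4, ring_degree=1))

# uly=2, ring=4 passes both divisibility checks.
print(check_sp_config(head_num=30, height=640, ulysses_degree=2, ring_degree=4))
```

Note that the uly=2, ring=4 configuration at 640x720 satisfies both checks and still fails in the VAE decoder, so these checks are necessary but not sufficient.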