Hi, thanks for your great work!
I encountered an error when trying to run the CogVideoX model on a single A800. The error occurs in the forward pass of the transformer, specifically when adding positional embeddings to the hidden states. The tensor sizes do not match, which suggests a potential issue with the model's implementation or configuration.
Here is the output log:
srun --account=test --exclusive=user -p A800 -N 1 --time 20:00 --job-name=xdit --ntasks-per-node=1 --gres=gpu:1 --export=ALL bash ./examples/infer.sh
srun: job 3975 queued and waiting for resources
srun: job 3975 has been allocated resources
++ awk '{print $2}'
++ grep BatchHost
++ tr = ' '
++ scontrol show jobid=3975
export MASTER_ADDR=g41
MASTER_ADDR=g41
++ expr 0 / 1
export NODE_RANK=0
NODE_RANK=0
export CUDA_DEVICE_MAX_CONNECTIONS=1
CUDA_DEVICE_MAX_CONNECTIONS=1
export CUDA_LAUNCH_BLOCKING=1
CUDA_LAUNCH_BLOCKING=1
export RANK=0
RANK=0
export LOCAL_RANK=0
LOCAL_RANK=0
exec python -W ignore ./examples/cogvideox_example.py --model /home/test/test01/cyy/Data/models--THUDM--CogVideoX-2b/snapshots/ad5ce8664edfdc95cdb9773dd4f80048b25f69ac --ulysses_degree 1 --num_inference_steps 1 --warmup_steps 0 --prompt 'sunset over the sea.'
WARNING 09-10 12:13:01 [args.py:250] Distributed environment is not initialized. Initializing...
DEBUG 09-10 12:13:01 [parallel_state.py:180] world_size=-1 rank=-1 local_rank=-1 distributed_init_method=env:// backend=nccl
===========0 - Parallel Group Initalized!===========
INFO 09-10 12:13:01 [config.py:93] Ring degree not set, using default value 1
INFO 09-10 12:13:01 [config.py:137] Pipeline patch number not set, using default value 1
Loading checkpoint shards: 100%|██████████| 2/2 [00:08<00:00, 4.15s/it]
Loading pipeline components...: 100%|██████████| 5/5 [00:12<00:00, 2.45s/it]
WARNING 09-10 12:13:13 [runtime_state.py:63] Model parallel is not initialized, initializing...
INFO 09-10 12:13:13 [base_pipeline.py:236] Transformer backbone found, but model parallelism is not enabled, use naive model
INFO 09-10 12:13:13 [base_pipeline.py:286] Scheduler found, paralleling scheduler...
0%| | 0/1 [00:00<?, ?it/s]
rank0: Traceback (most recent call last):
rank0: File "/home/test/test01/cyy/xDiT/./examples/cogvideox_example.py", line 63, in <module>
rank0: File "/home/test/test01/cyy/xDiT/./examples/cogvideox_example.py", line 38, in main
rank0: output = pipe(
rank0: File "/home/test/test01/anaconda3/envs/xdit/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
rank0: return func(*args, **kwargs)
rank0: File "/home/test/test01/cyy/xDiT/xfuser/model_executor/pipelines/base_pipeline.py", line 131, in data_parallel_fn
rank0: return func(self, *args, **kwargs)
rank0: File "/home/test/test01/cyy/xDiT/xfuser/model_executor/pipelines/base_pipeline.py", line 145, in check_naive_forward_fn
rank0: return self.module(*args, **kwargs)
rank0: File "/home/test/test01/anaconda3/envs/xdit/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
rank0: return func(*args, **kwargs)
rank0: File "/home/test/test01/anaconda3/envs/xdit/lib/python3.10/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox.py", line 687, in __call__
rank0: noise_pred = self.transformer(
rank0: File "/home/test/test01/anaconda3/envs/xdit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
rank0: return self._call_impl(*args, **kwargs)
rank0: File "/home/test/test01/anaconda3/envs/xdit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
rank0: return forward_call(*args, **kwargs)
rank0: File "/home/test/test01/anaconda3/envs/xdit/lib/python3.10/site-packages/diffusers/models/transformers/cogvideox_transformer_3d.py", line 432, in forward
rank0: hidden_states = hidden_states + pos_embeds
rank0: RuntimeError: The size of tensor a (53474) must match the size of tensor b (17776) at non-singleton dimension 1
srun: error: g41: task 0: Exited with exit code 1
Additional notes:
I did not modify any settings or configurations from the default.
This error occurs consistently when trying to run the model.
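A rough sanity check on the two sizes in the error message, in case it helps narrow things down. The sketch below counts transformer tokens (text tokens plus video patches) for a given output resolution. Every constant in it is my assumption about CogVideoX-2b (226 text tokens, 8x spatial and 4x temporal VAE compression, patch size 2, 49 frames), not something read from the log, and `sequence_length` is a hypothetical helper, not a function from xDiT or diffusers.

```python
# Rough token-count check for the two sizes in the RuntimeError.
# All constants are assumptions about CogVideoX-2b, not values taken from the log.
TEXT_TOKENS = 226   # assumed maximum text sequence length
VAE_SPATIAL = 8     # assumed VAE spatial downsampling factor
VAE_TEMPORAL = 4    # assumed VAE temporal compression factor
PATCH = 2           # assumed transformer patch size


def sequence_length(height, width, num_frames):
    """Text tokens plus video patch tokens for a given output resolution."""
    latent_frames = (num_frames - 1) // VAE_TEMPORAL + 1
    patches_h = height // VAE_SPATIAL // PATCH
    patches_w = width // VAE_SPATIAL // PATCH
    return TEXT_TOKENS + latent_frames * patches_h * patches_w


print(sequence_length(480, 720, 49))    # 17776, the size of tensor b (pos_embeds)
print(sequence_length(1024, 1024, 49))  # 53474, the size of tensor a (hidden_states)
```

Under these assumptions, the positional embeddings look like they were built for 480x720 while the hidden states correspond to 1024x1024. I have not confirmed which height and width the example script actually passes, so this is only a guess at the cause.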
Thanks for your help!