xdit-project / xDiT

xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism

flux_example.py gets stuck #363

Open fy1214 opened 4 days ago

fy1214 commented 4 days ago

I ran this command:

torchrun --nproc_per_node=2 examples/flux_example.py --model /models/FLUX.1-dev --height 1024 --width 1024 --pipefusion_parallel_degree 2 --ulysses_degree 1 --ring_degree 1 --num_inference_steps 20 --warmup_steps 0 --prompt "A small dog"

but it gets stuck in _async_pipeline. This is what the output looks like:

W1125 17:34:30.620696 71014 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
WARNING 11-25 17:34:35 [args.py:320] Distributed environment is not initialized. Initializing...
DEBUG 11-25 17:34:35 [parallel_state.py:179] world_size=-1 rank=-1 local_rank=-1 distributed_init_method=env:// backend=nccl
WARNING 11-25 17:34:35 [args.py:320] Distributed environment is not initialized. Initializing...
DEBUG 11-25 17:34:35 [parallel_state.py:179] world_size=-1 rank=-1 local_rank=-1 distributed_init_method=env:// backend=nccl
INFO 11-25 17:34:35 [config.py:163] Pipeline patch number not set, using default value 2
INFO 11-25 17:34:35 [config.py:163] Pipeline patch number not set, using default value 2
Loading pipeline components...: 0%| | 0/7 [00:00<?, ?it/s]
You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%| 2/2 [00:00<00:00, 11.46it/s]
You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%| 2/2 [00:00<00:00, 11.47it/s]
Loading checkpoint shards: 100%| 2/2 [00:00<00:00, 11.71it/s]
Loading pipeline components...: 100%| 7/7 [00:01<00:00, 6.98it/s]
WARNING 11-25 17:34:36 [runtime_state.py:63] Model parallel is not initialized, initializing...
Loading pipeline components...: 100%| 7/7 [00:01<00:00, 6.85it/s]
WARNING 11-25 17:34:36 [runtime_state.py:63] Model parallel is not initialized, initializing...
INFO 11-25 17:34:36 [base_pipeline.py:290] Transformer backbone found, paralleling transformer...
INFO 11-25 17:34:36 [base_pipeline.py:290] Transformer backbone found, paralleling transformer...
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.0.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.0.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
[... similar "Wrapping ..." INFO lines for the remaining transformer_blocks and single_transformer_blocks on both ranks ...]
INFO 11-25 17:34:36 [base_pipeline.py:335] Scheduler found, paralleling scheduler...
INFO 11-25 17:34:36 [base_pipeline.py:335] Scheduler found, paralleling scheduler...
100%| 1/1 [00:00<00:00, 1.35it/s]
100%| 1/1 [00:02<00:00, 2.41s/it]
5%|███████▋

It always gets stuck at 5%. I dumped the stack, and it looks like this:

Thread 71079 (active): "MainThread"
    synchronize (torch/cuda/__init__.py:954)
    _communicate_shapes (xfuser/core/distributed/group_coordinator.py:865)
    _check_shape_and_buffer (xfuser/core/distributed/group_coordinator.py:796)
    recv_next (xfuser/core/distributed/group_coordinator.py:938)
    _async_pipeline (xfuser/model_executor/pipelines/pipeline_flux.py:572)
    __call__ (xfuser/model_executor/pipelines/pipeline_flux.py:319)
    check_naive_forward_fn (xfuser/model_executor/pipelines/base_pipeline.py:186)
    data_parallel_fn (xfuser/model_executor/pipelines/base_pipeline.py:166)
    wrapper (xfuser/model_executor/pipelines/base_pipeline.py:218)
    decorate_context (torch/utils/_contextlib.py:116)
    main (flux_example.py:42)
    (flux_example.py:81)
Thread 71098 (idle): "Thread-1"
    wait (threading.py:324)
    wait (threading.py:600)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1009)
    _bootstrap (threading.py:966)

The other process was:

Thread 71080 (idle): "MainThread"
    isend (torch/distributed/distributed_c10d.py:2062)
    _pipeline_isend (xfuser/core/distributed/group_coordinator.py:968)
    pipeline_isend (xfuser/core/distributed/group_coordinator.py:921)
    _async_pipeline (xfuser/model_executor/pipelines/pipeline_flux.py:634)
    __call__ (xfuser/model_executor/pipelines/pipeline_flux.py:319)
    check_naive_forward_fn (xfuser/model_executor/pipelines/base_pipeline.py:186)
    data_parallel_fn (xfuser/model_executor/pipelines/base_pipeline.py:166)
    wrapper (xfuser/model_executor/pipelines/base_pipeline.py:218)
    decorate_context (torch/utils/_contextlib.py:116)
    main (flux_example.py:42)
    (flux_example.py:81)
Thread 71099 (idle): "Thread-1"
    wait (threading.py:324)
    wait (threading.py:600)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1009)
    _bootstrap (threading.py:966)

Something seems to be going wrong; please give me a little help.
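(These dumps match py-spy's output format; assuming that tool, the same dump can be reproduced per process with:)

    py-spy dump --pid <pid>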
fy1214 commented 4 days ago

I found out why it gets stuck: NCCL times out.

[rank0]:[E1125 17:44:44.587302788 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10, OpType=SEND, NumelIn=6291456, NumelOut=6291456, Timeout(ms)=600000) ran for 600024 milliseconds before timing out.
[rank0]:[E1125 17:44:44.588313115 ProcessGroupNCCL.cpp:1785] [PG ID 4 PG GUID 11 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 10, last enqueued NCCL work: 11, last completed NCCL work: 11.
[rank1]:[W1125 17:44:44.646952859 socket.cpp:462] [c10d] waitForInput: poll for socket SocketImpl(fd=21, addr=[::ffff:127.0.0.1]:39518, remote=[::ffff:127.0.0.1]:29500) returned 0, likely a timeout
[rank1]:[W1125 17:44:44.647828424 socket.cpp:487] [c10d] waitForInput: socket SocketImpl(fd=21, addr=[::ffff:127.0.0.1]:39518, remote=[::ffff:127.0.0.1]:29500) timed out after 600000ms
0%| | 0/20 [10:00<?, ?it/s]
[rank1]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: wait timeout after 600000ms, keys: //worker/attempt_0/default_pg/0//12//cuda//0:1
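For what it's worth, the 600000 ms in that log is the default 10-minute process-group timeout; when debugging a hang it can help to raise it so the watchdog does not kill the process before a stack dump is taken. A minimal sketch, assuming a standalone init_process_group call (xDiT normally initializes the process group itself):

    import datetime
    import torch.distributed as dist

    # Raise the NCCL watchdog timeout from its 10-minute default so a hung
    # send/recv can be inspected before the watchdog aborts the process.
    dist.init_process_group(
        backend="nccl",
        timeout=datetime.timedelta(hours=2),
    )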

yinfan98 commented 4 days ago

same error

fy1214 commented 4 days ago

same error

I used an A100 instead of the L20, and it works now.

feifeibear commented 4 days ago

I suggest you check your NCCL environment variables related to PCIe, for example NCCL_P2P_DISABLE=0.
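Applied to the launch command from the original report, that would be:

NCCL_P2P_DISABLE=0 torchrun --nproc_per_node=2 examples/flux_example.py --model /models/FLUX.1-dev --height 1024 --width 1024 --pipefusion_parallel_degree 2 --ulysses_degree 1 --ring_degree 1 --num_inference_steps 20 --warmup_steps 0 --prompt "A small dog"

(0 leaves P2P enabled; comparing against NCCL_P2P_DISABLE=1 can also help tell whether the PCIe peer-to-peer path itself is the problem.)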

fy1214 commented 3 days ago

I suggest you check your NCCL environment variables related to PCIe, for example NCCL_P2P_DISABLE=0.

I tried NCCL_P2P_DISABLE=0, but it still gets stuck in the same place. I checked the code according to the stack information; it looks like it is stuck in this operation:

        # Post an async recv for the dimensionality (number of dims) of the
        # tensor arriving from the previous pipeline rank.
        if recv_prev:
            recv_prev_dim_tensor = torch.empty(
                (1), device=self.device, dtype=torch.int64
            )
            recv_prev_dim_op = torch.distributed.P2POp(
                torch.distributed.irecv,
                recv_prev_dim_tensor,
                self.prev_rank,
                self.device_group,
            )
            ops.append(recv_prev_dim_op)

        # Post an async send of the outgoing tensor's dimensionality to the
        # next pipeline rank.
        if tensor_send_to_next is not None:
            send_next_dim_tensor = torch.tensor(
                tensor_send_to_next.dim(), device=self.device, dtype=torch.int64
            )
            send_next_dim_op = torch.distributed.P2POp(
                torch.distributed.isend,
                send_next_dim_tensor,
                self.next_rank,
                self.device_group,
            )
            ops.append(send_next_dim_op)

        # Launch all queued P2P ops as one batch and block until they finish.
        if len(ops) > 0:
            reqs = torch.distributed.batch_isend_irecv(ops)
            for req in reqs:
                req.wait()

        # To protect against race condition when using batch_isend_irecv().
        # should take this out once the bug with batch_isend_irecv is resolved.
        torch.cuda.synchronize()

Maybe something is wrong with NCCL. Does the L20 not support this?
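To check whether batch_isend_irecv itself deadlocks on this hardware independently of xDiT, a minimal standalone sketch of the same handshake could be run (an illustration, not xDiT code; launch with torchrun --nproc_per_node=2 minimal_p2p_test.py):

    # minimal_p2p_test.py -- hypothetical repro script, not part of xDiT
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(rank)
    device = torch.device("cuda", rank)

    # Each rank sends one int64 to the next rank and receives one from the
    # previous rank, mirroring the dim handshake in _communicate_shapes.
    send_buf = torch.tensor([rank], device=device, dtype=torch.int64)
    recv_buf = torch.empty(1, device=device, dtype=torch.int64)
    next_rank = (rank + 1) % world
    prev_rank = (rank - 1) % world

    ops = [
        dist.P2POp(dist.isend, send_buf, next_rank),
        dist.P2POp(dist.irecv, recv_buf, prev_rank),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    torch.cuda.synchronize()
    print(f"rank {rank} received {recv_buf.item()} from rank {prev_rank}")
    dist.destroy_process_group()

If this script also hangs, the problem lies in the NCCL/PCIe setup rather than in xDiT.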

feifeibear commented 3 days ago

    # To protect against race condition when using batch_isend_irecv().
    # should take this out once the bug with batch_isend_irecv is resolved.
    torch.cuda.synchronize()

Thank you for the assistance in debugging. I believe the issue is still related to the implementation of P2P in PipeFusion: it uses asynchronous P2P, which may have bugs that lead to deadlocks. Particularly on the L20, cross-NUMA communication might be routed through the CPU's QPI.

@Lay2000 will help debug this issue.
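A quick related check (plain PyTorch, not xDiT code) is whether CUDA reports a direct peer path between the two GPUs; False means P2P traffic is staged through host memory:

    import torch

    # Probe direct peer access between every pair of visible GPUs.
    n = torch.cuda.device_count()
    for a in range(n):
        for b in range(n):
            if a != b:
                ok = torch.cuda.can_device_access_peer(a, b)
                print(f"GPU {a} -> GPU {b}: peer access = {ok}")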

fy1214 commented 3 days ago

    # To protect against race condition when using batch_isend_irecv().
    # should take this out once the bug with batch_isend_irecv is resolved.
    torch.cuda.synchronize()

Thank you for the assistance in debugging. I believe the issue is still related to the implementation of P2P in PipeFusion: it uses asynchronous P2P, which may have bugs that lead to deadlocks. Particularly on the L20, cross-NUMA communication might be routed through the CPU's QPI.

@Lay2000 will help debug this issue.

Thanks! Could you let me know if there is any progress? I would also like to know what is causing this problem.

Lay2000 commented 2 days ago

To protect against race condition when using batch_isend_irecv().

@fy1214 Hi, we found the issue might be due to NCCL's use of shared memory (SHM). You can export NCCL_DEBUG='INFO' to get more information and check whether lines like 'via SHM/direct/direct' appear. If so, try export NCCL_SHM_DISABLE='1' before running the scripts.
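Applied to the original launch command, that would be:

    export NCCL_DEBUG=INFO
    export NCCL_SHM_DISABLE=1
    torchrun --nproc_per_node=2 examples/flux_example.py --model /models/FLUX.1-dev --height 1024 --width 1024 --pipefusion_parallel_degree 2 --ulysses_degree 1 --ring_degree 1 --num_inference_steps 20 --warmup_steps 0 --prompt "A small dog"

(NCCL_SHM_DISABLE=1 forces NCCL off the shared-memory transport between processes on the same host, so it falls back to the P2P or network path.)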