Open · fy1214 opened 4 days ago
Trying to find out why it gets stuck; NCCL times out:

[rank0]:[E1125 17:44:44.587302788 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10, OpType=SEND, NumelIn=6291456, NumelOut=6291456, Timeout(ms)=600000) ran for 600024 milliseconds before timing out.
[rank0]:[E1125 17:44:44.588313115 ProcessGroupNCCL.cpp:1785] [PG ID 4 PG GUID 11 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 10, last enqueued NCCL work: 11, last completed NCCL work: 11.
[rank1]:[W1125 17:44:44.646952859 socket.cpp:462] [c10d] waitForInput: poll for socket SocketImpl(fd=21, addr=[::ffff:127.0.0.1]:39518, remote=[::ffff:127.0.0.1]:29500) returned 0, likely a timeout
[rank1]:[W1125 17:44:44.647828424 socket.cpp:487] [c10d] waitForInput: socket SocketImpl(fd=21, addr=[::ffff:127.0.0.1]:39518, remote=[::ffff:127.0.0.1]:29500) timed out after 600000ms
0%| | 0/20 [10:00<?, ?it/s]
[rank1]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: wait timeout after 600000ms, keys: //worker/attempt_0/default_pg/0//12//cuda//0:1
same error
same error
I used an A100 instead of the L20, and it succeeds now.
I suggest you check your NCCL environment variables related to PCIe, for example NCCL_P2P_DISABLE=0.
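As a quick sanity check (a sketch, untested on this setup), you can print the NCCL version PyTorch was built with and the PCIe/P2P-related variables as they are actually seen inside a worker process:

import os
import torch

# Print the NCCL version bundled with this PyTorch build, e.g. (2, 21, 5).
print("torch:", torch.__version__)
print("nccl:", torch.cuda.nccl.version())

# Show which of the variables discussed in this thread are set in this process.
for var in ("NCCL_P2P_DISABLE", "NCCL_SHM_DISABLE", "NCCL_DEBUG"):
    print(var, "=", os.environ.get(var, "<unset>"))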
I used NCCL_P2P_DISABLE=0 but it still gets stuck in the same place. I checked the code according to the stack information, and it looks like it is stuck in this operation:
if recv_prev:
    recv_prev_dim_tensor = torch.empty(
        (1,), device=self.device, dtype=torch.int64
    )
    recv_prev_dim_op = torch.distributed.P2POp(
        torch.distributed.irecv,
        recv_prev_dim_tensor,
        self.prev_rank,
        self.device_group,
    )
    ops.append(recv_prev_dim_op)
if tensor_send_to_next is not None:
    send_next_dim_tensor = torch.tensor(
        tensor_send_to_next.dim(), device=self.device, dtype=torch.int64
    )
    send_next_dim_op = torch.distributed.P2POp(
        torch.distributed.isend,
        send_next_dim_tensor,
        self.next_rank,
        self.device_group,
    )
    ops.append(send_next_dim_op)
if len(ops) > 0:
    reqs = torch.distributed.batch_isend_irecv(ops)
    for req in reqs:
        req.wait()

# To protect against race condition when using batch_isend_irecv().
# should take this out once the bug with batch_isend_irecv is resolved.
torch.cuda.synchronize()
Maybe something is wrong with NCCL; perhaps the L20 doesn't support this?
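To check whether the hang is in NCCL P2P itself rather than in xFuser, a minimal standalone repro along these lines might help (a sketch I put together, not part of the issue; the file name repro.py is made up). It runs the same batch_isend_irecv pattern between two ranks and nothing else; launch with torchrun --nproc_per_node=2 repro.py:

import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK/WORLD_SIZE/MASTER_ADDR, so env:// init works here.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    device = torch.device("cuda", rank)

    peer = 1 - rank  # with 2 ranks, each rank talks to the other one
    send_buf = torch.full((1,), rank, device=device, dtype=torch.int64)
    recv_buf = torch.empty((1,), device=device, dtype=torch.int64)

    # Same batched async P2P pattern as the xFuser snippet above.
    ops = [
        dist.P2POp(dist.isend, send_buf, peer),
        dist.P2POp(dist.irecv, recv_buf, peer),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    torch.cuda.synchronize()

    print(f"rank {rank} received {recv_buf.item()} from rank {peer}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this hangs on the L20 box too, the problem is below xFuser (NCCL, driver, or PCIe topology); if it completes, the bug is more likely in PipeFusion's pipeline logic.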
# To protect against race condition when using batch_isend_irecv().
# should take this out once the bug with batch_isend_irecv is resolved.
torch.cuda.synchronize()
Thank you for the assistance in debugging. I believe the issue is still related to the implementation of P2P in PipeFusion: it uses asynchronous P2P, which may have bugs leading to deadlocks. Particularly on the L20, cross-NUMA communication might be routed through the CPU's QPI.
@Lay2000 will help debug this issue.
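If the asynchronous batched P2P is indeed the culprit, one workaround to experiment with (a sketch of the idea, not a confirmed fix) is to fall back to blocking send/recv for the dim exchange. The names tensor_send_to_next, recv_prev, prev_rank, next_rank, and device_group mirror the xFuser snippet above; the parity ordering avoids a ring deadlock when the number of pipeline ranks is even:

import torch
import torch.distributed as dist

def exchange_dims_blocking(tensor_send_to_next, recv_prev,
                           prev_rank, next_rank, device, device_group=None):
    # Blocking variant of the dim exchange. Even ranks send first and odd
    # ranks receive first, so with an even ring size every blocking send is
    # matched by an already-posted recv and the ring cannot deadlock.
    recv_prev_dim_tensor = None
    if recv_prev:
        recv_prev_dim_tensor = torch.empty((1,), device=device, dtype=torch.int64)

    def do_send():
        if tensor_send_to_next is not None:
            send_next_dim_tensor = torch.tensor(
                tensor_send_to_next.dim(), device=device, dtype=torch.int64
            )
            dist.send(send_next_dim_tensor, next_rank, group=device_group)

    def do_recv():
        if recv_prev_dim_tensor is not None:
            dist.recv(recv_prev_dim_tensor, prev_rank, group=device_group)

    if dist.get_rank() % 2 == 0:
        do_send()
        do_recv()
    else:
        do_recv()
        do_send()
    return recv_prev_dim_tensor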
Thanks! Could you let me know if there is any progress? I would also like to know what is causing this problem.
@fy1214 Hi, we found the issue might be due to the usage of SHM in NCCL. You can try export NCCL_DEBUG='INFO' to get more information and check whether there are lines like 'via SHM/direct/direct'. If so, try export NCCL_SHM_DISABLE='1' before running the scripts.
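If exporting in the shell is awkward, the same can be done at the very top of the example script, before any process group or NCCL communicator is created (a sketch of the idea; NCCL reads these variables at communicator initialization time):

# Must run before torch.distributed creates the first NCCL communicator.
import os
os.environ["NCCL_DEBUG"] = "INFO"      # look for transport lines like 'via SHM/direct/direct'
os.environ["NCCL_SHM_DISABLE"] = "1"   # disable the suspected SHM transport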
I use this command:
torchrun --nproc_per_node=2 examples/flux_example.py --model /models/FLUX.1-dev --height 1024 --width 1024 --pipefusion_parallel_degree 2 --ulysses_degree 1 --ring_degree 1 --num_inference_steps 20 --warmup_steps 0 --prompt "A small dog"
but it gets stuck in _async_pipeline. This is what it looks like:
W1125 17:34:30.620696 71014 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
WARNING 11-25 17:34:35 [args.py:320] Distributed environment is not initialized. Initializing...
DEBUG 11-25 17:34:35 [parallel_state.py:179] world_size=-1 rank=-1 local_rank=-1 distributed_init_method=env:// backend=nccl
WARNING 11-25 17:34:35 [args.py:320] Distributed environment is not initialized. Initializing...
DEBUG 11-25 17:34:35 [parallel_state.py:179] world_size=-1 rank=-1 local_rank=-1 distributed_init_method=env:// backend=nccl
INFO 11-25 17:34:35 [config.py:163] Pipeline patch number not set, using default value 2
INFO 11-25 17:34:35 [config.py:163] Pipeline patch number not set, using default value 2
Loading pipeline components...: 0%| | 0/7 [00:00<?, ?it/s]
You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100% 2/2 [00:00<00:00, 11.46it/s]
You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100% 2/2 [00:00<00:00, 11.47it/s]
Loading checkpoint shards: 100% 2/2 [00:00<00:00, 11.71it/s]
Loading pipeline components...: 100% 7/7 [00:01<00:00, 6.98it/s]
WARNING 11-25 17:34:36 [runtime_state.py:63] Model parallel is not initialized, initializing...
Loading pipeline components...: 100% 7/7 [00:01<00:00, 6.85it/s]
WARNING 11-25 17:34:36 [runtime_state.py:63] Model parallel is not initialized, initializing...
INFO 11-25 17:34:36 [base_pipeline.py:290] Transformer backbone found, paralleling transformer...
INFO 11-25 17:34:36 [base_pipeline.py:290] Transformer backbone found, paralleling transformer...
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.0.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
[... many similar lines: RANK 0 wraps transformer_blocks.1 through transformer_blocks.18 and single_transformer_blocks.0 through single_transformer_blocks.9; RANK 1 wraps single_transformer_blocks.0 through single_transformer_blocks.27 ...]
INFO 11-25 17:34:36 [base_pipeline.py:335] Scheduler found, paralleling scheduler...
INFO 11-25 17:34:36 [base_pipeline.py:335] Scheduler found, paralleling scheduler...
100% 1/1 [00:00<00:00, 1.35it/s]
100% 1/1 [00:02<00:00, 2.41s/it]
5%|███████▋

It always gets stuck at 5%, and when I dump the stack it looks like:
Thread 71079 (active): "MainThread"
    synchronize (torch/cuda/__init__.py:954)
    _communicate_shapes (xfuser/core/distributed/group_coordinator.py:865)
    _check_shape_and_buffer (xfuser/core/distributed/group_coordinator.py:796)
    recv_next (xfuser/core/distributed/group_coordinator.py:938)
    _async_pipeline (xfuser/model_executor/pipelines/pipeline_flux.py:572)
    __call__ (xfuser/model_executor/pipelines/pipeline_flux.py:319)
    check_naive_forward_fn (xfuser/model_executor/pipelines/base_pipeline.py:186)
    data_parallel_fn (xfuser/model_executor/pipelines/base_pipeline.py:166)
    wrapper (xfuser/model_executor/pipelines/base_pipeline.py:218)
    decorate_context (torch/utils/_contextlib.py:116)
    main (flux_example.py:42)
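The dump above looks like py-spy output. For anyone reproducing this without extra tools, a sketch of an alternative (Unix only; the signal choice is arbitrary) is to register faulthandler early in the script and signal the stuck rank:

import faulthandler
import signal

# After this, `kill -USR1 <pid>` makes the rank print all Python thread
# stacks to stderr without stopping the process.
faulthandler.register(signal.SIGUSR1, all_threads=True)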