Open algorithmconquer opened 1 day ago
Please provide more context.
@feifeibear The run command is:
torchrun --nproc_per_node=8 examples/flux_example.py \
  --model ${modelId} \
  --height 1024 --width 1024 \
  --pipefusion_parallel_degree 2 --ulysses_degree 2 --ring_degree 2 \
  --num_inference_steps 28 --warmup_steps 0 --prompt "A small dog" \
  --no_use_resolution_binning
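As a quick sanity check on the command above, the three parallel degrees should presumably multiply to the number of launched processes (2 x 2 x 2 = 8 here). The sketch below assumes that rule and that no other degree (for example a CFG or data-parallel one) is in play. `check_parallel_degrees` is a hypothetical helper, not part of xfuser.

```python
from math import prod


def check_parallel_degrees(world_size, **degrees):
    """Return True if the product of all given degrees equals world_size.

    Assumption: the launcher expects the product of the parallel degrees
    to match --nproc_per_node. This helper is illustrative only.
    """
    return prod(degrees.values()) == world_size


# Values taken from the run command in this thread.
print(check_parallel_degrees(8, pipefusion=2, ulysses=2, ring=2))
```

If this prints `False` for a given configuration, the degree flags are worth rechecking before looking at the CUDA side.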
The environment is:
xfuser==0.3.5
diffusers==0.32.0.dev0
torch==2.4.0+cu124
The CUDA device is H20, 8 cards.
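To make the environment report easier to reproduce, the installed versions can be collected with the standard library alone. This is a minimal sketch; `report_versions` is a hypothetical helper name, and the package list simply mirrors the three packages named above.

```python
from importlib import metadata


def report_versions(packages=("xfuser", "diffusers", "torch")):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print in the same "name==version" style used in this thread.
for name, version in report_versions().items():
    print(f"{name}=={version}" if version else f"{name}: not installed")
```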
I could not reproduce your error. Can you run it successfully on a single GPU?
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling
cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
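Since the failing call is a bfloat16 `cublasGemmEx`, one way to narrow things down on a single GPU is a plain bf16 matrix multiply in PyTorch, outside of xfuser. The sketch below is only a rough probe: `bf16_matmul_ok` is a hypothetical helper, the matrix size is arbitrary, and a passing result does not rule out failures that depend on the specific shapes or strides used by the model.

```python
def bf16_matmul_ok(size=1024):
    """Try one bf16 matmul; return True/False, or None if torch is missing."""
    try:
        import torch
    except ImportError:
        return None

    # Use the first GPU when available, otherwise fall back to CPU.
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    try:
        a = torch.randn(size, size, dtype=torch.bfloat16, device=device)
        b = torch.randn(size, size, dtype=torch.bfloat16, device=device)
        c = a @ b
        if device != "cpu":
            # CUDA errors can surface asynchronously, so force completion.
            torch.cuda.synchronize()
        return tuple(c.shape) == (size, size)
    except RuntimeError as exc:
        print(f"bf16 matmul failed on {device}: {exc}")
        return False


print(bf16_matmul_ok())
```

Running it with `CUDA_VISIBLE_DEVICES` set to one card at a time would show whether the problem is tied to a particular device.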