xdit-project / xDiT

xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters

FLUX with SP: parallel generation produces different images #262

Open lixiang007666 opened 6 days ago

lixiang007666 commented 6 days ago

Problem description

I tested with the seed fixed. To confirm that the seed really is fixed, I first ran the multi-GPU script repeatedly and verified that the image is identical on every run.

Under this condition, the images generated with different numbers of GPUs are:

[Attached result images, one per configuration:]
flux_result_dp1_cfg1_ulysses1_ringNone_tp1_pp1_patchNone_0
flux_result_dp1_cfg1_ulysses2_ringNone_tp1_pp1_patchNone_0
flux_result_dp1_cfg1_ulysses4_ringNone_tp1_pp1_patchNone_0
flux_result_dp1_cfg1_ulysses8_ringNone_tp1_pp1_patchNone_0

As the images show, the output at 1024×1024 is lossy, although with two GPUs the loss is relatively small (it also depends on the seed).

One observation is that the loss is larger at 512×512.
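One way to make this observation concrete is to diff two of the result images numerically. Below is a minimal sketch (not part of xDiT); the file names and the .png extension are assumptions based on the result names listed above:

# compare_outputs.py -- quantify the pixel difference between two runs.
# Hypothetical helper; the file names and extension are assumptions.
import numpy as np
from PIL import Image

ref  = np.asarray(Image.open("results/flux_result_dp1_cfg1_ulysses1_ringNone_tp1_pp1_patchNone_0.png"), dtype=np.float32)
test = np.asarray(Image.open("results/flux_result_dp1_cfg1_ulysses8_ringNone_tp1_pp1_patchNone_0.png"), dtype=np.float32)

diff = np.abs(ref - test)
print(f"max abs diff: {diff.max():.1f}/255, MSE: {(diff ** 2).mean():.3f}")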

Reproduction script:

set -x

# export NCCL_PXN_DISABLE=1
# # export NCCL_DEBUG=INFO
# export NCCL_SOCKET_IFNAME=eth0
# export NCCL_IB_GID_INDEX=3
# export NCCL_IB_DISABLE=0
# export NCCL_NET_GDR_LEVEL=2
# export NCCL_IB_QPS_PER_CONNECTION=4
# export NCCL_IB_TC=160
# export NCCL_IB_TIMEOUT=22
# export NCCL_P2P=0
# export CUDA_DEVICE_MAX_CONNECTIONS=1

export PYTHONPATH=$PWD:$PYTHONPATH

# Select the model type
# The model is downloaded to a specified location on disk, 
# or you can simply use the model's ID on Hugging Face, 
# which will then be downloaded to the default cache path on Hugging Face.

export MODEL_TYPE="Flux"
# Configuration for different model types
# script, model_id, inference_step
declare -A MODEL_CONFIGS=(
    ["Pixart-alpha"]="pixartalpha_example.py /mnt/models/SD/PixArt-XL-2-1024-MS 20"
    ["Pixart-sigma"]="pixartsigma_example.py /cfs/dit/PixArt-Sigma-XL-2-2K-MS 20"
    ["Sd3"]="sd3_example.py /cfs/dit/stable-diffusion-3-medium-diffusers 20"
    ["Flux"]="flux_example.py black-forest-labs/FLUX.1-dev 20"
    ["HunyuanDiT"]="hunyuandit_example.py /mnt/models/SD/HunyuanDiT-v1.2-Diffusers 50"
    ["CogVideoX"]="cogvideox_example.py /cfs/dit/CogVideoX-2b 1"
)

if [[ -v MODEL_CONFIGS[$MODEL_TYPE] ]]; then
    IFS=' ' read -r SCRIPT MODEL_ID INFERENCE_STEP <<< "${MODEL_CONFIGS[$MODEL_TYPE]}"
    export SCRIPT MODEL_ID INFERENCE_STEP
else
    echo "Invalid MODEL_TYPE: $MODEL_TYPE"
    exit 1
fi

mkdir -p ./results

for HEIGHT in 1024
do
for N_GPUS in 8;
do 

# task args
if [ "$MODEL_TYPE" = "CogVideoX" ]; then
  TASK_ARGS="--height 480 --width 720 --num_frames 9"
else
  TASK_ARGS="--height $HEIGHT --width $HEIGHT --no_use_resolution_binning"
fi

# Flux only supports SP; do not set the pipefusion degree
if [ "$MODEL_TYPE" = "Flux" ] || [ "$MODEL_TYPE" = "CogVideoX" ]; then
PARALLEL_ARGS="--ulysses_degree $N_GPUS"
export CFG_ARGS=""
elif [ "$MODEL_TYPE" = "HunyuanDiT" ]; then
# HunyuanDiT asserts sp_degree <= 2; otherwise the output will be incorrect.
PARALLEL_ARGS="--pipefusion_parallel_degree 1 --ulysses_degree 2 --ring_degree 1"
export CFG_ARGS="--use_cfg_parallel"
else
# On 8 gpus, pp=2, ulysses=2, ring=1, cfg_parallel=2 (split batch)
PARALLEL_ARGS="--pipefusion_parallel_degree 2 --ulysses_degree 2 --ring_degree 1"
export CFG_ARGS="--use_cfg_parallel"
fi

# By default, num_pipeline_patch = pipefusion_degree, and you can tune this parameter to achieve optimal performance.
# PIPEFUSION_ARGS="--num_pipeline_patch 8 "

# For high-resolution images, we use the latent output type to avoid running the VAE module. Used for measuring speed.
# OUTPUT_ARGS="--output_type latent"

# PARALLEL_VAE="--use_parallel_vae"

# Another compile option is `--use_onediff` which will use onediff's compiler.
# COMPILE_FLAG="--use_torch_compile"

torchrun --nproc_per_node=$N_GPUS ./examples/$SCRIPT \
--model $MODEL_ID \
$PARALLEL_ARGS \
$TASK_ARGS \
$PIPEFUSION_ARGS \
$OUTPUT_ARGS \
--num_inference_steps $INFERENCE_STEP \
--seed 1 \
--warmup_steps 0 \
--prompt "a female character with long, flowing hair that appears to be made of ethereal, swirling patterns resembling the Northern Lights or Aurora Borealis. The background is dominated by deep blues and purples, creating a mysterious and dramatic atmosphere. The character's face is serene, with pale skin and striking features. She wears a dark-colored outfit with subtle patterns. The overall style of the artwork is reminiscent of fantasy or supernatural genres." \
$CFG_ARGS \
$PARALLEL_VAE \
$COMPILE_FLAG
# The seed is fixed with manual_seed in flux_example.py.

done
done
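Regarding the seed comment above, here is a minimal sketch of how the seed is typically fixed in a diffusers-style example; the pipeline object and call signature are assumptions, and the exact wiring in flux_example.py may differ:

import torch

# Fix the sampler seed so that repeated runs at the same parallel degree
# reproduce the same image (as verified in the problem description).
generator = torch.Generator(device="cuda").manual_seed(1)

# Hypothetical diffusers-style call, assuming a FluxPipeline instance `pipe`:
# image = pipe(prompt, num_inference_steps=20, generator=generator).images[0]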
lixiang007666 commented 6 days ago

From @Eigensystem: the discrepancy may be caused by kernel selection.

feifeibear commented 5 days ago

From @Eigensystem: the discrepancy may be caused by kernel selection.

cuDNN automatically selects the best algorithm based on the input shapes and dtypes. Different parallel degrees lead to different kernels being used, which could explain the differences in the generated images?

I think we could:

  1. Make sure cuDNN selects its algorithms deterministically on every run (this may not help): torch.backends.cudnn.deterministic = True and torch.backends.cudnn.benchmark = False (see the sketch after this list).

  2. Compare CPU runs at different parallel degrees; this may require running xDiT with the gloo backend.
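A minimal sketch of points 1 and 2; the CUBLAS_WORKSPACE_CONFIG setting and the exact place where xDiT initializes its process group are assumptions:

import os
import torch
import torch.distributed as dist

# 1. Pin cuDNN to deterministic algorithms and disable autotuning, so each
#    run selects the same kernels (may not remove cross-degree differences).
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
# Stricter option: fail on any op without a deterministic implementation.
# os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
# torch.use_deterministic_algorithms(True)

# 2. To compare CPU results across parallel degrees, initialize the process
#    group with the gloo backend instead of nccl (xDiT's own setup code may
#    choose the backend elsewhere).
# dist.init_process_group(backend="gloo")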