xdit-project / xDiT

xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism

FLUX with SP: differences in generated images across parallel degrees #262

Closed lixiang007666 closed 2 days ago

lixiang007666 commented 2 months ago

Problem description

I tested with a fixed seed. To confirm the seed really is fixed, I first ran the multi-GPU script repeatedly and verified that the generated image is identical on every run.

Under this condition, the images generated with different numbers of GPUs:

[images] Generated results at ulysses degree 1 / 2 / 4 / 8:
flux_result_dp1_cfg1_ulysses1_ringNone_tp1_pp1_patchNone_0
flux_result_dp1_cfg1_ulysses2_ringNone_tp1_pp1_patchNone_0
flux_result_dp1_cfg1_ulysses4_ringNone_tp1_pp1_patchNone_0
flux_result_dp1_cfg1_ulysses8_ringNone_tp1_pp1_patchNone_0

As you can see, at 1024 the output is lossy, but with two GPUs the loss is relatively small (it also depends on the seed).

One observation is that the loss is larger at 512.

Reproduction script:

set -x

# export NCCL_PXN_DISABLE=1
# # export NCCL_DEBUG=INFO
# export NCCL_SOCKET_IFNAME=eth0
# export NCCL_IB_GID_INDEX=3
# export NCCL_IB_DISABLE=0
# export NCCL_NET_GDR_LEVEL=2
# export NCCL_IB_QPS_PER_CONNECTION=4
# export NCCL_IB_TC=160
# export NCCL_IB_TIMEOUT=22
# export NCCL_P2P=0
# export CUDA_DEVICE_MAX_CONNECTIONS=1

export PYTHONPATH=$PWD:$PYTHONPATH

# Select the model type
# Point MODEL_ID at a model downloaded to a specific location on disk,
# or simply use the model's ID on Hugging Face,
# in which case it will be downloaded to the default Hugging Face cache path.

export MODEL_TYPE="Flux"
# Configuration for different model types
# script, model_id, inference_step
declare -A MODEL_CONFIGS=(
    ["Pixart-alpha"]="pixartalpha_example.py /mnt/models/SD/PixArt-XL-2-1024-MS 20"
    ["Pixart-sigma"]="pixartsigma_example.py /cfs/dit/PixArt-Sigma-XL-2-2K-MS 20"
    ["Sd3"]="sd3_example.py /cfs/dit/stable-diffusion-3-medium-diffusers 20"
    ["Flux"]="flux_example.py black-forest-labs/FLUX.1-dev 20"
    ["HunyuanDiT"]="hunyuandit_example.py /mnt/models/SD/HunyuanDiT-v1.2-Diffusers 50"
    ["CogVideoX"]="cogvideox_example.py /cfs/dit/CogVideoX-2b 1"
)

if [[ -v MODEL_CONFIGS[$MODEL_TYPE] ]]; then
    IFS=' ' read -r SCRIPT MODEL_ID INFERENCE_STEP <<< "${MODEL_CONFIGS[$MODEL_TYPE]}"
    export SCRIPT MODEL_ID INFERENCE_STEP
else
    echo "Invalid MODEL_TYPE: $MODEL_TYPE"
    exit 1
fi

mkdir -p ./results

for HEIGHT in 1024
do
for N_GPUS in 8;
do 

# task args
if [ "$MODEL_TYPE" = "CogVideoX" ]; then
  TASK_ARGS="--height 480 --width 720 --num_frames 9"
else
  TASK_ARGS="--height $HEIGHT --width $HEIGHT --no_use_resolution_binning"
fi

# Flux only supports SP, do not set the pipefusion degree
if [ "$MODEL_TYPE" = "Flux" ] || [ "$MODEL_TYPE" = "CogVideoX" ]; then
PARALLEL_ARGS="--ulysses_degree $N_GPUS"
export CFG_ARGS=""
elif [ "$MODEL_TYPE" = "HunyuanDiT" ]; then
# HunyuanDiT asserts sp_degree <=2, or the output will be incorrect.
PARALLEL_ARGS="--pipefusion_parallel_degree 1 --ulysses_degree 2 --ring_degree 1"
export CFG_ARGS="--use_cfg_parallel"
else
# On 8 gpus, pp=2, ulysses=2, ring=1, cfg_parallel=2 (split batch)
PARALLEL_ARGS="--pipefusion_parallel_degree 2 --ulysses_degree 2 --ring_degree 1"
export CFG_ARGS="--use_cfg_parallel"
fi

# By default, num_pipeline_patch = pipefusion_degree, and you can tune this parameter to achieve optimal performance.
# PIPEFUSION_ARGS="--num_pipeline_patch 8 "

# For high-resolution images, we use the latent output type to avoid running the VAE module. Used for measuring speed.
# OUTPUT_ARGS="--output_type latent"

# PARALLEL_VAE="--use_parallel_vae"

# Another compile option is `--use_onediff` which will use onediff's compiler.
# COMPILE_FLAG="--use_torch_compile"

torchrun --nproc_per_node=$N_GPUS ./examples/$SCRIPT \
--model $MODEL_ID \
$PARALLEL_ARGS \
$TASK_ARGS \
$PIPEFUSION_ARGS \
$OUTPUT_ARGS \
--num_inference_steps $INFERENCE_STEP \
--seed 1 \
--warmup_steps 0 \
--prompt "a female character with long, flowing hair that appears to be made of ethereal, swirling patterns resembling the Northern Lights or Aurora Borealis. The background is dominated by deep blues and purples, creating a mysterious and dramatic atmosphere. The character's face is serene, with pale skin and striking features. She wears a dark-colored outfit with subtle patterns. The overall style of the artwork is reminiscent of fantasy or supernatural genres." \
$CFG_ARGS \
$PARALLEL_VAE \
$COMPILE_FLAG
# The seed is set via manual_seed inside flux_example.py.

done
done
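
For reference, a minimal sketch of the kind of seeding flux_example.py is assumed to perform (per the comment above; the actual code in the example script may differ):

import torch

# Fix the RNGs so repeated runs at the same GPU count reproduce the same image.
seed = 1
torch.manual_seed(seed)           # seeds the CPU RNG (and CUDA in recent PyTorch)
torch.cuda.manual_seed_all(seed)  # explicitly seed every CUDA device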
lixiang007666 commented 2 months ago

From @Eigensystem: the error may be caused by kernel selection.

feifeibear commented 2 months ago

From @Eigensystem: the error may be caused by kernel selection.

cuDNN automatically selects the optimal algorithm based on the input's shape and dtype. Different parallel degrees change the shapes each rank sees, so different kernels get picked, and hence the generated images differ?

I think we could:

  1. Make sure the cuDNN algorithm selection is deterministic across runs (this may not help): torch.backends.cudnn.deterministic = True and torch.backends.cudnn.benchmark = False.

  2. Compare the results at different parallel degrees on CPU. That probably means trying to run xDiT with the gloo backend (see the sketch after this list).
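
A minimal sketch of both suggestions, assuming the script is launched with torchrun (which supplies RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT in the environment); this is an illustration, not xDiT's actual initialization code:

import torch
import torch.distributed as dist

# (1) Pin cuDNN to deterministic algorithm selection; this disables the
#     shape-based autotuner and may cost some performance.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# (2) For a CPU comparison across parallel degrees, initialize the process
#     group with gloo instead of nccl.
dist.init_process_group(backend="gloo")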

fengchuanBIG commented 3 weeks ago

So Flux still can't be used flawlessly right now, and the generated images will have defects, right? OP, did you solve it?

feifeibear commented 3 weeks ago

So Flux still can't be used flawlessly right now, and the generated images will have defects, right? OP, did you solve it?

We looked into this issue; it should not be called an image defect. Even for a single attention operator, the parallel and non-parallel results differ. Parallel computation performs the adds and multiplies in a different order, so the numbers cannot be exactly identical. It is therefore expected that Flux images generated with USP parallelism are not bit-equivalent to single-GPU output. From the generated images we inspected, the parallel results are no worse than the original; both are correct.
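
To make the point concrete, a small standalone illustration (not xDiT code) of how merely changing the reduction order changes a floating-point sum:

import torch

torch.manual_seed(0)
x = torch.randn(1 << 20)  # ~1M float32 values

seq = x.sum()                                           # single-pass reduction order
par = torch.stack([c.sum() for c in x.chunk(8)]).sum()  # 8-way "parallel" order
print(f"diff = {(seq - par).item():.2e}")               # typically small but nonzero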

fengchuanBIG commented 3 weeks ago

OK, thanks for the reply. But it seems that running together with LoRA models is not supported yet; can that be fixed? Without LoRA it's still unusable for me.

feifeibear commented 3 weeks ago

OK, thanks for the reply. But it seems that running together with LoRA models is not supported yet; can that be fixed? Without LoRA it's still unusable for me.

That is easy to support. We found that most users apply LoRA through ComfyUI; if you look at our ComfyUI demo, it has supported LoRA for a long time already.

fengchuanBIG commented 3 weeks ago

OK, thanks for the reply. But it seems that running together with LoRA models is not supported yet; can that be fixed? Without LoRA it's still unusable for me.

That is easy to support. We found that most users apply LoRA through ComfyUI; if you look at our ComfyUI demo, it has supported LoRA for a long time already.

OK, thanks for the reply. I'll go try it right away.