Teriks opened 5 months ago
I believe adding a `device` argument to the methods that build the embeddings, with which the user can specify the device the tensors are created on, would fix this, instead of relying on `pipe.device` to determine the device.

Perhaps an argument where `device=None` defaults to the device of the pipeline (`pipe.device`), and when specified otherwise, the user-specified device is used.
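A rough sketch of what I mean; the helper name here is made up for illustration and is not part of the existing sd_embed API:

```python
import torch

def _resolve_device(pipe, device=None):
    # An explicit device wins; otherwise fall back to the pipeline's device.
    return torch.device(device) if device is not None else pipe.device

# Inside the embedding builders, tensors would then be created on the resolved
# device rather than implicitly on pipe.device:
#   target = _resolve_device(pipe, device)
#   token_ids = torch.tensor([ids], dtype=torch.long, device=target)
```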
Unfortunately, I do not think I have enough GPU memory to fully test SD3 using this method, but I can modify this code and see how it works for the other methods.
This is working for SD3 for me with the changes mentioned.
Model offloading sometimes causes black output with this change when using `stabilityai/stable-diffusion-xl-base-1.0`, but not with other SDXL models I am testing. Very peculiar. It happens particularly with the `fp16` model variant, with `dtype=float16`.

Perhaps there is more to do, or this is a bug elsewhere.
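For context, the configuration that shows the black output for me is loaded roughly like this (standard diffusers loading, shown here as an assumption of my exact setup):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    variant="fp16",               # fp16 weight variant
    torch_dtype=torch.float16,    # half-precision dtype
)
pipe.enable_model_cpu_offload()   # the offloading mode that triggers the issue
```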
Hi @Teriks, see my comments here: https://github.com/xhinker/sd_embed/pull/3
Thank you!
Hi @Teriks, I have added `pipe.enable_model_cpu_offload()` support for Flux1.
Oh I see. A device argument is probably needed elsewhere too, wherever a CLIP encoder is involved, since those need to be on a GPU. I will see what happens with CPU generation later.
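Roughly, the situation I have in mind looks like this (just a sketch of the idea using the SDXL helper as an example; the exact signature and return values are assumptions, not verified against sd_embed):

```python
# Sketch only: temporarily place the CLIP text encoders on the GPU while the
# embeddings are built, then return them to the CPU.
pipe.text_encoder.to("cuda")
pipe.text_encoder_2.to("cuda")
embeddings = get_weighted_text_embeddings_sdxl(pipe, prompt=prompt)
pipe.text_encoder.to("cpu")
pipe.text_encoder_2.to("cpu")
```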
I figured out how to manage the VRAM usage, and Flux1 with int8 weight-only quantization and bfloat16 now uses only 14 GB of VRAM with `sd_embed`. You may want to give it a try:
```python
#%%
from diffusers import DiffusionPipeline, FluxTransformer2DModel
from torchao.quantization import quantize_, int8_weight_only
import torch

from sd_embed.embedding_funcs import get_weighted_text_embeddings_flux1

# model_path = "black-forest-labs/FLUX.1-schnell"
model_path = "/home/andrewzhu/storage_14t_5/ai_models_all/sd_hf_models/black-forest-labs/FLUX.1-dev_main"

# Load the transformer separately and quantize its weights to int8 to cut VRAM.
transformer = FluxTransformer2DModel.from_pretrained(
    model_path
    , subfolder = "transformer"
    , torch_dtype = torch.bfloat16
)
quantize_(transformer, int8_weight_only())

pipe = DiffusionPipeline.from_pretrained(
    model_path
    , transformer = transformer
    , torch_dtype = torch.bfloat16
)

# Keep idle submodules on the CPU and move them to the GPU only when needed.
pipe.enable_model_cpu_offload()

#%%
prompt = """\
A dreamy, soft-focus photograph capturing a romantic Jane Austen movie scene,
in the style of Agnes Cecile. Delicate watercolors, misty background,
Regency-era couple, tender embrace, period clothing, flowing dress, dappled sunlight,
ethereal glow, gentle expressions, intricate lace, muted pastels, serene countryside,
timeless romance, poetic atmosphere, wistful mood, look at camera.
"""

prompt_embeds, pooled_prompt_embeds = get_weighted_text_embeddings_flux1(
    pipe = pipe
    , prompt = prompt
)

image = pipe(
    prompt_embeds = prompt_embeds
    , pooled_prompt_embeds = pooled_prompt_embeds
    , width = 896
    , height = 1280
    , num_inference_steps = 20
    , guidance_scale = 4.0
    , generator = torch.Generator().manual_seed(1234)
).images[0]
display(image)  # assumes a notebook environment
```
This model is too large for my hardware in this configuration (1080 Ti, 11 GB), but that is okay since I have been able to run it using sequential offload, which is good enough for testing. I really need new hardware :)
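For anyone else on a smaller card, it is a one-line swap in the script above; only the offload call changes:

```python
# Sequential offload moves one submodule at a time, so generation is much
# slower, but peak VRAM stays low enough for an 11 GB card in my testing.
pipe.enable_sequential_cpu_offload()  # instead of pipe.enable_model_cpu_offload()
```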
One thing I noticed about the code: it might be good to have a user-specified fallback device for the text encoders when `device=cpu`, in case someone has MPS or another backend available.

Torch may transparently map the hard-coded `'cuda'` / `'cuda:1'` strings to another available accelerator for compatibility, but I am not sure. If CUDA is not available but some other accelerator is, it might fail unnecessarily.

Maybe a module-level global would do, or another parameter.
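Something along these lines is what I had in mind; the global name and the detection order are made up for illustration and are not existing sd_embed code:

```python
import torch

# Hypothetical module-level override; a user could set this to "mps", "cuda:1", etc.
FALLBACK_DEVICE = None

def _encoder_device():
    """Pick where the text encoders should run when no explicit device is given."""
    if FALLBACK_DEVICE is not None:
        return torch.device(FALLBACK_DEVICE)
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```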
Is there a way to support pipelines with CPU offloading enabled? The current code seems unable to handle this condition.
Result: