add device argument to all methods

https://github.com/xhinker/sd_embed/issues/2

Users can specify a device for tensor creation, which allows these methods to work with pipelines offloaded by accelerate.

https://github.com/Teriks/sd_embed/blob/main/src/sd_embed/embedding_funcs.py

I have attempted to preserve the code formatting as well as possible, though my code editor really wants to modify it. I believe it has removed extraneous whitespace :)

When a device is not manually specified, the functions fall back to the old behavior of using pipe.device.
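The fallback described above can be pictured with a small sketch. The helper name `resolve_device` and the stand-in pipeline object are hypothetical and not part of sd_embed; only the behavior (use the given device, otherwise `pipe.device`) comes from the description.

```python
from types import SimpleNamespace


def resolve_device(pipe, device=None):
    """Return the explicit device if given, else fall back to pipe.device.

    Hypothetical helper for illustration; not the actual sd_embed code.
    """
    return pipe.device if device is None else device


# Stand-in for a diffusers pipeline (assumption); a real pipeline also
# exposes a `.device` attribute.
fake_pipe = SimpleNamespace(device="cpu")

print(resolve_device(fake_pipe))          # old behavior: uses pipe.device
print(resolve_device(fake_pipe, "cuda"))  # an explicit device takes priority
```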

import gc
import torch
from diffusers import StableDiffusionXLPipeline
from sd_embed.embedding_funcs import get_weighted_text_embeddings_sdxl

model_path = "Lykon/dreamshaper-xl-1-0"
pipe = StableDiffusionXLPipeline.from_pretrained(
    model_path,
    torch_dtype=torch.float16
)

pipe.enable_sequential_cpu_offload(device='cuda')

prompt = """A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus. 
This imaginative creature features the distinctive, bulky body of a hippo, 
but with a texture and appearance resembling a golden-brown, crispy waffle. 
The creature might have elements like waffle squares across its skin and a syrup-like sheen. 
It's set in a surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting, 
possibly including oversized utensils or plates in the background. 
The image should evoke a sense of playful absurdity and culinary fantasy.
"""

neg_prompt = """\
skin spots,acnes,skin blemishes,age spot,(ugly:1.2),(duplicate:1.2),(morbid:1.21),(mutilated:1.2),\
(tranny:1.2),mutated hands,(poorly drawn hands:1.5),blurry,(bad anatomy:1.2),(bad proportions:1.3),\
extra limbs,(disfigured:1.2),(missing arms:1.2),(extra legs:1.2),(fused fingers:1.5),\
(too many fingers:1.5),(unclear eyes:1.2),lowers,bad hands,missing fingers,extra digit,\
bad hands,missing fingers,(extra arms and legs),(worst quality:2),(low quality:2),\
(normal quality:2),lowres,((monochrome)),((grayscale))
"""

(
    prompt_embeds
    , prompt_neg_embeds
    , pooled_prompt_embeds
    , negative_pooled_prompt_embeds
) = get_weighted_text_embeddings_sdxl(
    pipe
    , prompt = prompt
    , neg_prompt = neg_prompt
    , device = 'cuda'
)

image = pipe(
    prompt_embeds                   = prompt_embeds
    , negative_prompt_embeds        = prompt_neg_embeds
    , pooled_prompt_embeds          = pooled_prompt_embeds
    , negative_pooled_prompt_embeds = negative_pooled_prompt_embeds
    , num_inference_steps           = 30
    , height                        = 1024 
    , width                         = 1024
    , guidance_scale                = 4.0
    , generator                     = torch.Generator("cuda").manual_seed(2)
).images[0]

image.save('test.png')

del prompt_embeds, prompt_neg_embeds, pooled_prompt_embeds, negative_pooled_prompt_embeds
gc.collect()
torch.cuda.empty_cache()

Caveat:

Model offloading sometimes causes black output with this change when using stabilityai/stable-diffusion-xl-base-1.0, but not with other SDXL models I am testing. Very peculiar.

This happens particularly with the fp16 model variant, with dtype = float16.

Perhaps there is more to do, or this is a bug elsewhere.

Thanks for the PR! I pulled your code and found that the change causes CUDA to run out of memory for a 1024 x 1536 image, while the same code runs fine without pipe.enable_sequential_cpu_offload. Could you help take a look?

BTW, enable_sequential_cpu_offload() is not a very efficient option: it slows down generation while not saving much VRAM.

Instead, if we use pipe.enable_model_cpu_offload(gpu_id=0, device='cuda'), the current code already supports it :) You may want to give it a try.

I can investigate large images further; I noticed this as well.

On my hardware (1080 Tis), enable_model_cpu_offload cannot run SD3 with all text encoders without an immediate CUDA OOM, which is why I am interested in this feature :)

It is a bit difficult to test, as my hardware cannot reproduce the working condition you have: such a large image with enable_model_cpu_offload only.

Sequential offloading does run rather slowly, but it allows running the model on lesser / older GPUs.

There might be memory optimizations that can be done, such as cleaning up intermediate tensors that no longer need to be on the GPU as soon as possible.

Running these methods within a torch.no_grad() context is one optimization, which I had missed. In the edited state, they do not work on my hardware without being inside no_grad.

Perhaps these methods should use the no_grad decorator; it might improve their memory consumption a bit.
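As a rough illustration of the decorator idea, here is a minimal sketch. The `encode` function and the tiny `torch.nn.Linear` stand-in for a text encoder are assumptions for demonstration, not sd_embed code; `torch.no_grad()` used as a decorator is standard PyTorch.

```python
import torch


@torch.no_grad()
def encode(encoder, tokens):
    """Run the encoder without recording an autograd graph."""
    return encoder(tokens)


# Tiny stand-in for a text encoder (assumption for illustration only).
encoder = torch.nn.Linear(8, 4)
tokens = torch.randn(2, 8)

with_grad = encoder(tokens)             # records a graph for backprop
without_grad = encode(encoder, tokens)  # no graph, so less memory is held

print(with_grad.requires_grad, without_grad.requires_grad)
```

Since the embeddings are only used for inference, skipping the autograd graph should not change the values, only what is kept alive in memory.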

Without it, creating the embeddings fails on my hardware when all of the tensors are created on the GPU in the method, due to the memory consumption.

There might be an opportunity to move intermediate tensors back to the CPU in the middle of the calculations, but I'll need to look a little closer.

I have optimized the lifetime of the GPU tensors involved in generating the embeds for all functions. This involves a combination of using torch.no_grad, moving tensors back to the CPU after calculation where possible, destructing / dereferencing tensors (and objects that refer to tensors), and clearing the CUDA cache.

The memory management occurs throughout the functions as well as near the end, in order to keep the overall VRAM usage low during embedding generation.

I think this should help clear out anything hanging around on the GPU that might cause the pipeline to OOM after using these functions.
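The general pattern described here might look roughly like the sketch below. The function name `embed_then_release` and the stand-in encoder are hypothetical, and the encoder is assumed to already be usable on the chosen device; the actual changes in the fork are spread throughout the embedding functions and may differ.

```python
import gc

import torch


@torch.no_grad()
def embed_then_release(encoder, tokens, device="cpu"):
    """Compute on `device`, keep only a CPU copy, then free what is left."""
    tokens_dev = tokens.to(device)
    hidden = encoder(tokens_dev)
    result = hidden.to("cpu")  # the caller only receives a CPU tensor

    # Drop references to device-side tensors before clearing the cache.
    del tokens_dev, hidden
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached, unused blocks to the driver
    return result


# Stand-in encoder on the CPU so the sketch runs without a GPU (assumption).
encoder = torch.nn.Linear(8, 4)
out = embed_then_release(encoder, torch.randn(2, 8), device="cpu")
print(out.device, tuple(out.shape))
```

Note that `torch.cuda.empty_cache()` only releases memory that no live tensor refers to, which is why the `del` and `gc.collect()` calls come first.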

I will test this over the course of this week.

I am attempting to vendor this code (with citation) and integrate it into my command line tool. This is exactly what I was looking for, so I hope it gets to PyPI eventually :)

Additionally, the black images I had mentioned experiencing with Lykon/dreamshaper-xl-1-0 (variant fp16, dtype float16) are fixed by using the VAE madebyollin/sdxl-vae-fp16-fix. So, thankfully, the black images are unrelated to the code in this repository; I believe this is a known issue elsewhere.
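For reference, swapping in that VAE could be sketched as follows. The wrapper function `load_sdxl_with_fp16_vae` is a hypothetical name; the `AutoencoderKL.from_pretrained` call and the `vae=` argument are the usual diffusers way to replace a pipeline's VAE. The imports are done lazily only so the function can be defined without loading the heavy dependencies.

```python
def load_sdxl_with_fp16_vae(model_path, vae_path="madebyollin/sdxl-vae-fp16-fix"):
    """Load an SDXL pipeline in float16 with a replacement VAE."""
    # Lazy imports: nothing heavy is loaded until the function is called.
    import torch
    from diffusers import AutoencoderKL, StableDiffusionXLPipeline

    vae = AutoencoderKL.from_pretrained(vae_path, torch_dtype=torch.float16)
    return StableDiffusionXLPipeline.from_pretrained(
        model_path,
        vae=vae,  # replaces the VAE bundled with the checkpoint
        torch_dtype=torch.float16,
    )
```

Calling `load_sdxl_with_fp16_vae("Lykon/dreamshaper-xl-1-0")` would download both sets of weights, after which offloading can be enabled on the returned pipeline as in the earlier example.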

I am now able to create a large 1024x1536 image with SD3 + T5 encoder on my 1080 Ti using the changes I have made.

However, I cannot test whether the output differs from the original sd_embed code without enable_sequential_cpu_offload, because I do not have enough VRAM to run it that way with the T5 encoder present, or with enable_model_cpu_offload.

I am only able to test the original code locally with no offloading by disabling the T5 encoder, and the output appears identical in those tests.

When you have time, could you generate an image with the original sd_embed code from this repository, without sequential offload, and an image with the code from my fork, using sequential offload?

Checking those two images for output differences would help test the changes.

If the output is the same and there are no issues running on your hardware with no offloading, with enable_model_cpu_offload, and with enable_sequential_cpu_offload, then I think this is okay to merge in its current state, if you are interested.
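One way to check two generated images for differences is a per-pixel comparison. This is only a sketch using Pillow; the helper name `max_channel_diff` and the file names mentioned in the comments are hypothetical.

```python
from PIL import Image, ImageChops


def max_channel_diff(image_a, image_b):
    """Return the largest per-channel absolute difference (0 means identical)."""
    a = image_a.convert("RGB")
    b = image_b.convert("RGB")
    if a.size != b.size:
        raise ValueError(f"size mismatch: {a.size} vs {b.size}")
    diff = ImageChops.difference(a, b)
    # For RGB images, getextrema() yields one (min, max) pair per band.
    return max(band_max for _, band_max in diff.getextrema())


# Synthetic images keep the sketch runnable; in practice these would be
# Image.open("original.png") and Image.open("fork.png") (hypothetical names).
reference = Image.new("RGB", (4, 4), (10, 20, 30))
candidate = reference.copy()
candidate.putpixel((0, 0), (10, 25, 30))

print(max_channel_diff(reference, reference))
print(max_channel_diff(reference, candidate))
```

A result of 0 means the images match exactly. A small nonzero value may still be acceptable, since float16 results can vary slightly between execution paths.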

Here is the test code I am using to generate a large image using my fork, with SD3.

import gc
import torch
from diffusers import StableDiffusion3Pipeline
from sd_embed.embedding_funcs import get_weighted_text_embeddings_sd3

device = 'cuda'

model_path = "stabilityai/stable-diffusion-3-medium-diffusers"
pipe = StableDiffusion3Pipeline.from_pretrained(
    model_path,
    torch_dtype=torch.float16
)

pipe.enable_sequential_cpu_offload(device=device)

prompt = """A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus. 
This imaginative creature features the distinctive, bulky body of a hippo, 
but with a texture and appearance resembling a golden-brown, crispy waffle. 
The creature might have elements like waffle squares across its skin and a syrup-like sheen. 
It's set in a surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting, 
possibly including oversized utensils or plates in the background. 
The image should evoke a sense of playful absurdity and culinary fantasy.
"""

neg_prompt = """\
skin spots,acnes,skin blemishes,age spot,(ugly:1.2),(duplicate:1.2),(morbid:1.21),(mutilated:1.2),\
(tranny:1.2),mutated hands,(poorly drawn hands:1.5),blurry,(bad anatomy:1.2),(bad proportions:1.3),\
extra limbs,(disfigured:1.2),(missing arms:1.2),(extra legs:1.2),(fused fingers:1.5),\
(too many fingers:1.5),(unclear eyes:1.2),lowers,bad hands,missing fingers,extra digit,\
bad hands,missing fingers,(extra arms and legs),(worst quality:2),(low quality:2),\
(normal quality:2),lowres,((monochrome)),((grayscale))
"""

(
    prompt_embeds
    , prompt_neg_embeds
    , pooled_prompt_embeds
    , negative_pooled_prompt_embeds
) = get_weighted_text_embeddings_sd3(
    pipe
    , prompt = prompt
    , neg_prompt = neg_prompt
    , device = device
)

image = pipe(
    prompt_embeds                   = prompt_embeds
    , negative_prompt_embeds        = prompt_neg_embeds
    , pooled_prompt_embeds          = pooled_prompt_embeds
    , negative_pooled_prompt_embeds = negative_pooled_prompt_embeds
    , num_inference_steps           = 30
    , height                        = 1024
    , width                         = 1024 + 512
    , guidance_scale                = 4.0
    , generator                     = torch.Generator(device).manual_seed(2)
).images[0]

image.save('test-image.png')

del prompt_embeds, prompt_neg_embeds, pooled_prompt_embeds, negative_pooled_prompt_embeds
gc.collect()
torch.cuda.empty_cache()

Which produces this image:

[image: test-image]

xhinker / sd_embed

add device argument to all methods #3