xhinker / sd_embed

Generate long weighted prompt embeddings for Stable Diffusion
Apache License 2.0

Does not work for long prompt (SD 1.5) #22

Closed larinius closed 1 month ago

larinius commented 1 month ago

I always get an error when a prompt is longer than 77 tokens. Prompts within that limit work fine (same as Compel). It seems this library doesn't do anything different from Compel, which has a 77-token limit; the results are exactly the same.

Maybe I am doing something wrong, but I copy-pasted it exactly as in the documentation.

prompt_embeds, neg_prompt_embeds = get_weighted_text_embeddings_sd15(pipe=pipe, prompt=prompt, neg_prompt=negative_prompt)

image = pipe(
    prompt_embeds=prompt_embeds,
    negative_embeds=negative_embeds,
    num_inference_steps=num_inference_steps,
    guidance_scale=guidance_scale,
    width=(width - (width % 8)),
    height=(height - (height % 8)),
    output_type="pil",
    generator=generator(i),
)

Token indices sequence length is longer than the specified maximum sequence length for this model (113 > 77). Running this sequence through the model will result in indexing errors
Error during image generation: The size of tensor a (115) must match the size of tensor b (77) at non-singleton dimension 1
Failed to generate image: Error during image generation: The size of tensor a (115) must match the size of tensor b (77) at non-singleton dimension 1
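One reading of the numbers in this log, offered as an assumption rather than a confirmed diagnosis: if the 113 reported by the tokenizer includes the begin and end tokens, and the long prompt is split into chunks of up to 75 prompt tokens that each get their own begin/end pair (with the last chunk left unpadded), the resulting embedding is 115 positions long, which matches "tensor a". The 77 of "tensor b" is the length of a single standard CLIP window. The helper below is hypothetical (it is not part of sd_embed or diffusers) and only reproduces that arithmetic.

```python
def chunked_embedding_length(tokenizer_count: int, body_per_chunk: int = 75) -> int:
    """Sequence length after splitting a prompt into CLIP-sized chunks.

    Assumptions (not confirmed by the thread):
      * tokenizer_count already includes one begin and one end token;
      * each chunk holds up to `body_per_chunk` prompt tokens plus its
        own begin/end pair;
      * the last chunk is not padded up to the full window.
    """
    body = max(tokenizer_count - 2, 0)
    # Size of each chunk's prompt-token payload; an empty prompt still
    # yields one chunk containing only the begin/end pair.
    chunk_bodies = [
        min(body_per_chunk, body - start)
        for start in range(0, body, body_per_chunk)
    ] or [0]
    return sum(size + 2 for size in chunk_bodies)


print(chunked_embedding_length(113))  # 115, the "tensor a" size in the log
print(chunked_embedding_length(77))   # 77, a single full window
```

If this reading is right, the positive embedding came from the long-prompt path while whatever supplied the 77-long tensor did not, so the two could not be combined.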

xhinker commented 1 month ago

I tested the sample code again and it works. Could you post your full code?

sd_embed is completely different from Compel, so you should get a different result, if not a better one.

Here is the code I tested:

import gc
import torch
from diffusers import StableDiffusionPipeline
from sd_embed.embedding_funcs import get_weighted_text_embeddings_sd15

model_path = "stablediffusionapi/deliberate-v2"
pipe = StableDiffusionPipeline.from_pretrained(
    model_path,
    torch_dtype=torch.float16
)

pipe.to('cuda')

prompt = """A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus. 
This imaginative creature features the distinctive, bulky body of a hippo, 
but with a texture and appearance resembling a golden-brown, crispy waffle. 
The creature might have elements like waffle squares across its skin and a syrup-like sheen. 
It's set in a surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting, 
possibly including oversized utensils or plates in the background. 
The image should evoke a sense of playful absurdity and culinary fantasy.
"""

neg_prompt = """\
skin spots,acnes,skin blemishes,age spot,(ugly:1.2),(duplicate:1.2),(morbid:1.21),(mutilated:1.2),\
(tranny:1.2),mutated hands,(poorly drawn hands:1.5),blurry,(bad anatomy:1.2),(bad proportions:1.3),\
extra limbs,(disfigured:1.2),(missing arms:1.2),(extra legs:1.2),(fused fingers:1.5),\
(too many fingers:1.5),(unclear eyes:1.2),lowers,bad hands,missing fingers,extra digit,\
bad hands,missing fingers,(extra arms and legs),(worst quality:2),(low quality:2),\
(normal quality:2),lowres,((monochrome)),((grayscale))
"""

(
    prompt_embeds
    , prompt_neg_embeds
) = get_weighted_text_embeddings_sd15(
    pipe
    , prompt = prompt
    , neg_prompt = neg_prompt
)

image = pipe(
    prompt_embeds                   = prompt_embeds
    , negative_prompt_embeds        = prompt_neg_embeds
    , num_inference_steps           = 30
    , height                        = 768
    , width                         = 896
    , guidance_scale                = 8.0
    , generator                     = torch.Generator("cuda").manual_seed(2)
).images[0]
display(image)  # notebook-only; in a plain script use image.save("output.png")

del prompt_embeds, prompt_neg_embeds
pipe.to('cpu')
gc.collect()
torch.cuda.empty_cache()
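A difference worth checking between the two snippets in this thread: the working code passes both values returned by `get_weighted_text_embeddings_sd15` to the pipeline via `prompt_embeds` and `negative_prompt_embeds`, whereas the reported snippet assigns the result to `neg_prompt_embeds` but passes a variable named `negative_embeds`. If that second variable holds embeddings produced elsewhere, their token lengths may differ. The guard below is a minimal sketch of a pre-flight check; `assert_matching_sequence_length` is a hypothetical helper, and the stand-in objects only mimic the `.shape` attribute of a tensor.

```python
from types import SimpleNamespace


def assert_matching_sequence_length(prompt_embeds, negative_prompt_embeds):
    """Raise early if the two embeddings differ in token (dim 1) length.

    Hypothetical helper: accepts any object exposing a `.shape` tuple laid
    out as (batch, tokens, hidden), such as a torch tensor.
    """
    positive_len = prompt_embeds.shape[1]
    negative_len = negative_prompt_embeds.shape[1]
    if positive_len != negative_len:
        raise ValueError(
            f"prompt_embeds has {positive_len} tokens but "
            f"negative_prompt_embeds has {negative_len}; "
            "generate both with the same function call"
        )
    return positive_len


# Stand-ins for tensors so the sketch runs without a GPU or model download.
long_prompt = SimpleNamespace(shape=(1, 115, 768))
short_negative = SimpleNamespace(shape=(1, 77, 768))

try:
    assert_matching_sequence_length(long_prompt, short_negative)
except ValueError as err:
    print(err)
```

Calling such a check right before `pipe(...)` surfaces the mismatch with a readable message instead of a tensor-size error from deep inside the model.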