zer0int / CLIP-txt2img-diffusers-scripts

Example scripts for using [my] fine-tuned CLIP models with HuggingFace 🤗

Unable to Generate Text in Certain Cases? #1

Open codyshen0000 opened 1 month ago

codyshen0000 commented 1 month ago

Hello, I have been using your text encoder following your script, but I noticed that in some cases, it still fails to generate text. How can I resolve this?

zer0int commented 1 month ago

Hi! My text encoder is not perfect - it is just better than the default CLIP-L. To improve the chances of correct text, you can:

  1. Prompt very clearly, e.g. a sign that says 'This is the text I want'. That is, state which object carries the text (via "a sign that says" or "with text ''") and set the text apart with ' quotation marks.
  2. CLIP's maximum input length is 77 tokens, while T5's token limit is 256 / 512. However, the effective number of tokens CLIP can attend to is reported to be about 20 (roughly 1 token = 1 word for many of CLIP's tokens). If your prompt is much longer, CLIP might get 'confused' and may not be able to guide accurate text. In that case, it may be better to send separate prompts to T5 and CLIP: send "<some-prompt-text>, the cat is holding a rugged wooden sign" to CLIP for style only, and send "<some-prompt-text>, the cat is holding a rugged wooden sign, the sign says 'my text here'" to T5. I.e., leave the text generation part to T5 entirely.
  3. Sampler: Experiment with samplers. I noticed that dpmpp_2s_ancestral (Flux-Dev FP16) produces better results than the standard Euler sampler. However, I observed this in ComfyUI, which has sophisticated methods for handling / weighting tokens 'in the background', so these results may not necessarily apply as-is to my basic command-line scripts.
  4. Try multiple random seeds. From what I've tried, even for complex (but not overly long) prompts, at least 1 in 5 images should have correct text (while the original CLIP-L is more on the order of 1 in 20).
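
As a rough illustration of points 2 and 4, here is a minimal sketch (not taken from the repository's scripts). The helper names `split_prompts` and `rough_word_count` are made up for this example, and the word count is only a crude stand-in for CLIP's real token count. The commented usage assumes an already loaded diffusers `FluxPipeline` called `pipe`, where `prompt` is sent to CLIP and `prompt_2` is sent to T5.

```python
def split_prompts(scene: str, sign_text: str) -> tuple[str, str]:
    """Build a style-only prompt for CLIP and a full prompt for T5.

    Hypothetical helper: CLIP only sees the scene description, while
    T5 additionally gets the literal text that should appear on the sign.
    """
    clip_prompt = scene.strip()
    t5_prompt = f"{clip_prompt}, the sign says '{sign_text}'"
    return clip_prompt, t5_prompt


def rough_word_count(prompt: str) -> int:
    """Crude proxy for CLIP token count (about 1 token per word)."""
    return len(prompt.split())


clip_prompt, t5_prompt = split_prompts(
    "a cat in a sunlit garden, the cat is holding a rugged wooden sign",
    "my text here",
)

# Warn if the CLIP prompt is well past the ~20 tokens mentioned above.
if rough_word_count(clip_prompt) > 20:
    print("CLIP prompt may be too long to guide the image well")

# Assumed usage with a loaded diffusers FluxPipeline named `pipe`:
# import torch
# for seed in range(5):  # point 4: try several seeds
#     generator = torch.Generator("cpu").manual_seed(seed)
#     image = pipe(prompt=clip_prompt, prompt_2=t5_prompt,
#                  generator=generator).images[0]
#     image.save(f"sign_seed{seed}.png")
```

Keeping the quoted sign text out of the CLIP prompt keeps that prompt short, and looping over a handful of seeds lets you pick the image where the text came out right.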

Hope that helps!

codyshen0000 commented 1 month ago

Thank you for your detailed suggestions! I have tried them, and they indeed solved most of my cases!