Closed nhw649 closed 6 months ago
CLIP encodes text with a fixed length. token_prefix_pos and token_suffix_pos represent the fixed embeddings before and after the learnable text embedding.
The 'X X X... X' is only a placeholder that holds the positions of the learnable text embedding, so that the full prompt can be fed into the text encoder to get the whole text embedding. After that, the corresponding positions of the whole text embedding are replaced with the learnable text embedding.
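To make the placeholder idea concrete, here is a minimal toy sketch in plain Python. It is not the repository's code: the names `embed`, `build_prompt`, `N_CTX`, and the tiny sizes are all hypothetical, and it assumes that only the start token sits before the placeholders, so the prefix is that single embedding and the suffix is everything after the 'X' positions (class name, period, end token, padding).

```python
import random

CONTEXT_LENGTH = 8  # toy value; CLIP's text encoder uses a fixed length of 77
N_CTX = 3           # number of learnable context tokens (hypothetical)
DIM = 4             # toy embedding width (hypothetical)


def embed(token):
    """Hypothetical frozen token embedding: a deterministic toy vector."""
    rng = random.Random(token)
    return [rng.random() for _ in range(DIM)]


def build_prompt(class_name):
    """Build a fixed-length token list with 'X' placeholders for the context."""
    tokens = ["<sos>"] + ["X"] * N_CTX + [class_name, ".", "<eos>"]
    tokens += ["<pad>"] * (CONTEXT_LENGTH - len(tokens))
    return tokens


tokens = build_prompt("object")
embedding = [embed(t) for t in tokens]  # whole text embedding, placeholders included

# Fixed parts: everything before and after the placeholder positions.
prefix = embedding[:1]           # assumed: only the start token
suffix = embedding[1 + N_CTX:]   # class name, '.', end token, padding

# Learnable context vectors (zeros here; a real model would train these).
ctx = [[0.0] * DIM for _ in range(N_CTX)]

# The placeholder embeddings are discarded and replaced by the learnable ones.
full = prefix + ctx + suffix
print(len(full), tokens)
```

Under these assumptions the 'X' tokens never reach the final embedding: they only reserve positions so that the prefix and suffix land at the right indices within the fixed length.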
ok, thanks.
I can't understand the following code.
I think the positive prompt should be ['X X X X X X X X X X X X object.'], so the prefix should be 'X X X X X X X X X X X X ' and the suffix should be '.'. Is my understanding wrong? Could you help me with this?