xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Apache License 2.0

Phrase embeddings in context #108

Open jnferfer opened 4 months ago

jnferfer commented 4 months ago

Hi,

I need to get the embeddings of a word or a phrase within a sentence. This sentence is the context of the word/phrase.

For example, I need a separate embedding for "big apple" in each of these two sentences:

I'm living in the Big Apple since 2012
I ate a big apple yesterday

When using model.encode() I can set the parameter output_value to token_embeddings to get token embeddings. However, I don't know how to properly map the output vectors to the target tokens corresponding to the big apple text. Is there a straightforward approach for this?

Thanks!

hongjin-su commented 2 months ago

You can first check how the sentences are tokenized, record the indices of the tokens for the desired words (e.g., big apple), and then select the token embeddings at those indices.
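As a rough illustration of the index-recording step, here is a minimal sketch in plain Python. The helper name find_phrase_spans and the hand-written token lists are hypothetical; a real tokenizer may add special tokens or split words into subword pieces, and any extra text in the model input (such as an instruction) may shift the positions, so the tokens should come from the same tokenizer and input that produced the token embeddings.

```python
def find_phrase_spans(tokens, phrase_tokens):
    """Return (start, end) index pairs where phrase_tokens occurs in tokens.

    end is exclusive, so tokens[start:end] is the matched phrase.
    Matching is case-insensitive.
    """
    sent = [t.lower() for t in tokens]
    phrase = [t.lower() for t in phrase_tokens]
    n = len(phrase)
    if n == 0:
        return []
    return [(i, i + n) for i in range(len(sent) - n + 1) if sent[i:i + n] == phrase]


# Hand-written token lists for illustration only (an assumption, not real
# tokenizer output).
tokens_a = ["I", "'m", "living", "in", "the", "Big", "Apple", "since", "2012"]
tokens_b = ["I", "ate", "a", "big", "apple", "yesterday"]
phrase = ["big", "apple"]

spans_a = find_phrase_spans(tokens_a, phrase)
spans_b = find_phrase_spans(tokens_b, phrase)
print(spans_a, spans_b)
```

Each returned span can then be used to slice the per-token output for that sentence, provided the token list and the embedding rows are aligned one to one.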

jnferfer commented 2 months ago

Thanks! Then, if I want to get a single embedding for "big apple", how should I proceed? I'm trying to get the average embedding of "big" and "apple", but I sometimes get odd results when comparing the average embedding against others.
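For reference, the averaging and comparison described here could look like the sketch below. It uses toy 3-dimensional vectors in place of real token embeddings, and the helper names mean_pool and cosine_similarity are hypothetical. It only shows the arithmetic and does not by itself explain the odd results mentioned above.

```python
import math


def mean_pool(token_vectors, start, end):
    """Average the token vectors in rows start..end-1 into one vector."""
    rows = token_vectors[start:end]
    if not rows:
        raise ValueError("empty span")
    dim = len(rows[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]


def cosine_similarity(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for one sentence's token embeddings (assumption:
# one row per token, in token order).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 4.0, 2.0],
]

# Average the rows for a phrase occupying token positions 1 and 2.
phrase_vec = mean_pool(toy_embeddings, 1, 3)
print(phrase_vec)
print(cosine_similarity(phrase_vec, toy_embeddings[2]))
```

With real token embeddings held in a tensor or array, the same pooling would typically be a slice followed by a mean over the token axis.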