Rationale:
Some embedding models, such as E5, are trained so that retrieval queries and passages receive different prefixes (in other words, queries and passages are encoded in two different ways).
This would be useful for Turftopic because KeyNMF technically performs a retrieval task rather than clustering like the other models do. (We retrieve keywords using the document as the query, so it is an asymmetric task.)
Implementation:
As embedding benchmarks do, we could define an interface in which encoders are allowed to have an encode_queries() and an encode_passages() method. KeyNMF would then use encode_queries() to encode documents and encode_passages() to encode the vocabulary if the encoder has both methods; otherwise it would fall back to the normal encode() method.
Here's how it would look:
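from sentence_transformers import SentenceTransformer

class AsymmetricEncoder(SentenceTransformer):
    # encode() is inherited from SentenceTransformer and can be overridden if needed.

    def encode_queries(self, queries: list[str]):
        # Prepend the E5-style query prefix before encoding
        return self.encode([f"query: {query}" for query in queries])

    def encode_passages(self, passages: list[str]):
        # Prepend the E5-style passage prefix before encoding
        return self.encode([f"passage: {passage}" for passage in passages])
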
### Then in KeyNMF:
try:
    embeddings = self.encoder_.encode_queries(documents)
    vocab_embeddings = self.encoder_.encode_passages(vocab)
except AttributeError:
    # The encoder lacks the query/passage methods: fall back to plain encode()
    embeddings = self.encoder_.encode(documents)
    vocab_embeddings = self.encoder_.encode(vocab)
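
An alternative to try/except would be to check for the two methods explicitly, which matches the "if both attributes are present" wording above. The following is only a sketch under assumptions: encode_documents_and_vocab, RecordingEncoder and PrefixingEncoder are hypothetical names that exist neither in Turftopic nor in sentence-transformers, and the stand-in encoders return the prefixed strings instead of real embeddings so the dispatch can be checked without downloading a model.

```python
from typing import Any, Sequence


def encode_documents_and_vocab(
    encoder: Any, documents: Sequence[str], vocab: Sequence[str]
) -> tuple[Any, Any]:
    """Encode documents as queries and vocabulary as passages when possible.

    Hypothetical helper: uses the asymmetric methods only if BOTH are
    present on the encoder, otherwise falls back to plain encode().
    """
    has_asymmetric_api = callable(
        getattr(encoder, "encode_queries", None)
    ) and callable(getattr(encoder, "encode_passages", None))
    if has_asymmetric_api:
        return (
            encoder.encode_queries(list(documents)),
            encoder.encode_passages(list(vocab)),
        )
    return encoder.encode(list(documents)), encoder.encode(list(vocab))


class RecordingEncoder:
    """Stand-in for a symmetric encoder: returns the texts it was given."""

    def encode(self, texts):
        return list(texts)


class PrefixingEncoder(RecordingEncoder):
    """Stand-in for an E5-style encoder with separate query/passage prefixes."""

    def encode_queries(self, queries):
        return self.encode([f"query: {q}" for q in queries])

    def encode_passages(self, passages):
        return self.encode([f"passage: {p}" for p in passages])
```

One reason to consider the explicit check: except AttributeError also catches an AttributeError raised inside encode_queries() or encode_passages() themselves, which would silently switch to the symmetric path and hide the real bug.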