x-tabdeveloping / turftopic

Robust and fast topic models with sentence-transformers.
https://x-tabdeveloping.github.io/turftopic/
MIT License

Allow retrieval-specific embeddings in the `Encoder` API #23

Open x-tabdeveloping opened 6 months ago


Rationale:

Some embedding models, like E5, are trained so that retrieval queries and passages are prefixed differently (in other words, queries and passages are encoded with two different behaviours). This would be useful for Turftopic, since KeyNMF technically performs a retrieval task rather than clustering like the other models: we retrieve keywords using the document as a query, which is an asymmetric task.

Implementation:

Similarly to embedding benchmarks, we could implement an interface where encoders are allowed to have `encode_queries()` and `encode_passages()` methods. KeyNMF would then use `encode_queries()` to encode documents and `encode_passages()` to encode the vocabulary if both methods are present on the encoder; otherwise it would fall back to the normal `encode()` method.

Here's how it would look:

```python
from sentence_transformers import SentenceTransformer


class AsymmetricEncoder(SentenceTransformer):
    # encode() is inherited from SentenceTransformer

    def encode_queries(self, queries: list[str]):
        return self.encode([f"query: {query}" for query in queries])

    def encode_passages(self, passages: list[str]):
        return self.encode([f"passage: {passage}" for passage in passages])
```

### Then in KeyNMF:

```python
try:
    embeddings = self.encoder_.encode_queries(documents)
    vocab_embeddings = self.encoder_.encode_passages(vocab)
except AttributeError:
    embeddings = self.encoder_.encode(documents)
    vocab_embeddings = self.encoder_.encode(vocab)
```
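To illustrate the fallback behaviour without depending on a concrete model, here is a minimal sketch of the dispatch. The `PlainEncoder`, `AsymmetricEncoder`, and `embed_for_keynmf` names are hypothetical stand-ins, not part of the proposed API; the string "embeddings" just make the prefixing visible.

```python
class PlainEncoder:
    """Dummy encoder with only the standard encode() method."""

    def encode(self, texts: list[str]) -> list[str]:
        return [f"emb({text})" for text in texts]


class AsymmetricEncoder(PlainEncoder):
    """Dummy encoder that also exposes the retrieval-specific methods."""

    def encode_queries(self, queries: list[str]) -> list[str]:
        return self.encode([f"query: {q}" for q in queries])

    def encode_passages(self, passages: list[str]) -> list[str]:
        return self.encode([f"passage: {p}" for p in passages])


def embed_for_keynmf(encoder, documents: list[str], vocab: list[str]):
    """The dispatch from the proposal: prefer the asymmetric methods,
    fall back to plain encode() if the encoder doesn't have them."""
    try:
        doc_emb = encoder.encode_queries(documents)
        vocab_emb = encoder.encode_passages(vocab)
    except AttributeError:
        doc_emb = encoder.encode(documents)
        vocab_emb = encoder.encode(vocab)
    return doc_emb, vocab_emb


print(embed_for_keynmf(AsymmetricEncoder(), ["a doc"], ["keyword"]))
# (['emb(query: a doc)'], ['emb(passage: keyword)'])
print(embed_for_keynmf(PlainEncoder(), ["a doc"], ["keyword"]))
# (['emb(a doc)'], ['emb(keyword)'])
```

A `SentenceTransformer` subclass like the one above would take the first branch; any stock encoder would silently take the second, so existing user code keeps working unchanged.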