x-tabdeveloping / turftopic

Robust and fast topic models with sentence-transformers.
https://x-tabdeveloping.github.io/turftopic/
MIT License

E5Encoder, documentation dependencies #18

Closed: jankounchained closed this 6 months ago

jankounchained commented 7 months ago

Fixes:

#15

Two simple E5 encoders were implemented: E5Encoder and E5InstructionalEncoder. The only difference between them is the prefix they prepend to each document. The documentation has been updated.

#16

Doc dependencies listed in pyproject.toml

#17

Temporary fix: commented out L45 in mkdocs.yml (the custom_templates line). The site looks OK.
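
To illustrate the encoder fix above, here is a minimal sketch of the two prefixing styles. It is not the turftopic implementation: `add_plain_prefix` and `add_instruct_prefix` are hypothetical helpers, and the `"query: "` default is an assumption based on the convention in the E5 model cards. The instruction format matches the one used in the docstring example later in this thread.

```python
def add_plain_prefix(document: str, prefix: str = "query: ") -> str:
    # Plain E5 models expect a short fixed prefix in front of each text.
    # The default here is an assumption, not necessarily what turftopic uses.
    return f"{prefix}{document}"


def add_instruct_prefix(document: str, task_description: str) -> str:
    # Instruction-tuned E5 models expect a one-sentence task description
    # followed by the text itself.
    return f"Instruct: {task_description}\nQuery: {document}"


doc = "Topic models summarise large corpora."
print(add_plain_prefix(doc))
print(add_instruct_prefix(doc, "YOUR_INSTRUCTION"))
```

Either helper only changes the string passed to the underlying model, which is why the two encoders can otherwise share all their code.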

jankounchained commented 6 months ago

@x-tabdeveloping reworked the encoder, check it out

jankounchained commented 6 months ago

there's a new problem with documentation: newline characters break the code blocks ugh

what it should be:

    Examples
    --------
    Instructional models can also be used.
    In this case, the documents should be prefixed with a one-sentence instruction that describes the task.
    See Notes for available models and instruction suggestions.
```python
from turftopic import GMM
from turftopic.encoders import E5Encoder

def add_instruct_prefix(document: str) -> str:
    task_description = "YOUR_INSTRUCTION"
    return f'Instruct: {task_description}\nQuery: {document}'

encoder = E5Encoder(model_name="intfloat/multilingual-e5-large-instruct", preprocessor=add_instruct_prefix)
model = GMM(10, encoder=encoder)
```

Or the same can be done using a `prefix` argument:
```python
from turftopic.encoders import E5Encoder
from turftopic import GMM

prefix = "Instruct: YOUR_INSTRUCTION\nQuery: "
encoder = E5Encoder(model_name="intfloat/multilingual-e5-large-instruct", prefix=prefix)
model = GMM(10, encoder=encoder)
```


what it is:
![Screenshot 2024-03-12 at 18 40 25](https://github.com/x-tabdeveloping/turftopic/assets/42962106/da0bbe57-aab8-4029-a721-cbcf947d27a9)
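
One likely cause, assuming the example lives in a regular (non-raw) Python docstring: the `\n` escape is turned into a real line break before the documentation tool ever sees it, so the example line is split in two. The sketch below uses two hypothetical functions to show the difference. A raw docstring (`r"""..."""`) or a doubled backslash keeps the escape as literal text.

```python
def regular_docstring():
    """Example:
    prefix = "Instruct: YOUR_INSTRUCTION\nQuery: "
    """


def raw_docstring():
    r"""Example:
    prefix = "Instruct: YOUR_INSTRUCTION\nQuery: "
    """


# In the regular docstring the escape has become an actual newline,
# so the example is spread over one more line than in the raw version.
print(regular_docstring.__doc__.splitlines())
print(raw_docstring.__doc__.splitlines())
```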

to be fixed

jankounchained commented 6 months ago

Site won't be rebuilt, code committed in another pull request