x-tabdeveloping / turftopic

Robust and fast topic models with sentence-transformers.
https://x-tabdeveloping.github.io/turftopic/
MIT License

Implement multilingual KeyNMF #10

Closed x-tabdeveloping closed 4 months ago

x-tabdeveloping commented 4 months ago

Rationale:

By default, KeyNMF is not capable of multilingual topic modeling, because the model can only label a text with keywords that occur in that text. English keywords can therefore never be assigned to Spanish texts, so NMF is likely to treat the same subject in two languages as two different topics.

Solution:

We can fix this by allowing the keywords for each text to be selected from the whole vocabulary of the corpus, instead of just the words that occur in that text.

Interface:

KeyNMF could take one more parameter at initialisation that indicates whether the whole vocabulary should be used when extracting keywords. For example, something like this:

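from turftopic import KeyNMF

# Proposed parameter: select keywords from the whole corpus vocabulary
model = KeyNMF(10, keyword_scope="corpus")
# OR restrict keywords to the words in each document (the current behaviour)
model = KeyNMF(10, keyword_scope="document")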
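
To make the difference between the two scopes concrete, here is a minimal sketch in plain Python. It is not turftopic's implementation: the function `extract_keywords`, the helper `cosine`, and the hand-made 2-d vectors in `WORD_VECTORS` are all hypothetical stand-ins, with the vectors assumed to mimic what a multilingual encoder might produce (translations close together). Only the `"document"` / `"corpus"` scope names come from the proposal above.

```python
import math

# Made-up 2-d "embeddings" (assumption): translations sit close together,
# as one would hope from a multilingual sentence encoder.
WORD_VECTORS = {
    "dog": (1.0, 0.0),
    "perro": (0.95, 0.05),
    "market": (0.0, 1.0),
    "mercado": (0.05, 0.95),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def extract_keywords(doc_tokens, doc_vector, word_vectors, scope="document", top_n=2):
    """Rank candidate keywords by similarity to the document vector.

    scope="document": candidates are only the words occurring in the document.
    scope="corpus":   candidates are every word in the vocabulary.
    """
    if scope == "document":
        candidates = sorted(set(doc_tokens) & set(word_vectors))
    elif scope == "corpus":
        candidates = sorted(word_vectors)
    else:
        raise ValueError("scope must be 'document' or 'corpus'")
    ranked = sorted(
        candidates,
        key=lambda word: cosine(doc_vector, word_vectors[word]),
        reverse=True,
    )
    return ranked[:top_n]


# A one-word Spanish "document"; its vector is simply that word's vector here.
spanish_doc = ["perro"]
spanish_vec = WORD_VECTORS["perro"]

doc_scope = extract_keywords(spanish_doc, spanish_vec, WORD_VECTORS, scope="document")
corpus_scope = extract_keywords(spanish_doc, spanish_vec, WORD_VECTORS, scope="corpus")
print(doc_scope)
print(corpus_scope)
```

With the document scope, the Spanish text can only receive its own Spanish word as a keyword. With the corpus scope, the English word "dog" is also a candidate and ranks highly, which is what would let NMF place English and Spanish texts about the same subject in one topic.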