x-tabdeveloping / topicwizard

Powerful topic model visualization in Python
https://x-tabdeveloping.github.io/topicwizard/
MIT License

Error with Gensim's NMF model #24

Closed MaggieMeow closed 9 months ago

MaggieMeow commented 9 months ago

Hi, there. First, I'd like to say that your package has been very helpful for the visualisation in my topic modelling project. Thank you!

While topicwizard works great with Gensim's LDA model for me, running it on Gensim's NMF model raised the error "AttributeError: 'Nmf' object has no attribute 'inference'" from topicwizard.visualize().

Here's the full error message:

{
    "name": "AttributeError",
    "message": "'Nmf' object has no attribute 'inference'",
    "stack": "---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/Users/magz/workspace/asia3012/topic-model-comp.ipynb Cell 9 line 3
      1 topic_pipeline = topicwizard.gensim_pipeline(dictionary, model=nmf)
      2 texts = [\" \".join(text) for text in df['cleanedText']]
----> 3 topicwizard.visualize(texts, pipeline=topic_pipeline)

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/topicwizard/app.py:297, in visualize(corpus, vectorizer, topic_model, pipeline, document_names, topic_names, exclude_pages, group_labels, port)
    295 if topic_names is None and hasattr(pipeline, \"topic_names\"):
    296     topic_names = pipeline.topic_names  # type: ignore
--> 297 app = get_dash_app(
    298     pipeline=pipeline,
    299     corpus=corpus,
    300     document_names=document_names,
    301     topic_names=topic_names,
    302     exclude_pages=exclude_pages,
    303     group_labels=group_labels,
    304 )
    305 return run_app(app, port=port)

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/topicwizard/app.py:88, in get_dash_app(corpus, exclude_pages, pipeline, vectorizer, topic_model, document_names, topic_names, group_labels)
     86 if pipeline is None:
     87     pipeline = Pipeline([(\"Vectorizer\", vectorizer), (\"Model\", topic_model)])
---> 88 blueprint = get_app_blueprint(
     89     pipeline=pipeline,
     90     corpus=corpus,
     91     document_names=document_names,
     92     topic_names=topic_names,
     93     exclude_pages=exclude_pages,
     94     group_labels=group_labels,
     95 )
     96 app = Dash(
     97     __name__,
     98     blueprint=blueprint,
   (...)
    108     ],
    109 )
    110 return app

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/topicwizard/app.py:32, in get_app_blueprint(pipeline, corpus, document_names, topic_names, *args, **kwargs)
     24 def get_app_blueprint(
     25     pipeline: Pipeline,
     26     corpus: Iterable[str],
   (...)
     30     **kwargs,
     31 ) -> DashBlueprint:
---> 32     blueprint = prepare_blueprint(
     33         pipeline=pipeline,
     34         corpus=corpus,
     35         document_names=document_names,
     36         topic_names=topic_names,
     37         create_blueprint=create_blueprint,
     38         *args,
     39         **kwargs,
     40     )
     41     return blueprint

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/topicwizard/blueprints/template.py:41, in prepare_blueprint(pipeline, corpus, create_blueprint, document_names, topic_names, group_labels, *args, **kwargs)
     35 vectorizer, topic_model = split_pipeline(None, None, pipeline)
     36 vocab = get_vocab(vectorizer)
     37 (
     38     document_term_matrix,
     39     document_topic_matrix,
     40     topic_term_matrix,
---> 41 ) = prepare_transformed_data(vectorizer, topic_model, corpus)
     42 nan_documents = np.isnan(document_topic_matrix).any(axis=1)
     43 n_nan_docs = np.sum(nan_documents)

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/topicwizard/prepare/utils.py:31, in prepare_transformed_data(vectorizer, topic_model, corpus)
     13 \"\"\"Transforms corpus with the topic model, and extracts important matrices.
     14 
     15 Parameters
   (...)
     28 topic_term_matrix: array of shape (n_topics, n_terms)
     29 \"\"\"
     30 document_term_matrix = vectorizer.transform(corpus)
---> 31 document_topic_matrix = topic_model.transform(document_term_matrix)
     32 topic_term_matrix = topic_model.components_
     33 return document_term_matrix, document_topic_matrix, topic_term_matrix

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/topicwizard/compatibility/gensim.py:162, in TopicModelWrapper.transform(self, X)
    149 \"\"\"Turns documents into topic distributions.
    150 
    151 Parameters
   (...)
    159     Sparse array of document-topic distributions.
    160 \"\"\"
    161 corpus = self._prepare_corpus(X)
--> 162 X_trans = self.model.inference(corpus)[0]
    163 # Normalizing probabilities (so that all docs add up to one)
    164 X_trans = (X_trans.T / X_trans.sum(axis=1)).T

AttributeError: 'Nmf' object has no attribute 'inference'"
}
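
The last frame of the traceback shows the wrapper calling `self.model.inference(corpus)`, a method the `Nmf` object does not have. One way such a wrapper could cope is a duck-typed fallback, sketched below. This is not the fix that was applied in topicwizard. `FakeLda` and `FakeNmf` are hypothetical stand-ins so the example runs without Gensim, and the fallback assumes the model supports `model[bow]` returning `(topic_id, weight)` pairs and exposes `num_topics`.

```python
from typing import List, Sequence, Tuple

BagOfWords = Sequence[Tuple[int, int]]


class FakeLda:
    """Hypothetical stand-in for a model that exposes inference()."""

    num_topics = 2

    def inference(self, corpus):
        # Mimics a (gamma, sstats) return value; only gamma is used below.
        gamma = [[1.0, 3.0] for _ in corpus]
        return gamma, None


class FakeNmf:
    """Hypothetical stand-in for a model that lacks inference()."""

    num_topics = 2

    def __getitem__(self, bow):
        # Sparse (topic_id, weight) pairs for a single document.
        return [(1, 1.0)]


def document_topic_rows(model, corpus: Sequence[BagOfWords]) -> List[List[float]]:
    """Return one row of topic weights per document, each summing to one."""
    if hasattr(model, "inference"):
        rows = [list(row) for row in model.inference(corpus)[0]]
    else:
        # Fallback: query documents one by one and densify the sparse result.
        rows = []
        for bow in corpus:
            dense = [0.0] * model.num_topics
            for topic_id, weight in model[bow]:
                dense[topic_id] = weight
            rows.append(dense)
    normalized = []
    for row in rows:
        total = sum(row)
        # Leave all-zero rows untouched instead of dividing by zero.
        normalized.append([v / total for v in row] if total > 0 else row)
    return normalized
```

Checking for the method with `hasattr` keeps the existing LDA path unchanged and only takes the slower per-document path for models that need it.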

Here's my code for reproducing the error:

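# Imports inferred from the names used below; the GensimNmf alias is an assumption.
import topicwizard
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.models.nmf import Nmf as GensimNmf

# df is a DataFrame whose 'cleanedText' column holds one list of tokens per document.
dictionary = Dictionary(df['cleanedText'])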
tfidf = TfidfModel(dictionary=dictionary)
corpus = [dictionary.doc2bow(doc) for doc in df['cleanedText']]
corpus_tfidf = tfidf[corpus]

nmf = GensimNmf(
    corpus=corpus_tfidf,
    num_topics=10,
    id2word=dictionary,
    chunksize=1000,
    passes=5,
    eval_every=10,
    minimum_probability=0,
    random_state=0,
    kappa=1,
)

topic_pipeline = topicwizard.gensim_pipeline(dictionary, model=nmf)
texts = [" ".join(text) for text in df['cleanedText']]
topicwizard.visualize(texts, pipeline=topic_pipeline)

Your help will be much appreciated!

x-tabdeveloping commented 9 months ago

Hello Maggie, thanks for the kind words and for reporting the issue. I will get to work on this and try to get it fixed right away. :D

In the meantime you could try using scikit-learn's implementation of NMF, as I know for a fact that that one works just fine.

It would look something akin to this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.pipeline import make_pipeline
import topicwizard

texts = [" ".join(text) for text in df["cleanedText"]]

# You can also use MiniBatchNMF with partial_fit() if you run out of memory
topic_pipeline = make_pipeline(TfidfVectorizer(), NMF(10))
topic_pipeline.fit(texts)

topicwizard.visualize(texts, pipeline=topic_pipeline)

x-tabdeveloping commented 9 months ago

If you do not want to refit the model, you can also turn a Gensim Nmf model into an sklearn pipeline like this:

from topicwizard.compatibility.gensim import DictionaryVectorizer
from sklearn.decomposition import NMF
from sklearn.pipeline import make_pipeline

sknmf = NMF(10)
sknmf.components_ = nmf.get_topics()
vectorizer = DictionaryVectorizer(dictionary)

topic_pipeline = make_pipeline(vectorizer, sknmf)
...
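
The transplant above only works if the copied matrix has one row per topic and one column per dictionary term, which is the layout both `Nmf.get_topics()` and scikit-learn's `components_` use. A minimal sketch of a sanity check you could run before building the pipeline follows. `check_components_shape` is a hypothetical helper, not part of topicwizard, Gensim, or scikit-learn.

```python
def check_components_shape(components, n_topics, n_terms):
    """Raise ValueError unless components has shape (n_topics, n_terms).

    Works with nested lists as well as NumPy arrays, since it only
    relies on len() and iteration over rows.
    """
    n_rows = len(components)
    if n_rows != n_topics:
        raise ValueError(f"expected {n_topics} topic rows, got {n_rows}")
    for index, row in enumerate(components):
        if len(row) != n_terms:
            raise ValueError(
                f"row {index} has {len(row)} terms, expected {n_terms}"
            )
    return True


# Intended use with the objects from the thread (not executed here):
# check_components_shape(nmf.get_topics(), 10, len(dictionary))
```

A mismatch here would usually mean the dictionary passed to the vectorizer is not the one the model was trained with.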

MaggieMeow commented 9 months ago

Thank you very much for your suggestions!

x-tabdeveloping commented 9 months ago

Anytime. The fix is on the main branch now, but it comes bundled with a bunch of other changes that I have made to accommodate contextually sensitive topic models, which are coming with the turftopic package.

The next release of topicwizard will ship with the fix.