Clustering topic models usually have a clustering method that can, by default, learn the number of topics from the data.
This is convenient, but it often results in too many topics. Both BERTopic and Top2Vec have methods for reducing the number of topics to a desired amount.
Details
The two approaches seem to be the following; both the documentation and the code in the two repositories are incredibly cryptic about this (plus the papers contain almost no information on it).
BERTopic
Run Agglomerative Clustering on the cluster centroids (topic vectors), then merge all topics that belong to the same cluster.
I found this in the paper:
"Finally, by iteratively merging the c-TF-IDF representations of the least common topic with its most similar one, we can reduce the number of topics to a user-specified value."
(Like, what the hell does this mean?)
As far as I can tell from digging through the code, they simply recalculate the c-TF-IDF representations with the new merged labels, and that's it.
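A minimal sketch of how I understand this reduction could work (the function name and signature are mine; sklearn's AgglomerativeClustering stands in for whatever BERTopic uses internally, so treat this as an assumption, not their actual implementation):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def reduce_topics_agglomerative(topic_vectors, labels, n_reduce_to):
    """Merge topics by clustering their centroid vectors.

    topic_vectors: (n_topics, dim) array of topic centroids.
    labels: per-document topic labels (ints in [0, n_topics)).
    Returns new per-document labels with at most n_reduce_to topics.
    """
    clustering = AgglomerativeClustering(n_clusters=n_reduce_to)
    # Each old topic gets assigned to one of the merged clusters.
    old_to_new = clustering.fit_predict(topic_vectors)
    # Relabel documents; the c-TF-IDF representations would then be
    # recalculated from these merged labels.
    return np.array([old_to_new[label] for label in labels])
```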
Top2Vec
The algorithm seems to be something like this:
If the number of topics is already small enough, throw an error
Until the number of topics is small enough:
Find the smallest topic
Find the closest topic to this (by dot product?)
Merge by taking a weighted average of the two topic vectors:
combined_vec = self._l2_normalize(((top_vec_smallest * smallest_size) + (top_vec_most_sim * most_sim_size)) / (smallest_size + most_sim_size))
Recalculate term importances
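The loop above could be sketched roughly like this (function and variable names are mine; I'm assuming unit-normalized topic vectors, so the dot product is cosine similarity):

```python
import numpy as np

def l2_normalize(vec):
    return vec / np.linalg.norm(vec)

def reduce_topics_merge_smallest(topic_vectors, topic_sizes, n_reduce_to):
    """Iteratively merge the smallest topic into its most similar one."""
    if len(topic_vectors) <= n_reduce_to:
        raise ValueError("Number of topics is already small enough.")
    vectors = [l2_normalize(v) for v in topic_vectors]
    sizes = list(topic_sizes)
    while len(vectors) > n_reduce_to:
        smallest = int(np.argmin(sizes))
        # Dot product of unit vectors == cosine similarity.
        sims = np.array([vectors[smallest] @ v for v in vectors])
        sims[smallest] = -np.inf  # a topic can't merge with itself
        most_sim = int(np.argmax(sims))
        # Size-weighted average of the two vectors, re-normalized.
        combined = l2_normalize(
            (vectors[smallest] * sizes[smallest]
             + vectors[most_sim] * sizes[most_sim])
            / (sizes[smallest] + sizes[most_sim])
        )
        vectors[most_sim] = combined
        sizes[most_sim] += sizes[smallest]
        del vectors[smallest]
        del sizes[smallest]
        # Term importances would be recalculated here.
    return np.stack(vectors), sizes
```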
Implementation
I think the most sensible option would be to make this part of the fit() method, as you still have access to the predicted labels.
The models could have a n_reduce_to parameter passed at initialization like this:
model = ClusteringTopicModel(n_reduce_to=10)
Additionally, we could have a method that reduces the number of topics when given the document-topic matrix. I think it would be smartest if it returned a new model instead of modifying the original one:
model = ClusteringTopicModel()
doc_topic_matrix = model.fit_transform(corpus)
reduced_model = model.reduce_topics(10, doc_topic_matrix)
To specify which method users want to use for topic reduction, we should have a merging_method class attribute, with the possible values "agglomerative_clustering" and "merge_smallest".
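Putting the pieces together, the proposed surface could look something like this. Only the ClusteringTopicModel name and the parameters come from the snippets above; the validation, the deepcopy, and the labels_ attribute are my assumptions, and the actual merging logic is left out:

```python
import copy
import numpy as np

class ClusteringTopicModel:
    """Hypothetical sketch of the proposed interface, not a real class."""

    def __init__(self, n_reduce_to=None, merging_method="agglomerative_clustering"):
        if merging_method not in ("agglomerative_clustering", "merge_smallest"):
            raise ValueError(f"Unknown merging_method: {merging_method!r}")
        self.n_reduce_to = n_reduce_to
        self.merging_method = merging_method

    def reduce_topics(self, n_reduce_to, doc_topic_matrix):
        """Return a new model with merged topics; self is left untouched."""
        reduced = copy.deepcopy(self)
        # Hard topic assignments recovered from the document-topic matrix.
        labels = np.argmax(doc_topic_matrix, axis=1)
        # The actual merging (agglomerative clustering of topic vectors,
        # or iterative smallest-topic merging) would go here.
        reduced.labels_ = labels
        reduced.n_reduce_to = n_reduce_to
        return reduced
```

Returning a copy keeps the fitted model reusable, so users can try several target topic counts from the same fit.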