Clustering topic models usually have a clustering method that can, by default, learn the number of topics from the data.
This is convenient, but it often results in too many topics. Both BERTopic and Top2Vec have methods for reducing the number of topics to a desired amount.
Details
The two approaches seem to be the following; both the documentation and the code in the two repositories are incredibly cryptic about this (plus the papers contain almost no information on it).
BERTopic
Run Agglomerative Clustering on the cluster centroids (topic vectors), then merge all topics that belong to the same cluster.
I found this in the paper:
"Finally, by iteratively merging the c-TF-IDF representations of the least common topic with its most similar one, we can reduce the number of topics to a user-specified value."
(Like, what the hell does this mean?)
As far as I can tell from digging through the code, they simply recalculate the c-TF-IDF representations with the new merged labels, and that's it.
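A minimal sketch of how I understand this reduction could work (the function name and signature are mine; sklearn's AgglomerativeClustering stands in for whatever BERTopic uses internally, so treat this as an assumption, not their actual implementation):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def reduce_topics_agglomerative(topic_vectors, labels, n_reduce_to):
    """Merge topics by clustering their centroid vectors.

    topic_vectors: (n_topics, dim) array of topic centroids.
    labels: per-document topic labels (ints in [0, n_topics)).
    Returns new per-document labels with at most n_reduce_to topics.
    """
    clustering = AgglomerativeClustering(n_clusters=n_reduce_to)
    # Each old topic gets assigned to one of the merged clusters.
    old_to_new = clustering.fit_predict(topic_vectors)
    # Relabel documents; the c-TF-IDF representations would then be
    # recalculated from these merged labels.
    return np.array([old_to_new[label] for label in labels])
```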
Top2Vec
The algorithm seems to be something like this:
If the number of topics is already small enough, throw an error
Until the number of topics is small enough:
Find the smallest topic
Find the closest topic to this (by dot product?)
Merge by taking a weighted average of the two topic vectors:
combined_vec = self._l2_normalize(((top_vec_smallest * smallest_size) + (top_vec_most_sim * most_sim_size)) / (smallest_size + most_sim_size))
Recalculate term importances
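The loop above could be sketched roughly like this (function and variable names are mine; I'm assuming unit-normalized topic vectors, so the dot product is cosine similarity):

```python
import numpy as np

def l2_normalize(vec):
    return vec / np.linalg.norm(vec)

def reduce_topics_merge_smallest(topic_vectors, topic_sizes, n_reduce_to):
    """Iteratively merge the smallest topic into its most similar one."""
    if len(topic_vectors) <= n_reduce_to:
        raise ValueError("Number of topics is already small enough.")
    vectors = [l2_normalize(v) for v in topic_vectors]
    sizes = list(topic_sizes)
    while len(vectors) > n_reduce_to:
        smallest = int(np.argmin(sizes))
        # Dot product of unit vectors == cosine similarity.
        sims = np.array([vectors[smallest] @ v for v in vectors])
        sims[smallest] = -np.inf  # a topic can't merge with itself
        most_sim = int(np.argmax(sims))
        # Size-weighted average of the two vectors, re-normalized.
        combined = l2_normalize(
            (vectors[smallest] * sizes[smallest]
             + vectors[most_sim] * sizes[most_sim])
            / (sizes[smallest] + sizes[most_sim])
        )
        vectors[most_sim] = combined
        sizes[most_sim] += sizes[smallest]
        del vectors[smallest]
        del sizes[smallest]
        # Term importances would be recalculated here.
    return np.stack(vectors), sizes
```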
Implementation
I think the most sensible option would be to make this part of the fit() method, as you still have access to the predicted labels.
The models could have a n_reduce_to parameter passed at initialization like this:
model = ClusteringTopicModel(n_reduce_to=10)
Additionally, we could have a method that reduces the number of topics when given the document-topic matrix. I think it would be smartest if it returned a new model instead of modifying the original one:
model = ClusteringTopicModel()
doc_topic_matrix = model.fit_transform(corpus)
reduced_model = model.reduce_topics(10, doc_topic_matrix)
To specify which method users want to use for topic reduction, we should have a merging_method class attribute, with the possible values "agglomerative_clustering" and "merge_smallest".
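Putting the pieces together, the proposed surface could look something like this. Only the ClusteringTopicModel name and the parameters come from the snippets above; the validation, the deepcopy, and the labels_ attribute are my assumptions, and the actual merging logic is left out:

```python
import copy
import numpy as np

class ClusteringTopicModel:
    """Hypothetical sketch of the proposed interface, not a real class."""

    def __init__(self, n_reduce_to=None, merging_method="agglomerative_clustering"):
        if merging_method not in ("agglomerative_clustering", "merge_smallest"):
            raise ValueError(f"Unknown merging_method: {merging_method!r}")
        self.n_reduce_to = n_reduce_to
        self.merging_method = merging_method

    def reduce_topics(self, n_reduce_to, doc_topic_matrix):
        """Return a new model with merged topics; self is left untouched."""
        reduced = copy.deepcopy(self)
        # Hard topic assignments recovered from the document-topic matrix.
        labels = np.argmax(doc_topic_matrix, axis=1)
        # The actual merging (agglomerative clustering of topic vectors,
        # or iterative smallest-topic merging) would go here.
        reduced.labels_ = labels
        reduced.n_reduce_to = n_reduce_to
        return reduced
```

Returning a copy keeps the fitted model reusable, so users can try several target topic counts from the same fit.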