Rationale
Currently the package only implements Soft c-TF-IDF, which is a "generalization" of c-TF-IDF but unfortunately not identical to it.
This is not a huge issue: as far as I understand, the two sets of values are monotonically related, so the ranking of words within each topic, and hence the topic descriptions, should not be affected.
The reason it would be nice to have plain c-TF-IDF as well is to be able to replicate BERTopic's behaviour exactly in the package.
Implementation
ClusteringTopicModel should offer the following feature importance methods as options: soft-c-tf-idf, c-tf-idf, centroid.
We should merge the soft_ctf_idf.py and centroid_distance.py files into a single module (post_hoc_importance.py, feature_importance, or similar) that contains all three methods for post-hoc importance estimation.
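As a rough starting point for the c-tf-idf option, here is a minimal NumPy sketch. It assumes the formulation described in BERTopic's documentation: L1-normalized per-cluster term frequencies multiplied by log(1 + A / f_x), where A is the average number of words per cluster and f_x is the total frequency of word x. The function name ctf_idf and its signature are hypothetical, not part of the package, and details such as rounding of A may differ from BERTopic's actual implementation, so this would need checking before claiming exact replication.

```python
import numpy as np


def ctf_idf(doc_term_matrix, labels):
    """Hypothetical helper: c-TF-IDF importance per cluster.

    doc_term_matrix: (n_documents, n_vocab) array of word counts.
    labels: (n_documents,) array of cluster labels.
    Returns (classes, importance), importance has shape (n_classes, n_vocab).
    """
    doc_term_matrix = np.asarray(doc_term_matrix, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Sum word counts over all documents in each cluster.
    class_counts = np.stack(
        [doc_term_matrix[labels == c].sum(axis=0) for c in classes]
    )
    # Term frequency: counts L1-normalized within each cluster.
    words_per_class = class_counts.sum(axis=1, keepdims=True)
    tf = class_counts / np.maximum(words_per_class, 1e-12)
    # Inverse class frequency: log(1 + average words per cluster / word frequency).
    word_freq = class_counts.sum(axis=0)
    avg_words = words_per_class.mean()
    idf = np.log(1.0 + avg_words / np.maximum(word_freq, 1e-12))
    return classes, tf * idf


# Toy example: three documents, three vocabulary items, two clusters.
counts = [[2, 0, 1], [1, 0, 0], [0, 3, 1]]
classes, importance = ctf_idf(counts, [0, 0, 1])
top_word_per_cluster = importance.argmax(axis=1)
```

Keeping each method as a plain function that returns an (n_classes, n_vocab) matrix would make it straightforward to place all three in the merged module and select one by name.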