Closed namespace-Pt closed 2 years ago
Hi @namespace-Pt ,
Thanks for the question. You are right that each word can have multiple contextualized embeddings, and they can be mapped to different clusters during the clustering step in our algorithm. However, when deriving the final results, we take the average of the latent contextualized embeddings as the (context-free) representation for each word, which is then used for computing the topic-word distribution.
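A minimal sketch of that averaging step, using toy vectors (the variable names, dimensions, and the cosine-based topic scoring here are illustrative assumptions, not the repo's actual code):

```python
import numpy as np

# Hypothetical contextualized embeddings: each word maps to the list of
# per-occurrence latent vectors collected from the corpus.
contextual_embs = {
    "bank": [np.array([0.9, 0.1]), np.array([0.1, 0.9])],  # two different contexts
    "river": [np.array([0.2, 0.8])],
}

# Average the contextualized embeddings to obtain a single context-free
# representation per word.
word_embs = {w: np.mean(np.stack(vs), axis=0) for w, vs in contextual_embs.items()}

# A topic-word score can then be computed, e.g., as cosine similarity between
# each averaged word vector and a (hypothetical) topic/cluster center.
topic_center = np.array([0.5, 0.5])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = {w: cosine(v, topic_center) for w, v in word_embs.items()}
```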
I hope this helps. Please let me know if anything remains unclear.
Best, Yu
Ok, I got it, thank you.
So the averaging step is like in the paper "Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too!"? Did you reweight the averaged token embeddings? Also, how do you deal with subwords?

So the averaging step is like in the paper "Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too!"?
Yes. The difference is that TopClus uses contextualized embeddings (instead of context-free embeddings as in that paper) for clustering.
Did you reweight the averaged token embeddings?
No, we do not apply any reweighting step.
Also, how do you deal with subwords?
We remove subwords from the vocabulary when deriving the final results, so our results will not contain subwords.
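A minimal sketch of such a filtering step, assuming BERT-style WordPiece tokens where continuation subwords carry a `##` prefix (the exact filtering logic in TopClus may differ):

```python
# Hypothetical vocabulary containing both full words and WordPiece subwords.
vocab = ["topic", "model", "##ing", "embed", "##ding", "cluster"]

# Keep only full words: BERT-style continuation subwords start with "##",
# so dropping them leaves a subword-free vocabulary for the final results.
def remove_subwords(tokens):
    return [t for t in tokens if not t.startswith("##")]

filtered = remove_subwords(vocab)
```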
Thank you.
I like your paper, but I find it confusing how you handle multiple embeddings of the same word/token. Is there any chance that different embeddings of the same word are mapped to different clusters, with all of them quite close to their cluster centers in the spherical space? How do you deal with that?