wwyi1828 / CluSiam

Improving Representation Learning for Histopathologic Images with Cluster Constraints
MIT License

about cluster loss #1

Closed akidway closed 5 months ago

akidway commented 5 months ago

Hi, thank you for your nice work. In the paper, I understand Equations (3) and (4) as assigning Z to cluster centers and forming the cluster centers, respectively. But I'm a little confused about Equation (5), which computes the mean of the off-diagonal elements. Could you provide an intuitive explanation of why the mean of the off-diagonal elements is calculated?

wwyi1828 commented 5 months ago

Hi,

Thanks for your question!

The intuition behind calculating the mean of the off-diagonal elements in Equation (5) is to encourage the cluster centroids to be far apart and uniformly distributed in the representation space. The off-diagonal elements represent the pairwise cosine similarities between different centroids. By minimizing the mean of these values, we are essentially pushing the centroids away from each other on the hypersphere. This encourages well-separated centroid representations that are evenly distributed in the space.
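For concreteness, here is a minimal PyTorch sketch of that term, assuming the centroids are a `(k, d)` tensor computed from the current batch. The function and variable names are illustrative assumptions, not the repository's exact code:

```python
import torch
import torch.nn.functional as F

def cluster_separation_loss(centroids):
    # centroids: (k, d) tensor of cluster centroids from the current batch (assumed shape)
    c = F.normalize(centroids, dim=1)        # project centroids onto the unit hypersphere
    sim = c @ c.t()                          # (k, k) pairwise cosine-similarity matrix
    k = sim.size(0)
    mask = ~torch.eye(k, dtype=torch.bool, device=sim.device)
    return sim[mask].mean()                  # minimizing this pushes the centroids apart
```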

Classical contrastive learning frameworks, such as MoCo and SimCLR, model instance discrimination and instance alignment together directly, using positive and negative pairs. Knowledge-distillation-like methods, such as BYOL and SimSiam, only consider positive pairs and focus solely on instance alignment without explicitly modeling instance discrimination.

Minimizing the mean of the off-diagonal elements in Equation (5) can be seen as an indirect way of reintroducing instance discrimination modeling. Pushing cluster centroids away from each other inherently pushes apart instances belonging to different centroids. This approach complements the original invariance loss without conflicting with it.
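As a rough illustration of how the two terms can sit side by side, here is a sketch that adds the centroid-separation term from above to a SimSiam-style invariance loss. The weighting `lam` and the exact form of the invariance term are assumptions for this sketch, not the paper's exact formulation; it reuses `cluster_separation_loss` from the previous snippet:

```python
import torch.nn.functional as F

def combined_loss(p1, p2, z1, z2, centroids, lam=1.0):
    # SimSiam-style negative cosine similarity with stop-gradient on the targets
    inv = -(F.cosine_similarity(p1, z2.detach(), dim=1).mean()
            + F.cosine_similarity(p2, z1.detach(), dim=1).mean()) / 2
    # Centroid-separation term (assumed weighting lam)
    clu = cluster_separation_loss(centroids)
    return inv + lam * clu
```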

Hope this clarifies the idea behind Equation (5). Let me know if you have any other questions!

akidway commented 5 months ago

Thank you for your detailed explanation.

Your clarification provided me with a clear understanding of Equation (5). By minimizing the cluster loss, a batch of representations forms k cluster centroids, which are pushed away from each other.

Additionally, while cluster centroids are formed within a batch, is there a specific requirement for the batch size? For instance, should the batch size be as large as possible?

wwyi1828 commented 5 months ago

Thank you very much for your question! Regarding the batch size, here are my thoughts:

Although the cluster centroids are computed online within the current batch, the assigner's parameters actually carry some information about the historical clustering distribution. Therefore, even if the batch size is not large, the training process can still capture the overall distribution characteristics of the data.
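To make that concrete, a toy version of such an assigner might look like the following. Because its weights are updated by gradient descent across batches, they retain information about how earlier batches were clustered, even though the centroids themselves are recomputed from each new batch. The structure and names here are assumptions for illustration, not the repository's actual implementation:

```python
import torch.nn as nn

class SoftAssigner(nn.Module):
    # Maps representations to k soft cluster assignments and batch-wise centroids.
    def __init__(self, dim, k):
        super().__init__()
        self.proj = nn.Linear(dim, k, bias=False)  # learnable parameters persist across batches

    def forward(self, z):
        assign = self.proj(z).softmax(dim=1)       # (n, k) soft assignments
        centroids = assign.t() @ z                 # (k, d) assignment-weighted centroids
        centroids = centroids / assign.sum(dim=0).unsqueeze(1).clamp(min=1e-6)
        return assign, centroids
```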

In my experiments, the smallest batch size I tried was 256. Compared to a batch size of 512, performance decreased, but training still proceeded normally. I haven't tried smaller batch sizes, but intuitively, larger batch sizes should make training more stable.

akidway commented 5 months ago

Thank you very much.