Hello Qi,
This is less of an issue and more of a comment/question. Firstly, thank you for this piece of research. I believe augmentation is the Achilles' heel of contrastive learning, and it is important to have a firm grip on it. Your theory goes some distance toward explaining why certain augmentations work while others may not. I have a couple of questions:
Why do you measure the confusion ratio only on the latent/hidden-layer outputs and not on the raw data? I understand you are trying to obtain a guarantee on the encoder, but why not compute it on the raw data as well? Am I missing something? That way one could measure confusion irrespective of the network/model.
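To make the suggestion concrete, here is a minimal sketch of a model-free variant, assuming the confusion ratio can be read as the fraction of augmented samples whose nearest original sample (in raw input space) carries a different class label — your paper's exact definition may well differ, so treat the function and names below as hypothetical:

```python
import numpy as np

def confusion_ratio(x, y, augment, rng):
    """Fraction of augmented points whose nearest original point has a different label."""
    x_aug = augment(x, rng)
    d2 = ((x_aug[:, None, :] - x[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)  # ignore each augmented point's own origin
    nearest = d2.argmin(axis=1)
    return float((y[nearest] != y).mean())

rng = np.random.default_rng(0)
# toy dataset: two well-separated Gaussian classes in 2-D
x = np.concatenate([rng.normal(-3, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

weak = lambda x, rng: x + rng.normal(0, 0.1, x.shape)    # mild noise augmentation
strong = lambda x, rng: x + rng.normal(0, 5.0, x.shape)  # aggressive noise augmentation

r_weak = confusion_ratio(x, y, weak, rng)
r_strong = confusion_ratio(x, y, strong, rng)
print(r_weak, r_strong)  # the stronger augmentation mixes the classes more
```

Since no encoder appears anywhere in this computation, the quantity could be reported for any dataset before training, which is what I had in mind by "irrespective of the network/model".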
Right now you measure augmentation strength by the parameters of the augmentation. While that may work to understand and characterise it, it is not generalisable: it is specific to each augmentation, and it will only work on images, not on text/tabular data. Have you thought of extending it so that different augmentations can be compared on a general scale? Would it be possible to connect it to mutual information, à la the InfoMin hypothesis (https://arxiv.org/abs/2005.10243)?
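As a toy illustration of what such a general scale might look like: the mutual information I(x; t(x)) between a sample and its augmented view puts every augmentation in the same units (nats), regardless of its parameters or modality. The sketch below uses 1-D Gaussian data with additive-noise augmentations, where this MI has a closed form; for real data one would need a neural estimator (e.g. an InfoNCE bound), so this is only meant to convey the idea:

```python
import math

def gaussian_mi(signal_var, noise_var):
    """I(x; x + n) in nats for x ~ N(0, signal_var), n ~ N(0, noise_var)."""
    return 0.5 * math.log(1.0 + signal_var / noise_var)

# two augmentation strengths compared in the same units (nats)
mi_weak = gaussian_mi(1.0, 0.1)    # mild additive noise preserves more information
mi_strong = gaussian_mi(1.0, 5.0)  # aggressive additive noise destroys more of it
print(mi_weak, mi_strong)
```

On such a scale, "augmentation strength" would simply be how much mutual information the augmentation destroys, which seems directly comparable across crops, noise, masking, and so on.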
Regards,
Aditya