
Code for our ACL 2021 paper - ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer

Question on table 1 in ACL 2021 paper #6

Open jin8 opened 3 years ago

jin8 commented 3 years ago

Hi, I have a question on the dataset. Section 4.1 (Setups) mentions that for the unsupervised experiment setting, unlabeled texts from STS12-16 + STSb + SICK-R are used for training. I have looked through the dataset files, and the number of unlabeled samples I found for STS16 (the last row) does not match your Table 1: I found 8002.
I also could not find any mention in the paper of expanding the dataset by concatenation, but the numbers make sense if I double the labeled (train/valid/test) samples and the unlabeled samples. Did you double the size of the train/valid/test samples? And if they are not doubled, does that hurt performance?

Thank you, Jin

yym6472 commented 3 years ago

Hi, I'm not sure where you got your dataset files, but we obtained all STS datasets through the SentEval toolkit; you can also download them using the script data/get_transfer_data.bash.

For STS15 and STS16, we noticed that the original datasets contain many unannotated sentence pairs (i.e., the paired texts are provided but the similarity score is missing). We also use the unlabeled texts from those unannotated samples to fine-tune our model in the unsupervised experiments (I wonder if this could be the reason the numbers do not match). For STS16, there are 9183 text pairs in total, yielding 2 * 9183 = 18366 unlabeled texts (since each pair contributes two sentences). However, only 1379 of those pairs are annotated, and we use these 1379 samples to test the trained model.
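If it helps to reproduce the counts, here is a minimal sketch of the counting logic described above. It assumes a SentEval-style layout where sentence pairs live in an `STS.input.*.txt` file (one tab-separated pair per line) and gold scores live in a parallel `STS.gs.*.txt` file, with the score field left empty for unannotated pairs; the file names, paths, and on-disk format below are assumptions, not the repo's actual preprocessing code.

```python
# Sketch: count annotated vs. unannotated STS pairs and build the unlabeled
# text pool (two sentences per pair), as described above.
# File names/format are assumptions based on the SentEval STS layout.
import os


def load_sts_split(input_path, gs_path):
    with open(input_path, encoding="utf-8") as f:
        pairs = [line.rstrip("\n").split("\t")[:2] for line in f if line.strip()]
    with open(gs_path, encoding="utf-8") as f:
        scores = [line.strip() for line in f]

    annotated, unlabeled_texts = [], []
    for (s1, s2), score in zip(pairs, scores):
        unlabeled_texts.extend([s1, s2])  # every sentence is usable for unsupervised training
        if score:                         # empty score field -> unannotated pair
            annotated.append((s1, s2, float(score)))
    return annotated, unlabeled_texts


if __name__ == "__main__":
    data_dir = "data/downstream/STS/STS16-en-test"  # hypothetical path
    track = "headlines"                             # one of the STS16 tracks
    annotated, texts = load_sts_split(
        os.path.join(data_dir, f"STS.input.{track}.txt"),
        os.path.join(data_dir, f"STS.gs.{track}.txt"),
    )
    print(f"{track}: {len(annotated)} annotated pairs, {len(texts)} unlabeled sentences")
```

Summing these counts over all STS16 tracks should give 9183 pairs (18366 sentences) with 1379 annotated, matching the numbers above, provided the assumed format matches the downloaded files.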

We did not conduct experiments that sample only one sentence from each text pair, but I suspect the results would not be hurt much. As we show in the few-shot experiments (Figure 6), performance remains comparable when training on 10000 texts (about 11% of the full dataset).
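For reference, the variant the question hints at (keeping only one sentence per pair so the unlabeled pool is not "doubled") would be a small change on top of the pool built above. This is a hypothetical sketch of that untested setting, not an experiment from the paper:

```python
import random


def one_sentence_per_pair(pairs, seed=42):
    """Hypothetical variant: keep a single randomly chosen sentence from each
    (s1, s2) pair, roughly halving the unlabeled pool used for fine-tuning."""
    rng = random.Random(seed)
    return [rng.choice(pair) for pair in pairs]
```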