
Code for our ACL 2021 paper - ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer

Question on table 1 in ACL 2021 paper #6

Open jin8 opened 3 years ago

jin8 commented 3 years ago

Hi, I have a question on the dataset. Section 4.1 (Setups) mentions that for the unsupervised experiment setting, unlabeled texts from STS12-16 + STSb + SICK-R are used for training. I have looked through the dataset files, and the number of unlabeled samples I found for STS16 (the last row) does not match your Table 1: I found 8002.
I also could not find any mention in the paper of expanding the dataset by concatenation, but the numbers make sense if I double the labeled (train/valid/test) samples and the unlabeled samples. Did you double the size of the train/valid/test samples? And if they are not doubled, does that hurt performance?

Thank you, Jin

yym6472 commented 3 years ago

Hi, I'm not sure where you got your dataset files, but we obtained all STS datasets through the SentEval toolkit; you can also download them using the script data/get_transfer_data.bash.

For STS15 and STS16, we noticed that the original datasets contain many unannotated sentence pairs (i.e., the paired texts are provided but the similarity score is missing). We also use the unlabeled texts from those unannotated samples to fine-tune our model in the unsupervised experiments (I wonder if this could be the reason the numbers do not match). For STS16, there are 9183 text pairs in total, yielding 2 * 9183 = 18366 unlabeled texts (since each pair contributes two sentences). However, only 1379 of those pairs are annotated, and we use these 1379 samples to test the trained model.
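If it helps to reproduce the counts, here is a minimal sketch of the counting logic described above. It assumes a SentEval-style layout where sentence pairs live in an `STS.input.*.txt` file (one tab-separated pair per line) and gold scores live in a parallel `STS.gs.*.txt` file, with the score field left empty for unannotated pairs; the file names, paths, and on-disk format below are assumptions, not the repo's actual preprocessing code.

```python
# Sketch: count annotated vs. unannotated STS pairs and build the unlabeled
# text pool (two sentences per pair), as described above.
# File names/format are assumptions based on the SentEval STS layout.
import os


def load_sts_split(input_path, gs_path):
    with open(input_path, encoding="utf-8") as f:
        pairs = [line.rstrip("\n").split("\t")[:2] for line in f if line.strip()]
    with open(gs_path, encoding="utf-8") as f:
        scores = [line.strip() for line in f]

    annotated, unlabeled_texts = [], []
    for (s1, s2), score in zip(pairs, scores):
        unlabeled_texts.extend([s1, s2])  # every sentence is usable for unsupervised training
        if score:                         # empty score field -> unannotated pair
            annotated.append((s1, s2, float(score)))
    return annotated, unlabeled_texts


if __name__ == "__main__":
    data_dir = "data/downstream/STS/STS16-en-test"  # hypothetical path
    track = "headlines"                             # one of the STS16 tracks
    annotated, texts = load_sts_split(
        os.path.join(data_dir, f"STS.input.{track}.txt"),
        os.path.join(data_dir, f"STS.gs.{track}.txt"),
    )
    print(f"{track}: {len(annotated)} annotated pairs, {len(texts)} unlabeled sentences")
```

Summing these counts over all STS16 tracks should give 9183 pairs (18366 sentences) with 1379 annotated, matching the numbers above, provided the assumed format matches the downloaded files.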

We did not conduct experiments that sample only one sentence from each text pair, but I suspect the results would not be hurt much. As we show in the few-shot experiments (Figure 6), performance remains comparable when training on 10000 texts (about 11% of the full dataset).
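For reference, the variant the question hints at (keeping only one sentence per pair so the unlabeled pool is not "doubled") would be a small change on top of the pool built above. This is a hypothetical sketch of that untested setting, not an experiment from the paper:

```python
import random


def one_sentence_per_pair(pairs, seed=42):
    """Hypothetical variant: keep a single randomly chosen sentence from each
    (s1, s2) pair, roughly halving the unlabeled pool used for fine-tuning."""
    rng = random.Random(seed)
    return [rng.choice(pair) for pair in pairs]
```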