Open · jin8 opened this issue 3 years ago
Hi, I have a question about the dataset. Section 4.1 (Setups) mentions that, for the unsupervised experiment setting, unlabeled texts from STS12-16 + STSb + SICK-R are used for training. I have looked through the dataset files, and the number of unlabeled texts I found in STS16 (the last row) does not match Table 1: I found 8002.
In the paper I could not find any mention of expanding the dataset by concatenation, but the numbers do make sense once I double the labeled (train/valid/test) samples and the unlabeled samples. Did you double the size of the train/valid/test samples? And does performance suffer if the data are not doubled?
Thank you, Jin
Hi, I don't know where you got your dataset files, but we obtain all STS datasets through the SentEval toolkit; you can also download them using the script data/get_transfer_data.bash.
For STS15 and STS16, we noticed that the original datasets contain many unannotated sentence pairs (i.e., the paired texts are provided but the similarity score is missing). In the unsupervised experiments we also use the texts from those unannotated samples to fine-tune our model (I suspect this is why the numbers do not match). For STS16, 9183 text pairs are provided in total, which yields 2 * 9183 = 18366 unlabeled texts (each pair contributes two sentences). However, only 1379 of those pairs are annotated, and we use these 1379 samples to evaluate the trained model.
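To make the counting concrete, here is a minimal sketch of how the unlabeled and labeled sets can be derived from one STS track. The file layout (a tab-separated `STS.input.*` pair file, line-aligned with an `STS.gs.*` gold-score file that is blank where unannotated) is an assumption based on the usual SentEval format, not necessarily our exact loading code:

```python
def load_sts_track(input_path, gs_path):
    """Split one STS track into unlabeled texts and annotated pairs.

    Assumes SentEval-style files: each line of input_path holds a
    tab-separated sentence pair; the same line of gs_path holds the
    gold similarity score, or is blank if the pair is unannotated.
    """
    unlabeled_texts, labeled_pairs = [], []
    with open(input_path, encoding="utf-8") as f_in, \
         open(gs_path, encoding="utf-8") as f_gs:
        for pair_line, score_line in zip(f_in, f_gs):
            sent1, sent2 = pair_line.rstrip("\n").split("\t")[:2]
            # Every pair contributes two unlabeled training texts,
            # hence 2 * 9183 = 18366 texts for STS16.
            unlabeled_texts.extend([sent1, sent2])
            score = score_line.strip()
            # Only pairs with a gold score (1379 for STS16) are kept
            # for evaluating the trained model.
            if score:
                labeled_pairs.append((sent1, sent2, float(score)))
    return unlabeled_texts, labeled_pairs
```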
We did not run experiments that sample only one sentence from each text pair, but I would guess the results would not suffer much. As the few-shot experiments show (Figure 6), performance remains comparable when training on 10000 texts (about 11% of the full dataset).
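For completeness, the two variants discussed above would look roughly like this. Both functions are hypothetical sketches (we did not run the one-sentence-per-pair setting), not code from our experiments:

```python
import random

def sample_one_per_pair(pairs, seed=42):
    """Hypothetical: keep one random sentence per pair instead of both,
    halving the unlabeled set (e.g. 18366 -> 9183 texts for STS16)."""
    rng = random.Random(seed)
    return [rng.choice(pair) for pair in pairs]

def few_shot_subset(texts, n=10000, seed=42):
    """Few-shot setting as in Figure 6: train on a random subset of
    n texts (10000 is about 11% of the full unlabeled set)."""
    rng = random.Random(seed)
    return rng.sample(texts, n)
```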