xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Apache License 2.0

Discrepancy in training data versions #107

Open vaibhavad opened 5 months ago

vaibhavad commented 5 months ago

Thank you for the great work and for releasing the datasets and models. I downloaded the MEDI dataset a few months ago, and the dataset in that file contained 1435000 samples.

When I download it today, the dataset size is 1240000, which is 195000 fewer samples.

What is the difference between these two versions? Have some samples been discarded? If so, were they from any specific dataset? Have any new samples been added?
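
For anyone who has both downloads, here is a minimal sketch of how the two versions could be compared per source dataset. It assumes each file is a JSON list of example dicts and that each example carries a `task_name` field identifying its source; both are assumptions about the file layout, not something confirmed in this thread, and the helper names are hypothetical.

```python
import json
from collections import Counter


def load_records(path):
    """Load a JSON file assumed to hold a list of example dicts."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_by_task(records, key="task_name"):
    """Count examples per source; `key` is an assumed field name."""
    return Counter(r.get(key, "<missing>") for r in records)


def diff_counts(old_records, new_records, key="task_name"):
    """Return {task: new_count - old_count} for tasks whose count changed."""
    old_counts = count_by_task(old_records, key)
    new_counts = count_by_task(new_records, key)
    diff = {}
    for task in sorted(set(old_counts) | set(new_counts)):
        delta = new_counts[task] - old_counts[task]
        if delta != 0:
            diff[task] = delta
    return diff


# Usage with hypothetical file names for the two downloads:
#   old = load_records("medi-data-old.json")
#   new = load_records("medi-data-new.json")
#   for task, delta in diff_counts(old, new).items():
#       print(f"{task}: {delta:+d}")
```

A negative value means that source lost examples in the newer file, and a positive value means examples were added. This only compares counts, so it would not detect samples that were replaced one for one.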