xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings

Data & training details #46

Closed jordane95 closed 11 months ago

jordane95 commented 1 year ago

Hi, awesome work on text embeddings!

After reading your paper and the code, I have a few questions.

As stated in the paper (Section 2.3, data construction):

Following Ni et al. (2021), we use four negative pairs (hard or in-batch negatives) during the model finetuning process

But in the data downloaded from the link in your repo, each training instance from each task is accompanied by exactly 1 positive and 1 negative. Since some datasets from embedding-training-data do not contain negatives, I'm wondering how the negatives are sampled: randomly, or in the same way as for the super-NI datasets? Also, the data construction code for the 300 datasets from super-NI is missing.
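For reference, each instance in the downloaded file has roughly this shape (the values below are illustrative, not copied from the file):

```python
# Paraphrased shape of one MEDI training instance: each field pairs an
# instruction with the text it prefixes. Strings and task_id are made up.
instance = {
    "query": ["Represent the question for retrieving supporting documents: ",
              "what is the capital of France?"],
    "pos":   ["Represent the document for retrieval: ",
              "Paris is the capital and most populous city of France."],
    "neg":   ["Represent the document for retrieval: ",
              "Lyon is a major city in the Auvergne-Rhone-Alpes region."],
    "task_id": 0,
}
```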

In addition, I think the current checkpoint is different from the first released one, since it is trained with hard negatives. However, the details of how the hard negatives are sampled are missing...

Finally, according to the paper, many tasks are subsampled to balance the datasets. Would you mind sharing the full data for each data source, with all the hard negatives? Thanks.

hongjin-su commented 1 year ago

Hi, thanks a lot for your interest in INSTRUCTOR!

In the MEDI data, we have one positive and one hard negative for each instance. During training, each instance uses its own negative as the hard negative, and the other instances' negatives in the same batch as in-batch negatives. We describe the construction of the MEDI data in Section 2.3 of our paper, including the mechanism of hard-negative mining.
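For concreteness, here is a minimal PyTorch sketch of how one hard negative per instance combines with in-batch negatives in a contrastive loss. This is an illustration under assumed names and shapes (function name, temperature value, batch size), not the repository's actual training code:

```python
# Minimal sketch of a contrastive loss with one hard negative per instance
# plus in-batch negatives. Assumes embeddings are already computed and
# L2-normalized; not the actual INSTRUCTOR training code.
import torch
import torch.nn.functional as F

def contrastive_loss(q, pos, neg, temperature=0.05):
    """q, pos, neg: (B, D) query / positive / hard-negative embeddings.

    Each query is scored against all B positives (its own is the target;
    the others act as in-batch negatives) and all B hard negatives (its
    own row is its hard negative; the other rows are in-batch negatives
    drawn from other instances' negatives).
    """
    # Similarity of every query to every positive and every negative: (B, 2B)
    logits = torch.cat([q @ pos.T, q @ neg.T], dim=1) / temperature
    # The correct "class" for query i is its own positive, i.e. column i.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Usage with random embeddings standing in for encoder outputs:
B, D = 4, 768
q = F.normalize(torch.randn(B, D), dim=-1)
pos = F.normalize(torch.randn(B, D), dim=-1)
neg = F.normalize(torch.randn(B, D), dim=-1)
print(contrastive_loss(q, pos, neg))
```

With a batch of size B, each query therefore sees 1 positive, 1 hard negative, and 2(B - 1) in-batch negatives.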

In Section 2.3 and Table 5 of our paper, we have also listed all the data sources used to construct the MEDI data.

Feel free to add any further questions or comments!

hongjin-su commented 11 months ago

Please re-open the issue if you have any questions or comments!