xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Apache License 2.0

How is the training data divided? #87

Open wsa-dhu opened 9 months ago

wsa-dhu commented 9 months ago

Hello, I'm very interested in your work, and I'm currently attempting to train a general sentence representation model. I have a question: When my training dataset comes from different domains, how can I ensure that samples within the same batch belong to the same task during the training process? Would it be better to include samples from different tasks within the same batch during training? I'm not sure about my assumption. Could you provide insights based on your experience?

hongjin-su commented 9 months ago

Hi, thanks a lot for your interest in INSTRUCTOR!

You can arrange the examples in a sequence such that, after they are divided into batches, all examples in the same batch come from the same task. Since we use in-batch negative sampling, it is better to provide meaningful negative instances from the same task.
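
For concreteness, here is a minimal sketch of that arrangement. It assumes each training example is a dict carrying a `task_id` field (the field name, and feeding the result to a non-shuffling data loader, are assumptions for illustration, not the repository's actual training code):

```python
import random
from collections import defaultdict

def arrange_by_task(examples, batch_size, seed=42):
    """Order examples so that every consecutive slice of `batch_size`
    contains examples from a single task (for task-homogeneous in-batch
    negatives)."""
    rng = random.Random(seed)

    # Group examples by task.
    by_task = defaultdict(list)
    for ex in examples:
        by_task[ex["task_id"]].append(ex)

    # Build task-homogeneous batches, dropping the ragged tail of each
    # task so no batch mixes tasks.
    batches = []
    for task_examples in by_task.values():
        rng.shuffle(task_examples)
        for i in range(0, len(task_examples) - batch_size + 1, batch_size):
            batches.append(task_examples[i:i + batch_size])

    # Shuffle the order of batches, then flatten back into one sequence
    # that a data loader with shuffling disabled can consume in order.
    rng.shuffle(batches)
    return [ex for batch in batches for ex in batch]
```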

wsa-dhu commented 8 months ago

Hello author, I am very interested in your work on INSTRUCTOR. I would like to ask about the task_id field in your training dataset: which datasets do these ids correspond to? Working out the correspondence on my own would take more time. By the way, I found that there are only 329 task_ids, with id 302 missing, which does not match the 330 datasets reported in the paper. Looking forward to your reply, and I wish you success with your work.
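
(For anyone wanting to reproduce this check, a small sketch along these lines enumerates the ids; the `medi-data.json` filename, the `task_id` field, and the assumption that ids are consecutive integers are taken from this thread rather than from a documented schema.)

```python
import json

# Load the training data, assumed to be a JSON list of example dicts.
with open("medi-data.json") as f:
    examples = json.load(f)

# Collect the distinct task ids and report any gaps in the numbering
# (e.g. the missing id 302 mentioned above).
task_ids = sorted({ex["task_id"] for ex in examples})
print(f"{len(task_ids)} distinct task_ids")

missing = sorted(set(range(min(task_ids), max(task_ids) + 1)) - set(task_ids))
print("missing ids:", missing)
```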

hongjin-su commented 6 months ago

We found that we may have missed a task_id when we uploaded the dataset. We plan to fix it in the next version.

robro612 commented 1 month ago

Is there any update on where to find the meaning of these task_ids?

Edit: Sorry, I was looking at the dataset link on the paper website, which appears to be stale; the link in the README has the actual dataset names in the id.