xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Apache License 2.0

Comparative Performance Analysis: Single Dataset Fine-Tuning Versus Multi-Dataset Instruction-Based Fine-Tuning on Task A #98

Closed sunzhaoyang1 closed 6 months ago

sunzhaoyang1 commented 7 months ago

Which would perform better on task A: fine-tuning only on dataset A, or instruction-based fine-tuning across multiple datasets (tasks)?

hongjin-su commented 7 months ago

Thanks a lot for your interest in INSTRUCTOR!

The best choice depends on the characteristics of task A. If all queries/documents in task A come from the same domain and follow a uniform format, then intuitively, fine-tuning on dataset A alone should give the best performance. However, if the queries/documents in task A are diverse, as in MS MARCO, instruction-based fine-tuning across multiple datasets is the better choice, since it helps the model adapt to different scenarios, domains, etc.
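For concreteness, instruction-based embedding with this repo follows the README's usage, where each input is an [instruction, text] pair (the model name and instruction below are the README's example):

```python
from InstructorEmbedding import INSTRUCTOR

# Load the multi-dataset, instruction-finetuned checkpoint.
model = INSTRUCTOR('hkunlp/instructor-large')

# Each input is an [instruction, text] pair; the instruction tells the
# model which task/domain the embedding is for.
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"

embeddings = model.encode([[instruction, sentence]])
print(embeddings.shape)  # e.g. (1, 768) for instructor-large
```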

A third option, useful when dataset A is small, is to first apply instruction-based fine-tuning across multiple datasets and then continue fine-tuning on dataset A (also with instructions, to keep training consistent). This way the model first learns general embedding capabilities and then specializes to the task/domain of dataset A.
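A minimal sketch of the second stage, assuming a standard sentence-transformers training loop (INSTRUCTOR subclasses SentenceTransformer). The instruction string, placeholder pairs, and hyperparameters are illustrative, instructions are simply prepended to the text for simplicity, and the repo's own training script remains the authoritative recipe:

```python
from sentence_transformers import InputExample, losses
from torch.utils.data import DataLoader
from InstructorEmbedding import INSTRUCTOR

# Stage 1 is assumed done: start from a checkpoint already
# instruction-finetuned across many datasets.
model = INSTRUCTOR('hkunlp/instructor-large')

# Stage 2: continue fine-tuning on dataset A, keeping the instruction format.
# The instruction and (query, positive document) pairs below are hypothetical
# placeholders for dataset A; the repo's train.py handles instruction/text
# pairs natively, whereas here they are concatenated as plain prefixes.
instruction = "Represent the domain-A query for retrieving relevant documents: "
pairs = [
    ("how do I reset my password", "To reset your password, open Settings..."),
    # ... more (query, positive document) pairs from dataset A
]
train_examples = [
    InputExample(texts=[instruction + q, instruction + d]) for q, d in pairs
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# In-batch negatives, matching the paper's contrastive objective.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=100)
```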

Hope this helps!

sunzhaoyang1 commented 7 months ago

Thank you very much for your advice and insights; they are greatly appreciated!

hongjin-su commented 6 months ago

Feel free to re-open the issue if you have any questions or comments!