xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Apache License 2.0

cannot reproduce the results of INSTRUCTOR. #84

Closed qiuwenbogdut closed 9 months ago

qiuwenbogdut commented 1 year ago

Using the official open-source weights hkunlp/instructor-large from Hugging Face to evaluate the retrieval task ArguAna, the ndcg_at_10 metric reaches 0.57.
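For reference, ndcg_at_10 with binary relevance (as in ArguAna) can be sketched like this; this is an illustrative stdlib implementation, not the evaluation code actually used:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked results."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# One relevant document ranked first -> perfect score of 1.0
print(ndcg_at_k([1, 0, 0, 0]))  # 1.0
# The same document ranked third is discounted by log2(3 + 1) = 2
print(ndcg_at_k([0, 0, 1, 0]))  # 0.5
```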

Next, I used the following command for training:

python train.py \
    --model_name_or_path "/sentence-transformers/gtr-t5-large" \
    --output_dir "/train_output" \
    --cache_dir "/medi-data" \
    --max_source_length 512 \
    --num_train_epochs 1 \
    --save_steps 500 \
    --cl_temperature 0.01 \
    --warmup_ratio 0.1 \
    --learning_rate 2e-5 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 2 \
    --preprocessing_num_workers 50 \
    --dataloader_num_workers 50
The training parameters are the same as described in the paper, with a batch_size of 4 and training for 20k steps. The resulting model's performance is as follows:

step        1k      20k     54k
ndcg_at_10  0.52    0.4957  0.47
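The batch size of 4 follows from the flags above; a quick sanity check (assuming a single GPU, which is not stated in the issue):

```python
# Effective batch size = per-device batch * gradient accumulation * GPU count
per_device_train_batch_size = 2
gradient_accumulation_steps = 2
num_gpus = 1  # assumption; multiply by the actual GPU count if different

effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_gpus)
print(effective_batch_size)  # 4
```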

I have not been able to reproduce the target result: ndcg_at_10=0.57

Other people have also encountered similar issues: https://github.com/xlang-ai/instructor-embedding/issues/42

How can I train the model to reproduce the results of the paper? Can you help me with this, please?

hongjin-su commented 1 year ago

You could try a temperature of 0.1 and set num_train_epochs=10 (this may affect the warmup schedule).

hongjin-su commented 9 months ago

Feel free to re-open the issue if you have any further questions or comments!