xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Apache License 2.0

Replication of Instructor #42

Open aamir-s18 opened 1 year ago

aamir-s18 commented 1 year ago

Hey, we are currently trying to replicate the INSTRUCTOR model. Issue #14 already asks about this, but could you please report the exact training setup for the released models?

Also, I am interested in the training loss of your model. I could not reproduce your reported results by training for 100k steps, and it is unclear to me how you used just 40k steps when the paper says the model was trained on the MEDI dataset.

I would appreciate your help here :)

hongjin-su commented 1 year ago

Hi, thanks a lot for your interest in the INSTRUCTOR model!

As the MEDI dataset contains a large volume of data, there is no need to complete training on all of it. In fact, since some sources in MEDI may contain similar data, there may be an overfitting problem if training goes up to 100k steps.

For your reference, we use the following command in the training:

python train.py \
  --model_name_or_path sentence-transformers/gtr-t5-large \
  --output_dir {output_directory} \
  --cache_dir {cache_directory} \
  --max_source_length 512 \
  --num_train_epochs 10 \
  --save_steps 500 \
  --cl_temperature 0.01 \
  --warmup_ratio 0.1 \
  --learning_rate 2e-5 \
  --overwrite_output_dir

Feel free to add any further questions or comments!

aamir-s18 commented 1 year ago

Hey,

But for your published models, what data exactly did you train them on?

Also, the loss and batch size are missing from your report. If you say 40k steps, for example, the number of samples seen differs a lot depending on the batch size. It would be great if you could report the exact training setup so others can replicate and verify your work.

Thanks!

hongjin-su commented 1 year ago

Hi, we train the model on the MEDI data, which you can download from https://drive.google.com/file/d/1vZ5c2oJNonGOvXzppNg5mHz24O6jcc52/view?usp=sharing. In our setting, we use a batch size of 4.
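
For anyone starting from that link, a quick sanity check of the download before launching train.py can save time. Below is a minimal sketch; it assumes the archive unzips to a single JSON file, and the exact filename and record schema should be checked against the repository README (the path below is only a placeholder):

import json

# Placeholder path: point this at wherever the Google Drive archive was unzipped.
MEDI_PATH = "medi-data/medi-data.json"

with open(MEDI_PATH) as f:
    data = json.load(f)

print(f"{len(data)} training instances loaded")
print("Fields of the first instance:", sorted(data[0].keys()))
print(data[0])  # inspect one raw instance before starting training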

aamir-s18 commented 1 year ago

Hey,

Could you please report the loss as well? So that means you only train on 4 * 40k data samples of the MEDI dataset, for 1 epoch?

hongjin-su commented 1 year ago

Hi,

  1. The training loss is generally between 0.4 and 0.5 for all three models.
  2. Yes. The MEDI data contains abundant sources, and some of them may be similar, so there is no need to finish training on all of the data.
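
For readers comparing their own loss curves against the 0.4-0.5 range above: the objective configured by --cl_temperature 0.01 in the command earlier is a temperature-scaled contrastive loss. The snippet below is my own minimal sketch of such an objective (cosine similarity over in-batch candidates), not the repository's train.py, so treat it only as an illustration of what the number refers to:

import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, cand_emb, temperature=0.01):
    # query_emb: (B, d) query embeddings
    # cand_emb:  (B, d) candidate embeddings; row i is the positive for query i,
    #            and the remaining rows act as in-batch negatives.
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(cand_emb, dim=-1)
    logits = q @ c.T / temperature          # (B, B) scaled cosine similarities
    labels = torch.arange(q.size(0))        # the positive for query i sits at column i
    return F.cross_entropy(logits, labels)

# Example with batch size 4 and 768-dimensional embeddings:
print(contrastive_loss(torch.randn(4, 768), torch.randn(4, 768)).item())
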
yangjianxin1 commented 1 year ago

A batch size of 4 is very small for contrastive learning; maybe it should be larger, such as 32 or 64?

hongjin-su commented 1 year ago

Yes, the model would probably be better with a larger training batch size. However, due to machine limitations, we leave further scaling to future work!

iavinasoss commented 1 year ago

> Hi, we train the model on the MEDI data, which you can download from https://drive.google.com/file/d/1vZ5c2oJNonGOvXzppNg5mHz24O6jcc52/view?usp=sharing. In our setting, we use a batch size of 4.

Hey, I had a small question.

Where can we change the batch size? I can't find an argument for it.

Thanks

hongjin-su commented 1 year ago

Hi, you may change the batch size via the argument per_device_train_batch_size.
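
For example, extending the command quoted earlier (per_device_train_batch_size is the standard Hugging Face TrainingArguments flag; 4 matches the setting reported above, and larger values need correspondingly more GPU memory):

python train.py \
  --model_name_or_path sentence-transformers/gtr-t5-large \
  --output_dir {output_directory} \
  --cache_dir {cache_directory} \
  --max_source_length 512 \
  --num_train_epochs 10 \
  --save_steps 500 \
  --cl_temperature 0.01 \
  --warmup_ratio 0.1 \
  --learning_rate 2e-5 \
  --per_device_train_batch_size 4 \
  --overwrite_output_dir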

iavinasoss commented 1 year ago

Got it, thank you for the help.

YihanWang617 commented 1 year ago

Hi, I am also trying to replicate your work. May I know how many GPUs you used in training?

hongjin-su commented 1 year ago

Hi, we use only a single GPU for training.
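
Putting the numbers from this thread together: a per-device batch size of 4 on a single GPU for roughly 40k steps works out to about 4 * 40,000 = 160,000 training examples, i.e. only a fraction of the full MEDI dataset, which is consistent with the earlier answer that training does not need to cover all of the data.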

EliverQ commented 1 year ago

> Hey, we are currently trying to replicate the INSTRUCTOR model. Issue #14 already asks about this, but could you please report the exact training setup for the released models?
>
> Also, I am interested in the training loss of your model. I could not reproduce your reported results by training for 100k steps, and it is unclear to me how you used just 40k steps when the paper says the model was trained on the MEDI dataset.
>
> I would appreciate your help here :)

Hey! I also encountered issues with reproducing the results. Have you successfully replicated INSTRUCTOR's performance? Even with the exact same settings, I couldn't reproduce it. If you have succeeded, could you please give me some advice? Thank you very much.

aamir-s18 commented 1 year ago

@EliverQ, could you hit me up via email: aamir.shakir [at] epfl.ch

YihanWang617 commented 1 year ago

Hi, I have the same issue and cannot replicate the results reported in the paper. Could the authors provide the exact training commands for the released checkpoints?