xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Apache License 2.0

fine-tune HKUNLP/instructor-embedding #74

Open · Atlantic8 opened this issue 1 year ago

Atlantic8 commented 1 year ago

Can we fine-tune using train.py starting from the released model hkunlp/instructor-xl? If yes, could you please share the shell script for training? Thanks.
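No launch command is shown in this thread, so the following is only a rough sketch of how such a run might be assembled. Only --fp16, --gradient_accumulation_steps and --per_device_train_batch_size are flags confirmed later in the thread; --model_name_or_path, the output directory and the chosen values are assumptions to check against python train.py --help and the repository README.

```python
import subprocess

# Hedged sketch of a fine-tuning launch starting from the released checkpoint.
# Only --fp16, --gradient_accumulation_steps and --per_device_train_batch_size
# are taken from this thread; every other flag and value here is an assumption.
subprocess.run(
    [
        "python", "train.py",
        "--model_name_or_path", "hkunlp/instructor-xl",  # released model to start from
        "--output_dir", "./instructor-xl-finetuned",     # hypothetical output path
        "--per_device_train_batch_size", "2",            # small batch reported to fit a 40GB A100
        "--gradient_accumulation_steps", "3",
        "--fp16", "True",
    ],
    check=True,
)
```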

Atlantic8 commented 1 year ago

I only have training data in the format sentence1, sentence2, label, so I cannot construct training data in the format query=xxx, pos=[], neg=[].

Atlantic8 commented 1 year ago

Also, when I try to train using train.py with "--fp16 True --gradient_accumulation_steps 3", I run out of GPU memory. I was using an A100 40G. Why does training this model take so much GPU memory? Could you tell me the GPU hardware you used to train this model?

Atlantic8 commented 1 year ago

Btw, this model can be trained only when per_device_train_batch_size is set to 2.
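To make the memory discussion concrete, here is a small illustration (not from the thread) of how these settings combine: peak GPU memory is driven mainly by per_device_train_batch_size, the sequence length and the xl model's parameters, while gradient accumulation raises the effective batch size without raising memory. The single-GPU count below is an assumption.

```python
# Illustration only: effective batch size from the settings discussed above.
# per_device_train_batch_size drives peak GPU memory; accumulation steps add
# training time, not memory. num_gpus = 1 is an assumption.
per_device_train_batch_size = 2
gradient_accumulation_steps = 3
num_gpus = 1

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 6
```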

ashokrajab commented 1 year ago

could you tell me the GPU hardware you used to train this model?

@Atlantic8, this is an excerpt from the paper:

We use the maximum batch size that fits the machine memory and run all our experiments on 40GB A100 GPUs.

taziksh commented 1 year ago

Btw, this model can be trained only when per_device_train_batch_size is set to 2.

What's your source for this? @Atlantic8

hongjin-su commented 11 months ago

Hi, thanks a lot for your interest in INSTRUCTOR!

  1. As the INSTRUCTOR model follows the same architecture as GTR models, the same training script should be applicable.
  2. If you have only paired sentences (I assume that they are positive pairs, e.g., question and answer), then using random negatives is probably the easiest way to construct the training data.
  3. For the xl model, the maximum length, gradient accumulation steps and batch size will depend on your machine.

Hope this helps!

EricPaul03 commented 6 months ago

Hi, thanks a lot for your interest in INSTRUCTOR!

  1. As the INSTRUCTOR model follows the same architecture as GTR models, the same training script should be applicable.
  2. If you have only paired sentences (I assume that they are positive pairs, e.g., question and answer), then using random negatives is probably the easiest way to construct the training data.
  3. For the xl model, the maximum length, gradient accumulation steps and batch size will depend on your machine.

Hope this helps!

So for custom data, do we need to construct examples in the query=xxx, pos=[], neg=[] format (using random negatives) before running training?
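For what it's worth, here is a minimal sketch of such a conversion, assuming the label marks positive pairs and that random sentences from the corpus are acceptable negatives. The exact schema train.py expects (for instance, whether each of query/pos/neg is an [instruction, text] pair as in the released MEDI data) is an assumption to verify against the repository's sample training file.

```python
import json
import random

# Toy rows in the (sentence1, sentence2, label) format; label 1 = positive pair.
rows = [
    ("how do I reset my password?", "click 'forgot password' on the login page", 1),
    ("what is the capital of France?", "Paris is the capital of France", 1),
    ("what is the capital of France?", "bananas are rich in potassium", 0),
]

candidates = [s2 for _, s2, _ in rows]  # pool to draw random negatives from

examples = []
for s1, s2, label in rows:
    if label != 1:
        continue  # only positive pairs become training examples
    neg = random.choice([c for c in candidates if c != s2])  # random negative
    examples.append({"query": s1, "pos": [s2], "neg": [neg]})

with open("train_data.json", "w") as f:
    json.dump(examples, f, indent=2)
```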