How would you use instructor with DNA sequences?

xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings

Apache License 2.0

How would you use instructor with DNA sequences? #43

Closed lborcard closed 9 months ago

lborcard commented 1 year ago

Thank you for your work,

I was wondering whether it would be possible to train INSTRUCTOR to embed DNA sequences for clustering or classification.

Best, Loïc

hongjin-su commented 1 year ago

Hi, thanks a lot for your interest in the INSTRUCTOR model!

I think it is possible to further fine-tune the INSTRUCTOR model on DNA sequences. Instructions for preparing the data and training the model are here: https://github.com/HKUNLP/instructor-embedding#training

Feel free to add any further questions or comments!

hongjin-su commented 1 year ago

Please re-open the issue if you have any questions or comments.

lborcard commented 1 year ago

Dear Hongjin,

As mentioned above, I am trying to fine-tune on DNA sequences. It seems that the train.py script uses medi.json even when another JSON file is passed as --train_file. Do I need to train the model from scratch before doing any other kind of fine-tuning? Another question: what kind of prompt should I use to classify (bin) sequences?

lborcard commented 1 year ago

Also, should I use a different tokenizer for DNA sequences?

hongjin-su commented 1 year ago

Hi, thanks a lot for your interest in INSTRUCTOR!

By default, INSTRUCTOR was trained with medi-data.json. You probably don't need to train from scratch for domain-specific fine-tuning. To classify sequences, you could use a prompt like "Represent the biological sentence for classification:".

I don't think a different tokenizer is needed for DNA sequences.
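
As a rough illustration of the prompt suggested above, the sketch below pairs each DNA sequence with the instruction in the [instruction, text] form that the INSTRUCTOR encoder takes. The helper names build_pairs and embed are hypothetical, the checkpoint name hkunlp/instructor-large is an assumption, and the prompt wording is only the starting point proposed in this thread, not a tuned choice.

```python
from typing import List

# Instruction proposed in the thread; treat the wording as a starting point.
INSTRUCTION = "Represent the biological sentence for classification:"


def build_pairs(sequences: List[str], instruction: str = INSTRUCTION) -> List[List[str]]:
    """Pair each sequence with the instruction as [instruction, text]."""
    # Surrounding whitespace (e.g. trailing newlines from a FASTA line) is stripped.
    return [[instruction, seq.strip()] for seq in sequences]


def embed(sequences: List[str], model_name: str = "hkunlp/instructor-large"):
    """Hypothetical helper: encode sequences with a pretrained INSTRUCTOR model.

    The import is done lazily so build_pairs works without the package installed.
    The checkpoint name is an assumption; swap in a fine-tuned model if you have one.
    """
    from InstructorEmbedding import INSTRUCTOR

    model = INSTRUCTOR(model_name)
    return model.encode(build_pairs(sequences))


# Made-up toy sequences, only to show the input shape.
pairs = build_pairs(["ACGTACGT", "TTGACA\n"])
```

Calling embed on a list of sequences should return one embedding per sequence, which can then be passed to any standard clustering or classification routine.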

lborcard commented 1 year ago

But the --cache_dir option looks for medi.json by default. Is there another option for fine-tuning with a custom dataset?

hongjin-su commented 1 year ago

In this case, you can prepare your own dataset in this format, name it medi-data.json, and place it in the directory passed as --cache_dir.
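
A minimal sketch of that workaround follows, assuming each training example is a JSON object with a query, a positive, and a negative entry (each an [instruction, text] pair) plus a task label. The field names query, pos, neg and task_name are assumptions modeled on the repository's description of the training data and should be checked against the real medi-data.json and train.py before use. The helper names and the toy sequences are made up for illustration.

```python
import json
import os
import tempfile

INSTRUCTION = "Represent the biological sentence for classification:"


def make_example(query, positive, negative, task_name, instruction=INSTRUCTION):
    """Build one training example (field names are assumptions, see above)."""
    return {
        "query": [instruction, query],
        "pos": [instruction, positive],   # sequence that should embed close to the query
        "neg": [instruction, negative],   # sequence that should embed far from the query
        "task_name": task_name,
    }


def write_dataset(examples, cache_dir):
    """Write the examples as medi-data.json inside cache_dir and return the path."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, "medi-data.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(examples, f)
    return path


# Placeholder sequences and task label, not real data.
examples = [make_example("ACGTACGTAA", "ACGTACGTTA", "GGGCCCGGGC", "dna_binning")]
out_path = write_dataset(examples, tempfile.mkdtemp())
```

The directory containing the resulting file is then the one to pass as --cache_dir when running train.py.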

hongjin-su commented 9 months ago

Feel free to re-open this issue if you have any further questions or comments!