lborcard closed this issue 9 months ago
Hi, thanks a lot for your interest in the INSTRUCTOR model!
I think it is possible to further fine-tune the INSTRUCTOR model on DNA sequences. The instructions for preparing the data and training the model are here: https://github.com/HKUNLP/instructor-embedding#training
Feel free to add any further questions or comments!
Please re-open the issue if you have any questions or comments.
Dear Hongjin,
As mentioned above, I am trying to fine-tune with DNA sequences. It seems that the train.py script uses medi.json even if you give another JSON file as --train_file.
Do I need to train it from scratch before I do any other type of fine-tuning? Another question: what kind of prompt would I need to use in order to classify (bin) sequences?
Otherwise, should I run another tokenizer if I use DNA sequences?
Hi, thanks a lot for your interest in INSTRUCTOR!
By default, INSTRUCTOR was trained with medi-data.json. You probably don't need to train from scratch for domain-specific fine-tuning. To classify sequences, you may use a prompt like "Represent the biological sentence for classification:".
I don't think we need another tokenizer for DNA sequences.
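To make the prompt suggestion concrete, here is a minimal sketch of how DNA sequences could be paired with that instruction before encoding. It assumes the InstructorEmbedding package and the public hkunlp/instructor-large checkpoint; the helper names build_inputs and embed are hypothetical, and the example sequences are made up.

```python
# Instruction quoted from the reply above.
INSTRUCTION = "Represent the biological sentence for classification:"


def build_inputs(sequences, instruction=INSTRUCTION):
    """Pair every sequence with the instruction as [instruction, text]."""
    return [[instruction, seq] for seq in sequences]


def embed(sequences, model_name="hkunlp/instructor-large"):
    """Encode sequences with INSTRUCTOR (downloads the model on first use).

    Assumption: the InstructorEmbedding package is installed; the import is
    done lazily so build_inputs can be used without it.
    """
    from InstructorEmbedding import INSTRUCTOR

    model = INSTRUCTOR(model_name)
    return model.encode(build_inputs(sequences))


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    print(build_inputs(["ACGTACGT", "TTGACCGA"]))
```

The resulting embeddings could then be passed to any standard classifier or clustering method for binning; whether the pretrained model separates DNA sequences well without fine-tuning is not something this thread establishes.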
But the option --cache_dir will look for medi.json by default. Is there another option to fine-tune with a custom dataset?
In this case, you may just prepare your own dataset using this format and name it medi-data.json.
Feel free to re-open this issue if you have any further questions or comments!
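As a rough sketch of the workaround above, the snippet below builds a small custom dataset from labeled DNA sequences and writes it as medi-data.json into a directory that could be passed as --cache_dir. The field names (query, pos, neg, task_name) are an assumption about the training format and should be checked against the repository README; the helper names, the task name dna_binning, and the sampling strategy are hypothetical.

```python
import json
import random
from pathlib import Path

INSTRUCTION = "Represent the biological sentence for classification:"


def build_triplets(labeled, seed=0):
    """Turn (sequence, label) pairs into query/positive/negative examples.

    The positive shares the query's label and the negative has a different
    label. Assumption: the trainer expects [instruction, text] pairs under
    the keys used below.
    """
    rng = random.Random(seed)
    by_label = {}
    for seq, label in labeled:
        by_label.setdefault(label, []).append(seq)

    examples = []
    for seq, label in labeled:
        positives = [s for s in by_label[label] if s != seq]
        negatives = [s for other, seqs in by_label.items()
                     if other != label for s in seqs]
        if not positives or not negatives:
            continue  # skip sequences that cannot form a full triplet
        examples.append({
            "query": [INSTRUCTION, seq],
            "pos": [INSTRUCTION, rng.choice(positives)],
            "neg": [INSTRUCTION, rng.choice(negatives)],
            "task_name": "dna_binning",  # hypothetical task name
        })
    return examples


def write_dataset(examples, cache_dir):
    """Write the examples to <cache_dir>/medi-data.json and return the path."""
    out = Path(cache_dir) / "medi-data.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(examples))
    return out
```

For example, write_dataset(build_triplets(labeled), "my_cache") would produce my_cache/medi-data.json, where my_cache is a placeholder directory name.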
Thank you for your work,
I was wondering if it would be possible to train INSTRUCTOR to embed DNA sequences for clustering/classification.
best, Loïc