Closed FBosler closed 1 year ago
Hi, thanks a lot for your interest in the INSTRUCTOR model!
The INSTRUCTOR model has only been trained on English texts, so it may not currently support other languages.
Feel free to add any further questions or comments!
For non-English use, would the process be to translate the entire dataset into the required language, or are further modifications needed?
Hi, thanks a lot for your interest in the INSTRUCTOR model!
No further modifications are needed for non-English texts. The tokenizer first converts the non-English text into numerical token IDs. For example:
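from transformers import AutoTokenizer
# Load the tokenizer that ships with the INSTRUCTOR checkpoint
tokenizer = AutoTokenizer.from_pretrained('hkunlp/instructor-large')
s = '我很快乐'  # Chinese for "I am very happy"
# The tokenizer maps the text to a list of integer token IDs
print("numerical representations:", tokenizer(s)['input_ids'])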
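As a rough follow-up check, one could look at how many of those IDs are the tokenizer's unknown-token ID, since a high share would suggest the vocabulary does not cover the language well. This is only a sketch and not something described in this thread: the helper name `unknown_token_ratio` and its parameters are hypothetical, and it works on plain lists of integers so it does not depend on any particular model.

```python
def unknown_token_ratio(input_ids, unk_token_id, special_ids=()):
    """Return the fraction of non-special token IDs that equal unk_token_id.

    input_ids    -- list of integer token IDs, e.g. tokenizer(s)['input_ids']
    unk_token_id -- the ID the tokenizer uses for unknown tokens
    special_ids  -- IDs to ignore (e.g. end-of-sequence); must not contain unk_token_id
    """
    # Drop special tokens so they do not dilute the ratio
    content = [i for i in input_ids if i not in special_ids]
    if not content:
        return 0.0
    unknown = sum(1 for i in content if i == unk_token_id)
    return unknown / len(content)
```

With the tokenizer loaded in the snippet above, this could be called as `unknown_token_ratio(tokenizer(s)['input_ids'], tokenizer.unk_token_id, (tokenizer.eos_token_id,))`. A low ratio would not by itself mean the embeddings are meaningful, because the model was trained on English only.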
Feel free to add any further questions or comments!
Feel free to re-open the issue if you have any further questions or comments!
Is it possible to use INSTRUCTOR on languages other than English and get meaningful results?
Cheers