xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Apache License 2.0

Do other languages also work? #36

Closed FBosler closed 1 year ago

FBosler commented 1 year ago

Is it possible to use INSTRUCTOR on languages other than English and get meaningful results?

Cheers

hongjin-su commented 1 year ago

Hi, thanks a lot for your interest in the INSTRUCTOR model!

The INSTRUCTOR model has been trained only on English text, so it may not produce meaningful results for other languages at present.

Feel free to add any further questions or comments!

wilfoderek commented 1 year ago

Would the process for non-English text be to translate the whole dataset into the required language, or are any further modifications needed?

hongjin-su commented 1 year ago

Hi, thanks a lot for your interest in the INSTRUCTOR model!

No further modifications are needed for non-English text. The tokenizer first converts the non-English text into numerical token IDs. For example:

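from transformers import AutoTokenizer

# Load the tokenizer that ships with the INSTRUCTOR checkpoint.
tokenizer = AutoTokenizer.from_pretrained('hkunlp/instructor-large')
s = '我很快乐'  # Chinese for "I am very happy"

# input_ids holds the integer token IDs that the model receives.
print("numerical representations:", tokenizer(s)['input_ids'])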
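Getting token IDs back does not by itself show how well the vocabulary covers a given language. As a rough check, one could measure what share of the IDs is the tokenizer's unknown-token ID. The following is only a minimal sketch: the helper names `unknown_token_fraction` and `unknown_fraction_for_text` are hypothetical and not part of INSTRUCTOR or transformers, and treating a high unknown-token share as a sign of poor coverage is an assumption, not something stated by the model authors.

```python
from typing import Iterable, Sequence


def unknown_token_fraction(
    input_ids: Sequence[int],
    unk_id: int,
    special_ids: Iterable[int] = (),
) -> float:
    """Return the share of non-special tokens that equal the unknown-token ID."""
    # Ignore special tokens such as end-of-sequence, but always keep the
    # unknown token itself, since tokenizers often list it as special too.
    skip = set(special_ids) - {unk_id}
    content = [i for i in input_ids if i not in skip]
    if not content:
        return 0.0
    return sum(1 for i in content if i == unk_id) / len(content)


def unknown_fraction_for_text(
    text: str, model_name: str = "hkunlp/instructor-large"
) -> float:
    """Tokenize `text` and report its unknown-token share (hypothetical helper).

    Requires the `transformers` package and downloads the tokenizer on first use.
    """
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    ids = tokenizer(text)["input_ids"]
    return unknown_token_fraction(
        ids, tokenizer.unk_token_id, tokenizer.all_special_ids
    )
```

Calling `unknown_fraction_for_text('我很快乐')` would then give a value between 0 and 1. A value near 1 would suggest that most of the input is not represented in the vocabulary, which is consistent with the earlier note that the model was trained only on English text.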

Feel free to add any further questions or comments!

hongjin-su commented 1 year ago

Feel free to re-open the issue if you have any further questions or comments!