yaushian / mSimCSE

mSimCSE: Multilingual SimCSE
MIT License

if fine-tuning is supported #2

Closed Lynnjl closed 1 year ago

Lynnjl commented 1 year ago

Is it feasible to fine-tune the model on our own internal data?

yaushian commented 1 year ago

It's feasible to finetune our model on your own data as long as it follows the data format. The data format should be "sent0,sent1,hard_neg". An example is in "data/nli_for_simcse.csv".
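To illustrate the expected format, here is a minimal sketch of writing fine-tuning data as a CSV with a `sent0,sent1,hard_neg` header, mirroring `data/nli_for_simcse.csv`; the output filename and example sentences are made up for illustration.

```python
import csv

# Hypothetical fine-tuning triples: (anchor, positive paraphrase, hard negative).
# These example sentences are illustrative only.
rows = [
    ("A man is playing a guitar.",
     "Someone is playing a musical instrument.",
     "A man is repairing a guitar."),
    ("The weather is sunny today.",
     "It is a bright, sunny day.",
     "It rained all day today."),
]

# Write the CSV in the "sent0,sent1,hard_neg" format the training script expects.
with open("my_finetune_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["sent0", "sent1", "hard_neg"])  # header row
    writer.writerows(rows)
```

Each row pairs an anchor sentence with a semantically equivalent positive and a superficially similar but semantically different hard negative, which is what the contrastive objective trains against.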

Lynnjl commented 1 year ago

Thank you for your kind response. I have another question. The model seems to perform well on cross-lingual tasks such as Chinese vs. English, but when I once tried a multilingual model, it performed badly on semantic similarity within a single language. I wonder whether a single model can handle semantic similarity both within a language and across languages.

yaushian commented 1 year ago

Table 3 in the paper reports results both within a language and across languages, so a single model can serve both tasks. We believe that for languages without enough training data, within-language semantic similarity can still reach reasonably good performance thanks to cross-lingual transfer. For high-resource languages such as English, however, cross-lingual transfer is not needed to boost performance, and our model may not match a model trained only on the target language, such as the original SimCSE.