Closed Lynnjl closed 1 year ago
It's feasible to fine-tune our model on your own data as long as it follows the expected data format: each row should be "sent0,sent1,hard_neg". An example is in "data/nli_for_simcse.csv".
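For anyone preparing their own file, here is a minimal sketch of writing custom data in that "sent0,sent1,hard_neg" CSV layout. The file name "my_train_data.csv" and the example sentences are hypothetical placeholders; check "data/nli_for_simcse.csv" in the repo for the exact header convention.

```python
import csv

# Each row: (anchor sentence, a semantically similar sentence,
#            a hard negative that is lexically close but different in meaning).
rows = [
    ("A man is playing a guitar.",
     "Someone is playing an instrument.",
     "A man is repairing a guitar."),
    ("The weather is sunny today.",
     "It is a bright, clear day.",
     "It is raining heavily today."),
]

with open("my_train_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["sent0", "sent1", "hard_neg"])  # header row, mirroring the example CSV
    writer.writerows(rows)  # csv.writer quotes fields containing commas automatically
```

Using the csv module (rather than joining strings with commas by hand) matters because real sentences often contain commas, which must be quoted to keep the three-column structure intact.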
Thank you for your kind response. I have another question. The model seems to perform well on cross-lingual tasks such as Chinese vs. English. But when I tried a multilingual model before, it performed badly on semantic similarity within a single language. I wonder whether we can use one uniform model for semantic similarity both within a language and across languages.
Table 3 in the paper reports results both within a language and across languages, so a single model can handle both tasks. We believe that for languages without enough training data, within-language semantic similarity can still reach reasonably good performance thanks to cross-lingual transfer. However, high-resource languages such as English don't need cross-lingual transfer to boost performance, and our model may not match a model trained directly on the target language, such as the original SimCSE.
Is it feasible to fine-tune the model on our internal data?