webis-de / small-text

Active Learning for Text Classification in Python
https://small-text.readthedocs.io/
MIT License

Embeddings in EmbeddingKMeans and ContrastiveActiveLearning #13

Closed: kayuksel closed this issue 2 years ago

kayuksel commented 2 years ago

Hi! Do these strategies support embeddings from a language-agnostic model such as LaBSE or XLM-RoBERTa? (This is not covered in their papers.) Would it be possible to use embeddings that we previously extracted with those models? If so, how can we do that? I believe this could be crucial for the library, so that its use is not limited to English or to any specific encoder.

chschroeder commented 2 years ago

Hi @kayuksel, as for the general capabilities, I don't see any language or encoder limitations so far, and I fully share your opinion on this.

I am no expert on either of these models, but I assume you want to use the sentence-transformers model for LaBSE and the Hugging Face implementation of XLM-RoBERTa, is that correct?

The latter is easy: you could take the example notebook, and it should be enough to pass the transformer model name of the respective XLM model (e.g., xlm-roberta-large).
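
For reference, something like the following sketch is all I mean; it assumes the transformers integration from the example notebook (exact import paths and the factory signature may differ slightly depending on your small-text version):

```python
from small_text.integrations.transformers import TransformerModelArguments
from small_text.integrations.transformers.classifiers.factories import (
    TransformerBasedClassificationFactory
)

# Swapping in the multilingual checkpoint is the only change compared
# to the English-language example notebook.
transformer_model = TransformerModelArguments('xlm-roberta-large')

clf_factory = TransformerBasedClassificationFactory(
    transformer_model,
    num_classes=2,  # set this to the number of classes in your dataset
    kwargs={'device': 'cuda'}
)
```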

For the former, you would also just need slightly different preprocessing, which requires only a few lines of code. Other than that, it should work fine (because, as you correctly mentioned, everything is an embedding at this point). The same is true if you want to use pre-computed embeddings. Let me know if you encounter problems with this. Sentence-transformers will likely get better support in the coming months.
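
As a rough sketch of the pre-computed route: this assumes LaBSE via sentence-transformers, and that the embedding-based query strategies in your small-text version accept pre-computed embeddings through an `embeddings` keyword in `query()` (please check the signature of your installed version). `clf`, `dataset`, and the index arrays below are placeholders from the usual active learning loop.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from small_text.query_strategies import EmbeddingKMeans

# Compute language-agnostic sentence embeddings once, up front.
# `texts` holds your raw documents, in the same order as the dataset.
encoder = SentenceTransformer('sentence-transformers/LaBSE')
embeddings = np.asarray(encoder.encode(texts))

# Passing the embeddings directly lets the strategy skip extracting
# embeddings from the classifier; the same idea should apply to
# ContrastiveActiveLearning.
query_strategy = EmbeddingKMeans()
indices_queried = query_strategy.query(
    clf, dataset, indices_unlabeled, indices_labeled, y,
    n=10,
    embeddings=embeddings
)
```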

kayuksel commented 2 years ago

I believe the Hugging Face implementations would be sufficient.

P.S. I actually found a distilled version of XLM-RoBERTa-Large.

This currently seems to be the best language-agnostic option: MiniLMv2

chschroeder commented 2 years ago

Replacing the transformer model name in one of the demo notebooks with nreimers/mMiniLMv2-L12-H384-distilled-from-XLMR-Large already worked for me. The classification results, however, look considerably worse than before.

I also tried https://huggingface.co/microsoft/Multilingual-MiniLM-L12-H384, but this seems to be a special case that is not supported right now. It uses "BertModel with XLMRobertaTokenizer", and therefore (at least?) the AutoTokenizer does not work.
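
For completeness, loading that checkpoint directly with transformers does work if you pick the classes explicitly instead of the Auto* classes, which is what the model card suggests; the limitation here is that the library relies on the AutoTokenizer:

```python
from transformers import BertModel, XLMRobertaTokenizer

# This checkpoint pairs a BERT model architecture with the XLM-R
# tokenizer, so AutoTokenizer resolves the wrong tokenizer class.
tokenizer = XLMRobertaTokenizer.from_pretrained('microsoft/Multilingual-MiniLM-L12-H384')
model = BertModel.from_pretrained('microsoft/Multilingual-MiniLM-L12-H384')
```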