webis-de / small-text

Active Learning for Text Classification in Python
https://small-text.readthedocs.io/
MIT License

SEALS: Similarity Search for Efficient Active Learning and Search of Rare Concepts #12

Closed: kayuksel closed this issue 2 years ago

kayuksel commented 2 years ago

Hello, thank you for open-sourcing this project. I would like to suggest adding the following method to the library: "Similarity Search for Efficient Active Learning and Search of Rare Concepts" (https://arxiv.org/abs/2007.00077). It seems that it would fit well in this library, and it could also be combined with other methods.

Sincerely, Kamer

chschroeder commented 2 years ago

Hi, this looks really interesting, thank you! I will look into it.

chschroeder commented 2 years ago

Update: I have a working implementation by now and I really like this approach. It will be added to small-text soon.

Unfortunately, this also showed that the strategies which I wanted to combine with SEALS, mainly the ones inheriting from EmbeddingBasedQueryStrategy, are currently suboptimal and the performance gain does not carry over to them.

Before I can continue here, I first need to improve the EmbeddingBasedQueryStrategy class.

kayuksel commented 2 years ago

Amazing, thanks a lot! Looking forward.

chschroeder commented 2 years ago

@kayuksel SEALS is now available in the v1.0.0b4 release. I am always happy about any kind of feedback, in case you have a chance to use this on larger amounts of data.

There were some other preparations to get this properly incorporated, mainly because I didn't want the index library to be a hard requirement for the rest of the library.
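One common way to keep such a dependency optional is to import it lazily, only when the feature that needs it is constructed. The following is a minimal sketch of that pattern, not the actual small-text code: `require_optional` and `IndexBackedStrategy` are hypothetical names made up for illustration, and `hnswlib` is only assumed as the default module name.

```python
import importlib


def require_optional(module_name):
    """Import an optional dependency lazily and fail with a clear message if missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        raise ImportError(
            f"The optional dependency '{module_name}' is required for this feature."
        ) from e


class IndexBackedStrategy:
    """Hypothetical strategy that needs an index library only when it is created."""

    def __init__(self, index_module="hnswlib"):
        # The import happens here, not when the package itself is imported,
        # so users who never create this strategy do not need the library.
        self._index_lib = require_optional(index_module)
```

With this layout, importing the rest of the package works without the index library installed, and the error only surfaces when someone actually instantiates the index-backed strategy.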

For the index, contrary to the paper, I relied on hnsw instead of faiss. I have used faiss before, and while it is nice, it still can't be installed via pip alone, so I would need a conda environment or similar for the integration tests. This might change in the future, and one could even imagine offering an abstraction over indexes. The main blocker here is that we would need a small-text package for conda.
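For readers unfamiliar with the idea, here is a rough sketch of the candidate restriction that SEALS performs, as I understand it from the paper: only the nearest neighbours of already labeled examples are passed on to the base query strategy. This is not the small-text implementation. It uses a brute-force search in pure Python where the library uses an approximate index, and the function names (`seals_candidates`, `query`) and the per-example `scores` list are assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def seals_candidates(embeddings, labeled, k):
    """Return the unlabeled indices among the k nearest neighbours of each labeled example."""
    labeled_set = set(labeled)
    unlabeled = [i for i in range(len(embeddings)) if i not in labeled_set]
    candidates = set()
    for i in labeled:
        # Brute-force neighbour search; an approximate index would replace this step.
        nearest = sorted(
            unlabeled,
            key=lambda j: cosine_distance(embeddings[i], embeddings[j]),
        )[:k]
        candidates.update(nearest)
    return sorted(candidates)


def query(embeddings, labeled, scores, k, n):
    """Apply a score-based base strategy to the restricted pool only."""
    pool = seals_candidates(embeddings, labeled, k)
    # Higher score means "more worth labeling", e.g. an uncertainty score.
    return sorted(pool, key=lambda j: scores[j], reverse=True)[:n]
```

The point of the restriction is that the base strategy only has to score the small candidate pool instead of the whole unlabeled set, which is where the speed-up on large datasets comes from.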