webis-de / small-text

Active Learning for Text Classification in Python
https://small-text.readthedocs.io/
MIT License
547 stars 60 forks

Using active learning on already trained model #30

Closed etiennekintzler closed 1 year ago

etiennekintzler commented 1 year ago

Hello :)

I am trying to use the library with a transformer model that is already trained. Since the model is already trained, I should not need the initialize_data method; however, it seems to be required before calling the query method (otherwise an error is thrown).

To be more specific, let's say I have a model object (a multi-label model from Hugging Face) trained on data text_train and labels labels_train. I also have text_test data for which no labels are available. I would like to use active learning to select the best samples in text_test (according to a given query strategy) to be labelled by my users. How could I use the library to do this?

Thank you in advance for your help !

chschroeder commented 1 year ago

Hi @etiennekintzler! Two very valid questions that need to be included in the documentation.

1. Bypassing initialize_data

Unfortunately, this is awkward with the current API (but will be changed with version 2.0.0).

A solution is shown in #10. Let me know if this does not work for you. I will also add this to the docs eventually.

2. Creating an unlabeled dataset

For multi-label datasets:

If you create your dataset using TransformersDataset.from_arrays() then you just pass an empty list of labels (i.e. a csr_matrix which does not have any entries).

from small_text import TransformersDataset, list_to_csr

texts = ['this is my document', 'yet another document']

num_classes = ... # omitted
tokenizer = ... # omitted
target_labels = ... # omitted

y = list_to_csr([[], []], shape=(2, num_classes))

dataset = TransformersDataset.from_arrays(texts, y, tokenizer, target_labels=target_labels)
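For reference, the empty label matrix that list_to_csr builds above is simply a SciPy csr_matrix with no stored entries. A minimal standalone sketch (using an illustrative num_classes = 3, without small-text):

```python
from scipy.sparse import csr_matrix

num_classes = 3  # illustrative value

# an "all unlabeled" multi-label target: two documents, zero stored label entries
y = csr_matrix((2, num_classes))

print(y.nnz)    # 0 stored entries
print(y.shape)  # (2, 3)
```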

For single-label datasets: A label of -1 means "unlabeled" (accessible through the constant LABEL_UNLABELED):

from small_text import LABEL_UNLABELED
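As a standalone illustration (the constant is redefined here so the snippet does not depend on small-text; per the above, its value is -1), an unlabeled single-label target array could look like this:

```python
import numpy as np

LABEL_UNLABELED = -1  # value of small-text's constant, as described above

texts = ['this is my document', 'yet another document']

# single-label targets: every document starts out unlabeled
y = np.full(len(texts), LABEL_UNLABELED, dtype=int)
```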

etiennekintzler commented 1 year ago

Thank you for your fast and detailed answer @chschroeder !

This is not really the answer you'd expect, but given that uncertainty-based query strategies like breaking ties and least confidence are both simple to implement and work well enough empirically (cf. your paper Revisiting Uncertainty-based Query Strategies for Active Learning with Transformers), I decided to just write a simple function for the breaking-ties query strategy that can be applied directly to the model probabilities:

import numpy as np

def get_bt_from_probas(probas_mat: np.ndarray, num_samples: int = 5) -> np.ndarray:
    # indices that sort each row's probabilities in ascending order
    argsort_mat = np.argsort(probas_mat, axis=1)
    # last two columns: (second-most likely class, most likely class)
    k2k1_mat = argsort_mat[:, -2:]
    # breaking-ties score: margin between the top two class probabilities
    scores = np.array([p[k1] - p[k2] for (p, (k2, k1)) in zip(probas_mat, k2k1_mat)])
    # select the samples with the smallest margins
    indices = np.argsort(scores)[:num_samples]
    return indices
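For larger pools, the per-row list comprehension above can be replaced by a fully vectorized variant using np.partition. This is my own sketch, not small-text's implementation; it should select the same samples (up to tie ordering):

```python
import numpy as np

def get_bt_from_probas_vectorized(probas_mat: np.ndarray, num_samples: int = 5) -> np.ndarray:
    # partition each row so its two largest probabilities land in the last two columns
    top2 = np.partition(probas_mat, -2, axis=1)[:, -2:]
    # breaking-ties score: margin between the two most likely classes
    scores = top2[:, 1] - top2[:, 0]
    # smallest margins first
    return np.argsort(scores)[:num_samples]
```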

I could have used the implementation of query strategies in https://github.com/webis-de/small-text/blob/v1.3.0/small_text/query_strategies/strategies.py, but it seems tightly coupled to the dataset and classifier, while I just needed something to apply to the model probabilities (I understand that for other methods, like expected gradient length, you'd need the model as well).

Feel free to close the ticket if you want ! I'll be watching the project and would be happy to try it again when 2.0 is out.

chschroeder commented 1 year ago

Thanks for the feedback! Yes, you can of course extract individual parts, but then you lose the benefits of the interface. Nevertheless, regarding small-text: if functions like this were separated from the strategy classes (as was done with the coreset strategies, for example), then an import would have sufficed. Adding this to the list of tasks.
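The separation discussed here could look roughly like the following sketch (hypothetical names, not small-text's actual API): a pure scoring function that only needs the probability matrix, plus a thin classifier-aware wrapper that delegates to it.

```python
import numpy as np

def breaking_ties_scores(probas_mat: np.ndarray) -> np.ndarray:
    # pure function: margin between the two largest class probabilities per row
    top2 = np.partition(probas_mat, -2, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

class BreakingTiesStrategy:
    # classifier/dataset-aware interface that merely delegates to the pure function;
    # clf is assumed to expose a predict_proba(dataset) method
    def query(self, clf, dataset, num_samples=5):
        scores = breaking_ties_scores(clf.predict_proba(dataset))
        return np.argsort(scores)[:num_samples]
```

This way, users who only have a probability matrix (as in the comment above) can call the function directly, while the class remains available for the pool-based interface.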