webis-de / small-text

Active Learning for Text Classification in Python
https://small-text.readthedocs.io/
MIT License

Query strategy that includes selecting high/medium certainty examples #24

Closed MoritzLaurer closed 1 year ago

MoritzLaurer commented 1 year ago

Feature description

The existing query strategies mostly seem to select data the model is particularly uncertain about (high entropy, ties, least confident ...). Are there other query strategies that also mix some data points into the training pool where the model is more certain?

Motivation

Many use-cases I work on deal with noisy data. So once a model has reached a certain quality, query strategies that only select uncertain examples can actually select data that is of low quality. Instead, it would be good to have a way of also adding some high- or medium-certainty examples to the training pool. The idea is that this gives the model some good, not-so-difficult examples to learn the task from, instead of always feeding it very difficult and potentially noisy/wrong data points that can hurt performance.

This is also an important use-case for zero-shot or few-shot models (like the Hugging Face zero-shot pipeline), which are getting more and more popular. They already have decent accuracy for the task, and selecting highly uncertain examples can actually hurt the training process by selecting noise / examples that are inherently uncertain.

Additional comments

I really like your library and am planning on using it for my research in the coming months :)

chschroeder commented 1 year ago

Hi @MoritzLaurer,

thanks for your kind words! I am happy that you like it.

Also, I am always interested in how small-text is used in research / applications / other projects. Feel free to report back here once you have something you are willing to show. (Of course only if you want to.) Soon, there might be a separate section for showcasing such work.

Now, regarding your request:

Preferred solution: Do you have a specific query strategy in mind? Whenever possible, I prefer strategies that have been scientifically evaluated. Off the top of my head, I think I have seen strategies that go for high confidence before, but I could not find anything like that when I searched a moment ago. Still, I will keep my eyes open and keep you updated.

Quick solution: Until then, if you don't mind using an "undocumented" approach, the existing confidence-based strategies can easily be adapted.

For example, PredictionEntropy (which selects high-entropy examples)...

import numpy as np

from scipy.stats import entropy

from small_text.query_strategies import ConfidenceBasedQueryStrategy


class PredictionEntropy(ConfidenceBasedQueryStrategy):
    """Selects instances with the highest prediction entropy [HOL08]_."""
    def __init__(self):
        super().__init__(lower_is_better=False)

    def get_confidence(self, clf, dataset, _indices_unlabeled, _indices_labeled, _y):
        proba = clf.predict_proba(dataset)
        return np.apply_along_axis(lambda x: entropy(x), 1, proba)

    def __str__(self):
        return 'PredictionEntropy()'

...can easily be changed to select in reverse order:

class LeastPredictionEntropy(ConfidenceBasedQueryStrategy):
    """Selects instances with the lowest prediction entropy."""
    def __init__(self):
        super().__init__(lower_is_better=True)  # <-- change here

    def get_confidence(self, clf, dataset, _indices_unlabeled, _indices_labeled, _y):
        proba = clf.predict_proba(dataset)
        return np.apply_along_axis(lambda x: entropy(x), 1, proba)

    def __str__(self):
        return 'LeastPredictionEntropy()'
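
If it helps, here is a rough sketch of how such a strategy would then be plugged into the usual small-text workflow (clf_factory and train are placeholders for your classifier factory and dataset, as in the library's examples):

from small_text import PoolBasedActiveLearner

# clf_factory and train are placeholders; see the small-text examples
# for how to construct a classifier factory and a dataset.
query_strategy = LeastPredictionEntropy()
active_learner = PoolBasedActiveLearner(clf_factory, query_strategy, train)

# after initializing the learner with some labeled data:
indices_queried = active_learner.query(num_samples=20)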

MoritzLaurer commented 1 year ago

Hey, thanks for your detailed response!

Unfortunately, I don't have a specific paper with a proven query strategy in mind. I'm looking into the literature at the moment and will report back if I find something (I also stumbled upon your survey paper; the taxonomy in Figure 2 is really helpful).

The quick solution looks good as an interim fix; I will test it.

chschroeder commented 1 year ago

You are using this for active learning (or planning to), right? If it belonged more to zero-shot learning than to active learning, I would hesitate to add it as a full query strategy, at least if it might confuse people; e.g., I would not expect a MostConfidence strategy to perform well in most cases.

Nevertheless, if we find a concept that fits your use case, I am really open to testing it.

Moreover, some possibly relevant papers regarding certainty:

Maybe the problem to investigate is also that "query strategies that only select uncertain examples can actually select data that is of low quality".

If you want to test something on your use cases, I could easily provide one or the other query strategy. Until then, I am keeping my eyes open for approaches that rely on certainty instead of uncertainty.
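
As a rough starting point, a strategy that mixes certain and uncertain examples could look like the following sketch. This is a hypothetical helper, not part of small-text; mixed_entropy_query, certain_fraction, and the 50/50 default are illustrative assumptions, and proba would come from clf.predict_proba(dataset) as above:

import numpy as np

from scipy.stats import entropy


def mixed_entropy_query(proba, indices_unlabeled, n=10, certain_fraction=0.5):
    """Select a mix of low-entropy (certain) and high-entropy (uncertain)
    examples from the unlabeled pool.

    proba: predicted class probabilities for the full dataset, shape (N, C).
    indices_unlabeled: array of indices into proba that are still unlabeled.
    """
    scores = np.apply_along_axis(entropy, 1, proba[indices_unlabeled])
    n_certain = int(n * certain_fraction)
    n_uncertain = n - n_certain
    order = np.argsort(scores)  # ascending entropy: most certain first
    chosen = np.concatenate([order[:n_certain], order[len(order) - n_uncertain:]])
    return indices_unlabeled[chosen]

Whether a 50/50 split (or any other ratio) actually helps would of course have to be validated on your data.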

MoritzLaurer commented 1 year ago

Yeah, I'm planning on using it for active learning with few-shot models. Concretely, I will use a BERT-NLI model as the active learner. The good thing about universal classifiers like BERT-NLI (or PET or GPT etc.) is that (1) they have zero-shot knowledge which can be used for the first (normally random) sampling round, and (2) beyond zero-shot, they also perform better than standard models when only 100 to ~2000 training data points are available (see, e.g., this manuscript).

I think that an important part of the background for my motivation to include higher-confidence samples is that I don't trust the 'oracle' in my use-cases due to noisy data / ambiguous tasks. I'm planning some simulations where texts are not annotated by me; instead, I take existing annotations from crowd-workers / research assistants and just use these as the 'oracle' for the learning loop. Due to the ambiguity / difficulty of the task, one text could reasonably be attributed to multiple categories (although each data point has only one label, i.e. the task was designed as multi-class, not multi-label). I'm therefore afraid that high-uncertainty sampling selects specifically these ambiguous examples and confuses the model.

I suppose that an important assumption for all/most query strategies is that the task is perfectly non-ambiguous and the labels are 100% correct? (Which, in my experience, is almost never the case in practice if I look into the data)

Another reason why I want to use active learning is that my data is extremely unbalanced. So another motivation for medium/high-confidence examples is simply to get some texts that probably belong to a minority class, where it is less relevant that the model is uncertain about them (zero-/few-shot models are also good at identifying texts from minority classes). (Data imbalance might also be something that is under-researched in active learning, because most NLP datasets are artificially balanced?)

Thanks for sharing these articles, I will look into them!

The more I think about it, the more I think I need to play with active learning and different strategies on my data to get a better feeling for active learning, before I can make a more informed request for a specific strategy :)

chschroeder commented 1 year ago

I'm therefore afraid that high-uncertainty sampling selects specifically these ambiguous examples and confuses the model.

This sounds more like an assumption so far, am I right? I would test this before making any decisions if I were you. What uncertainty-based strategies certainly do not bring is: (1) representativity, i.e. if you have many similar samples in your dataset, they might all be selected in the same query; and (2) class balance, which is especially important if you already know that your class distribution is imbalanced.
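
To illustrate (2), a class-balancing wrapper could look like the following sketch. Again, this is hypothetical and not part of small-text; it spreads the query budget over the model's predicted classes (only a proxy for the true labels) and takes the most uncertain examples within each predicted class:

import numpy as np

from scipy.stats import entropy


def class_balanced_uncertainty_query(proba, indices_unlabeled, n=10):
    """Spread the query budget evenly over predicted classes, taking the
    highest-entropy examples within each predicted class. Since predicted
    labels are only a proxy, the result is balanced at best approximately.
    """
    pool_proba = proba[indices_unlabeled]
    predicted = pool_proba.argmax(axis=1)
    scores = np.apply_along_axis(entropy, 1, pool_proba)

    num_classes = pool_proba.shape[1]
    per_class = max(1, n // num_classes)

    chosen = []
    for c in range(num_classes):
        candidates = np.where(predicted == c)[0]
        # highest entropy first within this predicted class
        ranked = candidates[np.argsort(-scores[candidates])]
        chosen.extend(ranked[:per_class].tolist())

    return indices_unlabeled[np.asarray(chosen[:n], dtype=int)]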

I suppose that an important assumption for all/most query strategies is that the task is perfectly non-ambiguous and the labels are 100% correct? (Which, in my experience, is almost never the case in practice if I look into the data)

The keyword here is the "noisy oracle". From there, you would need to find an evaluation of the model you intend to use with respect to its resistance to label noise.

data imbalance might also be something that is under-researched in active learning, because most NLP datasets are artificially balanced?

Somewhat, yes (at least in NLP). There are some works, e.g. the paper by Ein-Dor et al. is quite popular, but there are surely some gaps left. Moreover, I stumbled upon a recently published survey that might be interesting.


Apart from that, my next two commits (unrelated to your request, but a good fit here) might be interesting:

MoritzLaurer commented 1 year ago

Thanks a lot for all these comments and links, that's very helpful! Yeah, that's an assumption; I will test it in the coming months.

And the integration with SetFit sounds great; I haven't used it yet, but I read the paper. Looking forward to the next release.

chschroeder commented 1 year ago

Hi, I will close this for now; feel free to reopen (or just open another issue if you have a specific idea).