webis-de / small-text

Active Learning for Text Classification in Python
https://small-text.readthedocs.io/
MIT License
547 stars 60 forks

Additional options for Discriminative Active Learning #56

Closed: emilysilcock closed this issue 4 months ago

emilysilcock commented 7 months ago

Hi!

First of all, thanks so much for a fantastic library - this was amazing to discover and has saved me weeks of time.

I'm using your implementation of Discriminative Active Learning (DAL), and I believe it is cold-starting the classifier in every iteration.

The DAL paper recommends further fine-tuning the model that was trained for classification - and in the accompanying blog post they show that this significantly outperforms cold-starting the model. (Blog post: https://dsgissin.github.io/DiscriminativeActiveLearning/2018/07/05/DAL.html)

[image: learning curves from the DAL blog post comparing a warm-started discriminator to a cold-started one]

I was wondering if this could be an option in your implementation of DAL.

Thanks!
Emily

chschroeder commented 7 months ago

Hi Emily,

Thank you for the kind feedback! I'm happy to hear that small-text is useful to you.

I am open to any extensions of the discriminative active learning implementation. The only thing I am trying to ensure is that the default parameters give you the (original) method described in the accompanying paper. It has been a while since I implemented this, but after a brief glance at both the implementation and the paper, I would say that the current state matches the paper. Do you agree?

Regarding the extension: Thank you for the link to this blog post; I was unaware of it. If I understand your proposal correctly, the idea is to use the current classification model as a starting point for the (discriminative) binary model. Is this accurate? If so, this should be quite easy to add. Do you have a specific use case in mind for applying this strategy? This might be a good opportunity to test whether the extended implementation works as intended.

Best regards
Christopher

emilysilcock commented 7 months ago

Hi Christopher,

Thanks for the super quick reply!

My understanding of the DAL paper is that their default implementation includes this extension - though they don't go into a huge amount of detail, and I might have misinterpreted it! In the paragraph below, they say that using the current classification model as a starting point for the (discriminative) binary model is important for performance.

[image: excerpt from the DAL paper discussing the use of the learned representation for the binary classification task]

Thanks!
Emily

chschroeder commented 7 months ago

Yes, I think this part, while lacking detail, explains that you can either use the original representation $\mathcal{X}$ or the learned representation $\hat{\mathcal{X}}$, where the latter is reported to be more effective. Luckily, there seems to be an implementation by the original authors to answer the remaining questions. It seems my implementation matches the "basic" discriminative active learning that operates on the original representations, but this does not change the fact that the learned representation is likely better (and also what you want in this case).

One thing I forgot to consider is that the proposed extension's models should take vector representations as input, as opposed to the current dataset abstractions. While it is easy to build this for one specific model, it will be more difficult to implement it in a way that works for different model classes. This could be the reason that I stopped at the current implementation; unfortunately I cannot remember as quite some time has passed since then.
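
For illustration, a minimal sketch of what operating on vector representations could look like (the helper name is hypothetical; it assumes the classifier exposes an embed() method, as small-text's transformer-based classifiers do, and uses a plain scikit-learn model as the binary discriminator):

import numpy as np
from sklearn.linear_model import LogisticRegression

def discriminative_query_on_embeddings(clf, dataset, indices_labeled,
                                       indices_unlabeled, q):
    # learned representation \hat{X}: embeddings from the trained classifier
    embeddings = clf.embed(dataset)

    # binary task: labeled pool (0) vs. unlabeled pool (1)
    X = np.vstack([embeddings[indices_labeled], embeddings[indices_unlabeled]])
    y = np.hstack([np.zeros(len(indices_labeled), dtype=int),
                   np.ones(len(indices_unlabeled), dtype=int)])

    discriminator = LogisticRegression(max_iter=1000).fit(X, y)

    # select the unlabeled instances most confidently predicted as "unlabeled"
    proba = discriminator.predict_proba(embeddings[indices_unlabeled])[:, 1]
    return indices_unlabeled[np.argpartition(-proba, q)[:q]]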

I can try to come up with something but it will likely not be this week. What is your time frame for this? Which model are you planning to use?

emilysilcock commented 7 months ago

At the moment I've written something simple that just reuses the current classification model, which I'm using for now, so there's no particular timeframe on my end! This ties you to using the same hyperparameters when training the DAL classifier and the standard classifier, but that wasn't a particular problem for me.

import numpy as np
import small_text


class DiscriminativeActiveLearning_amended(small_text.query_strategies.strategies.DiscriminativeActiveLearning):

    # Amended to use the most recent topic classifier as per the DAL paper

    def _train_and_get_most_confident(self, ds, indices_unlabeled, indices_labeled, q):

        # Original implementation trained a fresh classifier each time:
        # if self.clf_ is not None:
        #     del self.clf_
        # clf = self.classifier_factory.new()

        # Amended: warm-start from the current classification model
        # (relies on the global `active_learner` being in scope)
        clf = active_learner._clf

        num_unlabeled = min(indices_labeled.shape[0] * self.unlabeled_factor,
                            indices_unlabeled.shape[0])

        indices_unlabeled_sub = np.random.choice(indices_unlabeled,
                                                 num_unlabeled,
                                                 replace=False)

        # relabel the data into the binary task: labeled pool vs. unlabeled pool
        ds_discr = DiscriminativeActiveLearning_amended.get_relabeled_copy(ds,
                                                                           indices_unlabeled_sub,
                                                                           indices_labeled)

        self.clf_ = clf.fit(ds_discr)

        proba = clf.predict_proba(ds[indices_unlabeled])
        proba = proba[:, self.LABEL_UNLABELED_POOL]

        # return instances which most likely belong to the "unlabeled" class (higher is better)
        return np.argpartition(-proba, q)[:q]
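
For context, a hypothetical usage sketch of the amended strategy (assuming the small-text v1 API; clf_factory, train_dataset, indices_initial, and y_initial are illustrative placeholders). Note that the class above reads the global active_learner at query time, so that variable must hold a trained learner before query() is called:

# Illustrative only; names are placeholders, not small-text defaults.
query_strategy = DiscriminativeActiveLearning_amended(clf_factory, num_iterations=10)
active_learner = small_text.PoolBasedActiveLearner(clf_factory, query_strategy, train_dataset)
active_learner.initialize_data(indices_initial, y_initial)

indices_queried = active_learner.query(num_samples=25)
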
chschroeder commented 6 months ago

Quick update: I have very little time at the moment, but I have a first version of discriminative active learning on representations, specifically for transformer models, which should be considerably more efficient.

At the same time, I have added automatic mixed precision in the 2.0.0 branch, which might be interesting for this strategy as well, since I found the runtime to be the main drawback of discriminative active learning.
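
For readers unfamiliar with it, this is the generic PyTorch automatic mixed precision pattern (plain torch.cuda.amp usage, not small-text's internal code):

# Generic PyTorch AMP pattern (requires a CUDA device): forward passes run
# in float16 where safe, and GradScaler guards against gradient underflow.
import torch
from torch import nn

model = nn.Linear(128, 2).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(32, 128, device='cuda')
y = torch.randint(0, 2, (32,), device='cuda')

optimizer.zero_grad()
with torch.cuda.amp.autocast():
    loss = loss_fn(model(x), y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()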

I still have to properly finish this, and then I will run some sanity checks and runtime comparisons. If you or anyone else is interested, this could be turned into a blog post similar to the one above ;). Unfortunately, I have to stop before that.

emilysilcock commented 6 months ago

Sounds great! Is the discriminative active learning on representations in one of the branches? I couldn't see it in either the amp or dev branches.

Happy to try and test - though time is not being my friend much at the moment either. DAL is definitely very slow for me at the moment.

chschroeder commented 6 months ago

The relevant code update was pushed just a few minutes ago, but it is still in a very rough state and a work in progress. See the discriminative-al branch. A gist with a usage example (based on one of the examples from the examples folder) is here.

Please be cautious; this is neither tested nor polished yet.

chschroeder commented 4 months ago

Sorry for the long wait. I have been busy, and so have my GPU resources. The implementation is now almost done except for some final tests and cleanup. I will likely finish it and merge it into the dev branch in the course of the day.

The implementation is now a little more sophisticated than before, and moreover, an option for stochastic sampling was added. It seems really fast compared to the rather slow DiscriminativeActiveLearning strategy. Admittedly, I have not looked at the numbers yet, but I have recorded the running times for the experiment below.
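
My reading of stochastic sampling in this context (a sketch, not necessarily the exact implementation): instead of deterministically taking the top-q instances, sample q instances with probability proportional to the discriminator's p("unlabeled"):

import numpy as np

def stochastic_query(proba_unlabeled_class, indices_unlabeled, q, seed=None):
    # Sample q instances in proportion to p("unlabeled") rather than
    # taking a deterministic top-q.
    rng = np.random.default_rng(seed)
    p = proba_unlabeled_class / proba_unlabeled_class.sum()
    return rng.choice(indices_unlabeled, size=q, replace=False, p=p)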

Small experiment (only intended as a sanity check)

The setup is close to my previous experiments. I evaluate the new representation-based strategies, i.e., the ones that use the learned representation as input, and compare them to Breaking Ties on the AG News dataset (news domain, 4 classes, balanced). Active learning starts at 25 samples, and I perform 20 iterations, in each of which 25 more samples are labeled.
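
In code, that loop corresponds roughly to the following (a sketch assuming small-text's pool-based learner and an oracle providing the true labels):

# Sketch of the experiment loop; `active_learner` and `y_train` are assumed
# to be set up beforehand, with y_train acting as the labeling oracle.
for _ in range(20):
    indices_queried = active_learner.query(num_samples=25)
    active_learner.update(y_train[indices_queried])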

[figure: disc-al-tests-curve - learning curves for the representation-based strategies vs. Breaking Ties on AG News]

This is of course an easy setting with a balanced dataset. For a real experiment, I would add a random baseline as a next step. The fact that increasing the sub-batches seems to improve the learning curve seems reasonable and at least gives some confidence that the implementation might be correct. Also, it seems to perform better with SetFit, whose embeddings are more meaningful. Finally, the curve looks similar to my experience with the DiscriminativeActiveLearning strategy, which fails to achieve final accuracy values close to BreakingTies.

I still wouldn't recommend it as a first choice, but it might be interesting for datasets with a highly imbalanced class distribution or for experiments that want to extend this strategy.

simon-lowe commented 1 week ago

Hi!

I'm a collaborator of @emilysilcock. Thank you so much for this great package. Sorry that I am reopening this topic, but I wanted to ask for some clarification on the DiscriminativeRepresentationLearning class. I've been trying to dig into your code, but admittedly my PyTorch is only mediocre, so I'm hoping that asking you directly might be a bit easier. From my understanding of the code - and I might be totally wrong - I don't think it entirely aligns with my understanding of the blog (and the paper). In particular, this came up while we were running the code on some of our tests and observed behavior that we didn't quite understand.

So let me describe what we believe the method to be:

One iteration of active learning:

[list of steps omitted]
Does this correspond to the way you have coded it up? In particular, it seemed from the code that you were only re-training the classifier head, but again I might be completely wrong. And is the process with the sub-batches the same? Finally, is there an equivalent of the Unlabeled_size_fraction parameter?

Again thank you so much for the amazing package.

chschroeder commented 1 week ago

Hi @simon-lowe,

Thank you for the kind words. I am happy that the package seems to be useful to you :).

Does this correspond to the way you have coded it up?

I would prefer explicit questions here, since this is yet another presentation, different from the one in the paper, and I might misinterpret it and give you a false confirmation. From a quick glance, the overall process seems right except for the points addressed below. Still, do not take this as a confirmation.

The paper omits the details of the outer active learning loop, which you included (e.g., Model_y). Therefore, most of this is unspecified in the paper, and you have to fill in the gaps without deviating from the original method.

In particular, it seemed from the code that you were only re-training the classifier head, but again I might be completely wrong.

Yes, this is the case. See the section "The Representation" in the blog post, where Gissin writes about different representations. For image data, he mentions the option of using raw data as features, which does not translate to text and transformer models. In our case, the representation we use for text is the embedding just before the classification head, which is common for text classification. Training only the head means that we treat the representation as fixed input data, which speeds up the (discriminative) classification considerably. Training the full model would also not deviate from the paper, but it would dramatically increase the runtime cost with no guaranteed benefit.
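
Concretely, training only the head amounts to something like the following sketch (illustrative, not the actual implementation): the encoder's embeddings are computed once and treated as fixed inputs to a small linear classifier.

# Sketch: train only a linear head on fixed, precomputed embeddings;
# the encoder is never updated, so each epoch is cheap.
import torch
from torch import nn

embeddings = torch.randn(1000, 768)     # stand-in for precomputed embeddings
targets = torch.randint(0, 2, (1000,))  # labeled pool (0) vs. unlabeled pool (1)

head = nn.Linear(768, 2)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(5):
    optimizer.zero_grad()
    loss = loss_fn(head(embeddings), targets)
    loss.backward()
    optimizer.step()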

And is the process with the sub-batches the same?

Do you mean whether it is the same as in your list above? I think this should be addressed by the answer above; if not, please tell me.

Finally, is there an equivalent of the Unlabeled_size_fraction parameter?

There is the unlabeled_factor parameter. Is this what you were looking for?
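
For reference, judging from the snippet earlier in this thread, unlabeled_factor caps the size of the sampled "unlabeled" set at a multiple of the labeled set:

# Mirrors the sampling logic from the snippet above: the number of
# "unlabeled" examples is at most unlabeled_factor times the labeled count.
def num_unlabeled_to_sample(indices_labeled, indices_unlabeled, unlabeled_factor=10):
    return min(indices_labeled.shape[0] * unlabeled_factor,
               indices_unlabeled.shape[0])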