webis-de / small-text

Active Learning for Text Classification in Python
https://small-text.readthedocs.io/
MIT License

Low values in pytorch integration #62

Closed: Chirilla closed this issue 2 months ago

Chirilla commented 3 months ago

Hi, sorry to bother you, but I'm trying to work with the PyTorch integration using a JSON dataset from another repository, and the accuracy results are strangely low. I used the example "pytorch_multiclass_classification.py" as a basis and I can't understand where I'm making a mistake. I have uploaded a zip with the notebook... do you have any PyTorch text classification example you could share? Or any advice? Thanks in advance. ST.zip

chschroeder commented 3 months ago

Hi @Chirilla,

I briefly looked into the notebook, but since I can't see your data, this is difficult to diagnose remotely. How many samples are in your dataset? What is the average number of tokens? From the name I can tell that it is fake news data, which might of course be tricky. I assume all the data is English?

  1. It looks like you have a fully labeled dataset. If so, you could try to train on the full dataset (without active learning). If that works, the classifier is not the problem.
  2. Did you try another classifier? For example, a (small) transformer model or an SVM instead of KimCNN.
  3. Since the result before the first iteration looks okay: did you try another query strategy? I currently use BreakingTies() as long as I don't have a reason not to. (See the sketch below for points 2 and 3.)
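
If it helps, here is a minimal sketch of points 2 and 3 (not from your notebook; exact import paths and constructor signatures may differ between small-text versions, and texts/labels stand for your 'text' and 'label' columns): an SVM baseline via SklearnClassifierFactory, queried with BreakingTies().

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from small_text import (
        BreakingTies,
        ConfidenceEnhancedLinearSVC,
        PoolBasedActiveLearner,
        SklearnClassifierFactory,
        SklearnDataset,
    )

    # texts / labels: your 'text' and 'label' columns (labels as integer ids)
    vectorizer = TfidfVectorizer()
    x = vectorizer.fit_transform(texts)
    train = SklearnDataset(x, np.array(labels))

    num_classes = 2  # fake vs. real
    clf_factory = SklearnClassifierFactory(ConfidenceEnhancedLinearSVC(), num_classes)
    query_strategy = BreakingTies()

    active_learner = PoolBasedActiveLearner(clf_factory, query_strategy, train)

If this baseline behaves well inside the active learning loop, the problem is most likely in the KimCNN setup rather than in the data or the loop itself.
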
Chirilla commented 3 months ago

Thank you for your reply! The dataset has 6335 samples, and I consider only the 'text' and 'label' attributes of each news item (all data is English) for my analysis. I already tried SklearnClassifier and it gives me better results (accuracy near or above 0.9) and turns out to be much faster. I tried some of the other strategies (BreakingTies() included) with KimCNN because I would like to compare them with this classifier, but I get similar results, so I think I'm doing something wrong with it, but I can't figure out what.

chschroeder commented 3 months ago

Then we can rule out that it is a problem with the data, good.

How long is an average text? You have specified a maximum token length of 512, so if the texts are considerably longer, tokens beyond this point are "invisible" during training. If they are considerably shorter, you should reduce this value as well.
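
A quick way to check (just a sketch; texts stands for your list of news texts, and whitespace splitting is only a rough stand-in for the actual tokenizer):

    import numpy as np

    # approximate token counts via whitespace splitting
    lengths = np.array([len(text.split()) for text in texts])
    print(f'mean={lengths.mean():.1f}  median={np.median(lengths):.0f}  '
          f'95th percentile={np.percentile(lengths, 95):.0f}  max={lengths.max()}')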

Otherwise, you can still try to train the KimCNNClassifier on the full dataset to verify that this works (see the sketch below). The parameters with the most influence are the learning rate (lr), the kernel sizes (kernel_heights), and the number of output channels (out_channels).
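
Something along these lines (a sketch only, assuming the PyTorch integration is installed and that train/test are the datasets from your notebook; argument names and defaults may differ slightly between versions):

    from sklearn.metrics import accuracy_score
    from small_text.integrations.pytorch.classifiers import KimCNNClassifier

    num_classes = 2  # fake vs. real
    clf = KimCNNClassifier(num_classes,
                           embedding_matrix=embedding_matrix,  # your pretrained embeddings
                           max_seq_len=512,                    # match the notebook setting
                           lr=0.001,
                           kernel_heights=[3, 4, 5],
                           out_channels=100)

    clf.fit(train)  # full labeled training set, no active learning
    print('accuracy:', accuracy_score(test.y, clf.predict(test)))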

Besides that, the demo example lowercases the word tokens, which could be a problem if the telling fake news features are something like proper names.

    try:
        from torchtext import data
        text_field = data.Field(lower=False)  # I replaced True with False
        label_field = data.Field(sequential=False, unk_token=None, pad_token=None)

    except AttributeError:
        # torchtext >= 0.9.0, where Field has moved to torchtext.legacy
        from torchtext.legacy import data
        text_field = data.Field(lower=False)  # I replaced True with False
        label_field = data.Field(sequential=False, unk_token=None, pad_token=None)

where lower=True controls the lowercasing. I could be mistaken on this last point, as I haven't used torchtext in a while (and it will be replaced in the next major version), but you could try to disable the lowercasing here.

Chirilla commented 3 months ago

Hi, I tested the KimCNNClassifier by training it on the full dataset and it seems to work, but the Factory still gives me bad results (while the Sklearn version gives me 0.9 accuracy) and I can't understand why. I tried with a different (random) embedding matrix and with lower=False, but I don't get the same results.

chschroeder commented 3 months ago

As long as you have really not used different settings, there should not be a difference. If you need another sanity check, you can train using 10%, 20%, ..., 90%, 100% of the data (see the sketch below). Then you will see whether the results improve rapidly after a certain percentage of the data has been used.
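
Sketched roughly like this (assuming train, test and a clf_factory from your notebook; dataset slicing and factory.new() are used as I remember the API, so double-check against your version):

    import numpy as np
    from sklearn.metrics import accuracy_score

    indices = np.random.permutation(len(train))

    for fraction in np.arange(0.1, 1.01, 0.1):
        subset = train[indices[:int(fraction * len(train))]]
        clf = clf_factory.new()  # fresh classifier for every fraction
        clf.fit(subset)
        acc = accuracy_score(test.y, clf.predict(test))
        print(f'{fraction:.0%} of the data -> accuracy {acc:.3f}')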

KimCNN can achieve results that are slightly above the SVM, but for simple concepts the SVM might learn the right features faster. Depending on what you are trying to do, I would not recommend using KimCNN.

chschroeder commented 2 months ago

Closing this due to inactivity. Feel free to reopen if necessary.