Closed Chirilla closed 2 months ago
Hi @Chirilla,
I briefly looked into the notebook, but as I can't see your data this is difficult to diagnose remotely. How many samples are in your dataset? What is the average number of tokens? From the name I can tell it is fake news data, which might of course be tricky. I assume all data is English?
I use BreakingTies() as long as I don't have a reason not to.

Thank you for your reply! The dataset has 6335 samples, and I consider only the 'text' and 'label' attributes of each news item (all data is English) for my analysis. I already tried SklearnClassifier and it gives me better results (accuracy near or above 0.9) and turns out to be much faster. I tried some of the other strategies (BreakingTies() included) with KimCNN because I would like to compare them with this classifier, but I get similar results, so I think I'm doing something wrong with it, but I can't figure out what.
Then we can rule out that it is a problem with the data, good.
How long is an average text? You have specified a maximum token length of 512, so if the texts are considerably longer, tokens beyond this point are "invisible" during training. If they are considerably shorter, you should reduce this value as well.
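A quick way to check this is to look at the token-length distribution and compare it against the configured maximum. This is a minimal sketch: `texts` stands in for your list of news articles, and whitespace splitting only approximates the real tokenizer.

```python
# Rough token-length statistics to sanity-check a max token length of 512.
# Whitespace splitting is only an approximation of the real tokenizer.
texts = [
    "Example fake news article text ...",
    "Another, much longer article body with many more words ...",
]

lengths = sorted(len(t.split()) for t in texts)

avg = sum(lengths) / len(lengths)
p95 = lengths[min(len(lengths) - 1, int(0.95 * len(lengths)))]

print(f"avg tokens: {avg:.1f}, 95th percentile: {p95}, max: {lengths[-1]}")
# If p95 is far below 512, a smaller maximum length is sufficient;
# if it is far above, most of each text is invisible to the model.
```

If the 95th percentile sits far from 512 in either direction, adjusting the maximum length is a cheap first experiment.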
Otherwise, you can still try to train KimCNNClassifier on the full dataset to verify that this works. The parameters with the most influence are the learning rate (lr), the kernel sizes (kernel_heights), and the number of output channels (out_channels).
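If you want to try several settings for these three parameters systematically, a small grid of keyword-argument dicts is enough. The value ranges below are illustrative assumptions, not tuned defaults, and the factory call mentioned in the comment is an assumed name that may differ across small-text versions; the grid construction itself is plain Python.

```python
from itertools import product

# Candidate values for the three most influential KimCNN parameters
# (illustrative assumptions, not tuned defaults).
lrs = [1e-3, 5e-4, 1e-4]
kernel_heights = [[3, 4, 5], [2, 3, 4, 5]]
out_channels = [100, 200]

grid = [
    {"lr": lr, "kernel_heights": kh, "out_channels": oc}
    for lr, kh, oc in product(lrs, kernel_heights, out_channels)
]

print(len(grid))  # 12 configurations
# Each dict could then be passed as kwargs when constructing the KimCNN
# classifier (or its factory) -- the exact constructor name is assumed here.
```

Training each configuration on the full dataset and keeping the best one is usually enough at this dataset size; a full hyperparameter search is not needed to rule out a bad default.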
Besides that, the demo example lowercases the word tokens, which could be a problem if the striking fake-news features are something like proper names.
try:
    from torchtext import data

    text_field = data.Field(lower=False)  # I replaced True with False
    label_field = data.Field(sequential=False, unk_token=None, pad_token=None)
except AttributeError:
    # torchtext >= 0.9.0 moved this API to torchtext.legacy
    from torchtext.legacy import data

    text_field = data.Field(lower=False)  # I replaced True with False
    label_field = data.Field(sequential=False, unk_token=None)
where lower=True
controls the lowercasing. I could be mistaken on this last point as I haven't used torchtext in a while (and it will be replaced with the next major version), but you could try to disable the lowercasing here.
Hi, I tested the KimCNNClassifier by training it on the full dataset and it seems to work, but the factory still gives me bad results (while the sklearn version gives me 0.9 accuracy) and I can't understand why. I tried with a different (random) embedding matrix and with lower=False, but I still don't get comparable results.
As long as you have really not used different settings, there should not be a difference. If you need another sanity check, you can train using 10%, 20%, ..., 90%, 100% of the data. Then you will see whether the results improve rapidly after a certain percentage of the data has been used.
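This learning-curve check can be scripted as a simple loop. In this sketch, `train_and_evaluate` is a dummy placeholder you would swap for your actual KimCNN training and evaluation; everything else is plain Python.

```python
import random

# Sanity check: train on growing fractions of the data and watch the score.
def train_and_evaluate(samples):
    # Dummy stand-in for the real train/eval routine: pretend the score
    # grows slowly with the amount of training data, capped at 0.9.
    return min(0.9, 0.5 + 0.00005 * len(samples))

random.seed(42)
data = list(range(6335))   # stands in for the 6335 labeled samples
random.shuffle(data)       # shuffle once so every subset is representative

for pct in range(10, 101, 10):
    n = len(data) * pct // 100
    score = train_and_evaluate(data[:n])
    print(f"{pct:3d}% ({n:5d} samples): accuracy={score:.3f}")
```

If the curve jumps to a good score only near 100%, the model simply needs more data; if it stays flat and low throughout, the problem is more likely in the setup (preprocessing, labels, or hyperparameters).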
KimCNN can achieve results that are slightly above the SVM, but for simple concepts the SVM might learn the right features faster. Depending on what you are trying to do, I would not recommend using KimCNN.
Closing this due to inactivity. Feel free to reopen if necessary.
Hi, sorry to bother you, but I'm trying to work with the PyTorch integration using a JSON dataset from another repository, and the accuracy results are strangely low. I used the example "pytorch_multiclass_classification.py" as a basis and I can't understand where I'm making a mistake. I have uploaded a zip with the notebook... do you have any PyTorch text classification example you can share? Or any advice? Thanks in advance. ST.zip