zihaohe123 / wiki-enhanced-stance-detection


Topics used for Wikipedia definitions #5

Open dawidm opened 11 months ago

dawidm commented 11 months ago

Hello,

If I understand the code correctly, you use the 'new_topic' column to retrieve Wikipedia definitions (but 'topic_str' for the classification model). While working with the VAST dataset, I noticed that the 'new_topic' values are incorrect for synthetic neutral samples (type_idx=4). This makes the target definitions incorrect for most neutral samples. Please correct me if I'm wrong.

zihaohe123 commented 11 months ago

Hi Dawid,

Yes, that's how my code works, and thanks for pointing out this issue with the VAST dataset. I encourage you to raise an issue in the original repo of this dataset.

If the targets do not appear in the text, then labeling the sample as neutral is not a problem. A neutral stance can arise in two cases: 1) the target is in the text but the author does not show a clear stance towards it; 2) the target does not appear in the text at all.

I hope this addresses your concerns.

dawidm commented 11 months ago

Thanks for your answer. I think this may be a little trickier. The targets for neutral samples are wrong, but in a specific way. As far as I understand, the 'new_topic' values for these samples simply haven't been changed to the random target that makes the sample neutral. So they are still highly related to the post text, which makes the Wikipedia definitions highly related as well. In my opinion, this could affect the results by making such samples easier to classify as neutral.
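One rough way to check the "highly related" claim is to measure how many words of each target also occur in the post. The snippet below is only a sketch under assumptions: the rows are invented toy data rather than real VAST samples, the 'post' column name is assumed, and word overlap is a crude proxy for relatedness.

```python
import csv
import io
import re

# Toy rows standing in for the VAST train file. The text is invented for
# illustration; the 'post' column name is an assumption, while 'topic_str',
# 'new_topic' and 'type_idx' are the columns discussed in this thread.
TOY_CSV = """post,topic_str,new_topic,type_idx
"Nuclear power is the cleanest energy we have.",nuclear power,nuclear power,1
"Nuclear power is the cleanest energy we have.",school uniform,nuclear power,4
"""


def tokens(text):
    """Lowercase alphabetic tokens of a string, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap(target, post):
    """Fraction of the target's tokens that also occur in the post."""
    target_tokens = tokens(target)
    if not target_tokens:
        return 0.0
    return len(target_tokens & tokens(post)) / len(target_tokens)


rows = list(csv.DictReader(io.StringIO(TOY_CSV)))

# For synthetic neutral rows, compare the classifier target ('topic_str')
# with the target used for the Wikipedia lookup ('new_topic').
scores = [
    (overlap(row["topic_str"], row["post"]), overlap(row["new_topic"], row["post"]))
    for row in rows
    if row["type_idx"] == "4"
]
for topic_str_score, new_topic_score in scores:
    print(f"topic_str overlap: {topic_str_score:.2f}, new_topic overlap: {new_topic_score:.2f}")
```

In the toy rows, the swapped 'topic_str' shares no words with the post while 'new_topic' shares all of them. If the real type_idx==4 rows showed a similar gap, that would be consistent with the concern above, though it would not prove an effect on the scores.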

zihaohe123 commented 11 months ago

Could you show some examples?

dawidm commented 11 months ago

I mean all samples that have type_idx==4 in the VAST train set. My suspicion that the incorrect definitions make them easier to classify is potentially confirmed by the improved score for the neutral class (~0.925) compared to BERT alone (<0.900), which shouldn't happen when the definitions for neutral samples are wrong.
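For concrete examples, something like the following could list the affected rows side by side. It is a minimal sketch: the helper names, the 'post' column name, and the file path in the comment are assumptions, and the demo rows are invented so that the snippet runs without the dataset.

```python
import csv


def load_rows(path):
    """Read a CSV file into a list of dicts (one dict per row)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def synthetic_neutral_examples(rows, limit=5):
    """Return (post, topic_str, new_topic) for up to `limit` rows with type_idx == 4."""
    picked = []
    for row in rows:
        if len(picked) >= limit:
            break
        if str(row["type_idx"]).strip() == "4":
            picked.append((row["post"], row["topic_str"], row["new_topic"]))
    return picked


# Invented rows, not real VAST samples. With the dataset at hand, one would
# call load_rows() on the train file instead (the path is hypothetical):
# rows = load_rows("data/vast_train.csv")
demo_rows = [
    {"post": "Nuclear power is clean.", "topic_str": "nuclear power",
     "new_topic": "nuclear power", "type_idx": "1"},
    {"post": "Nuclear power is clean.", "topic_str": "school uniform",
     "new_topic": "nuclear power", "type_idx": "4"},
]

for post, topic_str, new_topic in synthetic_neutral_examples(demo_rows):
    print(f"post={post!r} | topic_str={topic_str!r} | new_topic={new_topic!r}")
```

Rows where 'new_topic' still matches the post while 'topic_str' does not would be the cases I'm describing.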