yumeng5 / LOTClass

[EMNLP 2020] Text Classification Using Label Names Only: A Language Model Self-Training Approach
Apache License 2.0

AssertionError: Too few (0) documents with category indicative terms found for category 1; try to add more unlabeled documents to the training corpus (recommend) or reduce `--match_threshold` (not recommend) #22

Open ForeverNightmare opened 1 year ago

ForeverNightmare commented 1 year ago

Hi, I'm training my model with your framework and got this error:

```
Number of documents with category indicative terms found for each category is: {0: 9014, 1: 0, 2: 0, 3: 551, 4: 1478, 5: 20642, 6: 0, 7: 7429, 8: 8676, 9: 4814, 10: 1368, 11: 23, 12: 418}
Traceback (most recent call last):
  File "src/train.py", line 66, in <module>
    main()
  File "src/train.py", line 57, in main
    trainer.mcp(top_pred_num=args.top_pred_num, match_threshold=args.match_threshold, epochs=args.mcp_epochs)
  File "/home/xuanw/HL/LOTClass-master/src/trainer.py", line 451, in mcp
    self.prepare_mcp(top_pred_num, match_threshold)
  File "/home/xuanw/HL/LOTClass-master/src/trainer.py", line 392, in prepare_mcp
    assert category_doc_num[i] > 10, f"Too few ({category_doc_num[i]}) documents with category indicative terms found for category {i}; " \
AssertionError: Too few (0) documents with category indicative terms found for category 1; try to add more unlabeled documents to the training corpus (recommend) or reduce --match_threshold (not recommend)
```

But when I directly re-run the .sh file (with the dataset directory in it replaced with mine), it completes without any error. Will the result I get be correct? Does the previous error message mean this result is wrong?

yumeng5 commented 1 year ago

Hi,

The error is pretty much explained by the printouts -- for several categories (1, 2, 6) there are 0 documents with category indicative terms (as indicated by the dictionary printed out). So you probably need to add more documents likely to pertain to these categories to the corpus; otherwise, there is no way of training the classifier to detect these categories (and of course, the resulting classifier won't be accurate).

Thanks, Yu

ForeverNightmare commented 1 year ago

Hi @yumeng5 ,

Thanks for your reply! My question is: my training dataset contains about 230,000 instances, and each of my 12 labels has many instances in it. So I'm really confused how "Too few (0) documents with category indicative terms found for category 1" can happen. For example, label 6 has 2839 instances in the dataset, but the number of documents with category indicative terms found for it is 0. Meanwhile, label 10 has only 808 instances, yet 1368 matched documents are reported, more than the label actually has; similarly, label 5 has 6482 instances but 20642 are shown. Based on your understanding of your paper, would you mind speculating on what causes this?

yumeng5 commented 1 year ago

The number of documents found with category indicative terms is derived from the category vocabulary constructed in the first step, and it is not directly related to the actual number of instances in that category -- does the category vocabulary make sense for the categories without enough matching documents (e.g., labels 1, 2, 6)?

I'd suggest trying different label names (more common and distinctive terms tend to work better) and checking the category vocabulary accordingly.
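To make the matching step concrete, here is a minimal sketch (not LOTClass's actual code) of how documents can be counted against a per-category vocabulary: a document is counted for a category if it contains enough of that category's indicative terms. The corpus, vocabularies, and `match_threshold` below are all illustrative.

```python
def count_matched_docs(corpus, category_vocab, match_threshold=2):
    """Count, per category, documents containing >= match_threshold vocab terms."""
    counts = {cat: 0 for cat in category_vocab}
    for doc in corpus:
        tokens = set(doc.lower().split())
        for cat, vocab in category_vocab.items():
            # number of category-indicative terms appearing in this document
            if len(tokens & vocab) >= match_threshold:
                counts[cat] += 1
    return counts

corpus = [
    "the team won the championship game",
    "stocks fell as markets reacted to earnings",
    "the player scored in the final game",
]
category_vocab = {
    "sports": {"team", "game", "player", "championship", "scored"},
    "business": {"stocks", "markets", "earnings", "shares"},
    "politics": {"election", "senate", "policy"},
}
print(count_matched_docs(corpus, category_vocab))
# → {'sports': 2, 'business': 1, 'politics': 0}
```

Note how "politics" matches 0 documents even though the corpus may contain politics articles: if the category vocabulary doesn't overlap with how those documents are actually worded, no documents are counted, which is exactly why checking the constructed vocabulary matters.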

Thanks, Yu

ForeverNightmare commented 1 year ago

@yumeng5 Thanks for your suggestions! I've now started training on a new dataset and hit a new issue. I set the parameters like this: `MCP_EPOCH=20 SELF_TRAIN_EPOCH=10`

But the output shows that only 2 self-training epochs were executed:

```
100%|██████████| 226/226 [01:41<00:00, 2.22it/s]
lr: 9.929e-07
Average training loss: 0.10797090083360672
Test acc: 0.7305699586868286
lr: 8.905e-07
Average training loss: 0.11300306767225266
Test acc: 0.7253885865211487
Saving final model to datasets/movies/final_model.pt
```

What may cause this? I didn't set the early-stop parameter in the .sh file, so it should be false.
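For reference, one generic pattern that can make a loop finish in fewer rounds than configured is an early-stop check on prediction agreement. The sketch below is illustrative only (function and parameter names are invented, not LOTClass's actual code): it stops once predictions between consecutive epochs stop changing.

```python
def self_train(epochs, preds_per_epoch, early_stop=True, agree_threshold=0.001):
    """Run up to `epochs` rounds; stop early when predictions stabilize."""
    prev = None
    completed = 0
    for epoch in range(epochs):
        preds = preds_per_epoch[epoch]  # stand-in for one self-training pass
        completed += 1
        if early_stop and prev is not None:
            # fraction of predictions that changed since the previous epoch
            changed = sum(a != b for a, b in zip(prev, preds)) / len(preds)
            if changed < agree_threshold:
                break  # predictions converged; remaining epochs are skipped
        prev = preds
    return completed

# With identical predictions from epoch to epoch, only 2 of 10 rounds run:
rounds = [[0, 1, 1]] * 10
print(self_train(10, rounds))  # → 2
```

If the framework applies such a convergence check regardless of the configured epoch count, a run can legitimately save the final model after 2 epochs; checking whether an early-stop flag defaults to true would be the first thing to verify.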