yuzhimanhua / MICoL

Metadata-Induced Contrastive Learning for Zero-Shot Multi-Label Text Classification (WWW'22)
Apache License 2.0

A question #2

Open GCTTTTTT opened 2 years ago

GCTTTTTT commented 2 years ago

I want to ask: what is the origin of the predicted_label field in MAG_candidates.json?

yuzhimanhua commented 2 years ago

Hi,

Those "predicted labels" come from exact name matching and BM25 retrieval. You can refer to Section 3.2 in our paper (https://arxiv.org/pdf/2202.05932.pdf).

The contribution of BM25 in the retrieval stage is not very significant. So if you just want an approximation of the "predicted labels", you can implement a very simple exact name matching strategy: if the name of a label appears in a document, add that label to the document's "predicted labels". The result of this strategy should be a close approximation of what we show in MAG_candidates.json; see the sketch below.
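For concreteness, here is a minimal sketch of that exact name matching strategy. The `documents` and `labels` dictionaries are assumed structures for illustration, not the repo's actual data loading code.

```python
# Minimal sketch of exact name matching, assuming `labels` maps label IDs
# to surface names and `documents` maps paper IDs to raw text (neither
# structure is taken from the repo).

def exact_name_match(documents, labels):
    """Return {paper_id: [label_id, ...]} for labels whose name appears in the text."""
    candidates = {}
    for paper_id, text in documents.items():
        text_lower = text.lower()
        candidates[paper_id] = [
            label_id for label_id, name in labels.items()
            if name.lower() in text_lower
        ]
    return candidates

# Example:
# exact_name_match({"p1": "A study of contrastive learning ..."},
#                  {"l1": "Contrastive learning", "l2": "Graph mining"})
# -> {"p1": ["l1"]}
```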

GCTTTTTT commented 2 years ago

Hello, I want to ask whether the "venue", "author", "reference", and "citation" fields are required in {dataset}_test.json and {dataset}_train.json to run this model.

yuzhimanhua commented 2 years ago

Hi,

These fields are NOT required in {dataset}_test.json, but they are required in {dataset}_train.json.

If your own datasets do not have such metadata, you can use our MAG_train.json or PubMed_train.json for training and your own test set for testing. However, I cannot guarantee our model's performance in such a "transfer learning" setting.
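For illustration, a training record with these metadata fields might look like the sketch below. Only "venue", "author", "reference", and "citation" are confirmed field names from this thread; the other keys are hypothetical placeholders, not taken from the repo.

```python
# Hypothetical {dataset}_train.json record. "paper", "title", "abstract",
# and "label" are placeholder names; only the four metadata fields are
# confirmed in this thread.
record = {
    "paper": "2563153101",          # document ID (placeholder)
    "title": "Metadata-Induced Contrastive Learning for ...",
    "abstract": "...",
    "label": ["L123", "L456"],
    "venue": "WWW",                 # metadata used for training
    "author": ["A1", "A2"],
    "reference": ["P9", "P10"],
    "citation": ["P11"],
}
```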

GCTTTTTT commented 2 years ago

Oh, thanks! But if I use MAG_train.json for training and test on my own test set, should {dataset}_label.json and {dataset}_candidates.json correspond to my own test set?

yuzhimanhua commented 2 years ago

Yes, those two json files should correspond to your own test set.

If you do not have ground truth labels and just want to make predictions, you can remove the last line in run.sh: https://github.com/yuzhimanhua/MICoL/blob/master/run.sh#L12

GCTTTTTT commented 2 years ago

Hello! Thanks for your patient answers! I used my own data in test.json and those two JSON files. prepare.sh seems to have run successfully, but run.sh produced the errors below. What might be causing them?

Namespace(adam_epsilon=1e-08, architecture='cross', bert_model='scibert_scivocab_uncased/', eval=False, eval_batch_size=128, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, learning_rate=5e-05, max_contexts_length=256, max_grad_norm=1.0, max_response_length=256, model_type='bert', num_train_epochs=1.0, output_dir='MAG_output/', poly_m=0, print_freq=500, seed=12345, test_file='MAG_input/test.txt', train_batch_size=4, train_dir='MAG_input/', use_pretrain=True, warmup_steps=100, weight_decay=0.01)

Traceback (most recent call last):
  File "main.py", line 158, in <module>
    tokenizer = TokenizerClass.from_pretrained(os.path.join(args.bert_model, "vocab.txt"), do_lower_case=True, clean_text=False)
  File "/home/hxx/miniconda3/envs/pytorch/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1653, in from_pretrained
    f"Calling {cls.__name__}.from_pretrained() with the path to a single file or url is not "
ValueError: Calling BertTokenizerFast.from_pretrained() with the path to a single file or url is not supported for this tokenizer. Use a model identifier or the path to a directory instead.

The evaluation run (the same Namespace except eval=True) fails with an identical traceback.

yuzhimanhua commented 2 years ago

Hello,

Sorry for my late reply. I re-ran the code on my side and it worked fine, so I am not quite sure about the cause. I suspect it is a package version issue. Could you please try switching to Python 3.6 and installing the torch and transformers versions listed in https://github.com/yuzhimanhua/MICoL/blob/master/requirements.txt?
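If downgrading is not an option, the error message itself points at one possible workaround, sketched below under the assumption of a newer transformers version: pass the model directory instead of the path to vocab.txt.

```python
from transformers import BertTokenizerFast

# Sketch of a possible fix for the ValueError above: newer versions of
# transformers expect a model identifier or a directory, not a vocab file.
tokenizer = BertTokenizerFast.from_pretrained(
    "scibert_scivocab_uncased/",  # directory, not .../vocab.txt
    do_lower_case=True,
)
```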

Thanks!