zjunlp / DeepKE

[EMNLP 2022] An Open Toolkit for Knowledge Graph Extraction and Construction
http://deepke.zjukg.cn/
MIT License
3.41k stars 675 forks

NER model fails at prediction #121

Closed callmeiron closed 2 years ago

callmeiron commented 2 years ago

Describe the question

A clear and concise description of what the question is. I want to use NER to recognize missile-related named entities in crawled documents. I prepared my own training data, about 4,000 characters in total. Training runs fine, but `predict` always fails. Is the training data too small?

Environment (please complete the following information):

Screenshots

If applicable, add screenshots to help explain your problem.

Additional context

Add any other context about the problem here.

C:\Python38\python.exe C:/Users/Administrator/Desktop/project1-jiaohe/DeepKE-main/example/ner/standard/predict.py
C:\Python38\lib\site-packages\setuptools\distutils_patch.py:25: UserWarning: Distutils was imported before Setuptools. This usage is discouraged and may exhibit undesirable behaviors or errors. Please use Setuptools' objects directly or at least import Setuptools first.
  warnings.warn(
07/07/2022 16:05:08 - INFO - deepke.relation_extraction.multimodal.models.clip.file_utils - PyTorch version 1.10.0 available.
07/07/2022 16:05:08 - INFO - pytorch_transformers.modeling_bert - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
07/07/2022 16:05:08 - INFO - pytorch_transformers.modeling_xlnet - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
C:\Python38\lib\site-packages\hydra\plugins\config_source.py:190: UserWarning: Missing @package directive hydra/output/custom.yaml in file://C:\Users\Administrator\Desktop\project1-jiaohe\DeepKE-main\example\ner\standard\conf. See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/adding_a_package_directive
  warnings.warn(message=msg, category=UserWarning)
(the same "Missing @package directive" warning is emitted again for train.yaml and predict.yaml)
[2022-07-07 16:05:08,405][pytorch_transformers.modeling_utils][INFO] - loading configuration file C:\Users\Administrator\Desktop\project1-jiaohe\DeepKE-main\example\ner\standard/checkpoints/config.json
[2022-07-07 16:05:08,405][pytorch_transformers.modeling_utils][INFO] - Model config {
  "architectures": ["BertForMaskedLM"],
  "attention_probs_dropout_prob": 0.1,
  "directionality": "bidi",
  "finetuning_task": "ner",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_labels": 10,
  "output_attentions": false,
  "output_hidden_states": false,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "pruned_heads": {},
  "torchscript": false,
  "type_vocab_size": 2,
  "vocab_size": 21128
}

[2022-07-07 16:05:08,406][pytorch_transformers.modeling_utils][INFO] - loading weights file C:\Users\Administrator\Desktop\project1-jiaohe\DeepKE-main\example\ner\standard/checkpoints/pytorch_model.bin
[2022-07-07 16:05:09,848][pytorch_transformers.tokenization_utils][INFO] - Model name 'C:\Users\Administrator\Desktop\project1-jiaohe\DeepKE-main\example\ner\standard/checkpoints/' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc). Assuming 'C:\Users\Administrator\Desktop\project1-jiaohe\DeepKE-main\example\ner\standard/checkpoints/' is a path or url to a directory containing tokenizer files.
[2022-07-07 16:05:09,849][pytorch_transformers.tokenization_utils][INFO] - loading file C:\Users\Administrator\Desktop\project1-jiaohe\DeepKE-main\example\ner\standard/checkpoints/vocab.txt
[2022-07-07 16:05:09,849][pytorch_transformers.tokenization_utils][INFO] - loading file C:\Users\Administrator\Desktop\project1-jiaohe\DeepKE-main\example\ner\standard/checkpoints/added_tokens.json
[2022-07-07 16:05:09,849][pytorch_transformers.tokenization_utils][INFO] - loading file C:\Users\Administrator\Desktop\project1-jiaohe\DeepKE-main\example\ner\standard/checkpoints/special_tokens_map.json
[2022-07-07 16:05:09,849][pytorch_transformers.tokenization_utils][INFO] - loading file C:\Users\Administrator\Desktop\project1-jiaohe\DeepKE-main\example\ner\standard/checkpoints/tokenizer_config.json
NER句子: 地对地导弹的例子包括:MGM-140 ATACMS、地面发射的GBU-39小直径炸弹、远程精确射击和东风系列导弹。
NER结果:
Traceback (most recent call last):
  File "C:/Users/Administrator/Desktop/project1-jiaohe/DeepKE-main/example/ner/standard/predict.py", line 13, in main
    result = model.predict(text)
  File "C:\Users\Administrator\Desktop\project1-jiaohe\DeepKE-main\src\deepke\name_entity_re\standard\models\InferBert.py", line 111, in predict
    labels = [(self.label_map[label],confidence) for label,confidence in logits]
  File "C:\Users\Administrator\Desktop\project1-jiaohe\DeepKE-main\src\deepke\name_entity_re\standard\models\InferBert.py", line 111, in <listcomp>
    labels = [(self.label_map[label],confidence) for label,confidence in logits]
KeyError: 0

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Process finished with exit code 1

tlk1997 commented 2 years ago

This happens because the model predicted label id 0 at inference time, while the defined labels are indexed starting from 1. Simply put, your training data is too small and the model isn't well trained.
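To see the mismatch concretely: `InferBert.py` looks each predicted id up in `self.label_map`, so an id of 0 that never appeared in the 1-based map raises `KeyError`. A minimal defensive sketch (the toy `label_map` values and the `safe_labels` helper are hypothetical, not DeepKE code) that falls back to the `O` tag instead of crashing:

```python
# Toy 1-based label map, mimicking the scheme implied by the traceback;
# the actual tags come from the user's training data, not from here.
label_map = {1: "O", 2: "B-LOC", 3: "I-LOC"}

def safe_labels(logits, label_map, fallback="O"):
    """Map (label_id, confidence) pairs to (tag, confidence),
    falling back to the 'O' tag for ids missing from label_map."""
    return [(label_map.get(label_id, fallback), conf)
            for label_id, conf in logits]

# Label id 0 is unknown to the map, so it is mapped to "O" instead of raising.
print(safe_labels([(2, 0.9), (0, 0.4)], label_map))
# → [('B-LOC', 0.9), ('O', 0.4)]
```

This only masks the symptom; as noted above, the underlying fix is more (or better) training data.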

callmeiron commented 2 years ago

Roughly how much data would be needed to train it well? I'm annotating by hand, so I probably can't label very much.

zxlzr commented 2 years ago

Generally it depends on how complex the data is; more than a thousand labeled examples would probably help. You could also use dictionary matching to generate some weakly labeled data, which may reduce the annotation cost.
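The dictionary-matching idea can be sketched roughly as follows (the entity dictionary, the `WEA` weapon tag, and the `weak_label` helper are all illustrative assumptions, not part of DeepKE):

```python
def weak_label(text, entity_dict, tag="WEA"):
    """Produce (char, BIO-tag) pairs by longest-first dictionary matching.
    Characters covered by a dictionary entity get B-/I- tags; the rest get O."""
    tags = ["O"] * len(text)
    # Match longer entities first so e.g. "东风导弹" wins over "东风".
    for entity in sorted(entity_dict, key=len, reverse=True):
        start = text.find(entity)
        while start != -1:
            # Only tag spans not already claimed by a longer entity.
            if all(t == "O" for t in tags[start:start + len(entity)]):
                tags[start] = f"B-{tag}"
                for i in range(start + 1, start + len(entity)):
                    tags[i] = f"I-{tag}"
            start = text.find(entity, start + 1)
    return list(zip(text, tags))

pairs = weak_label("东风系列导弹很有名", {"东风"})
print(pairs[:3])
# → [('东', 'B-WEA'), ('风', 'I-WEA'), ('系', 'O')]
```

The output is in the character-per-line BIO shape that DeepKE's standard NER example expects, though such weak labels are noisy and work best combined with a smaller hand-checked set.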

callmeiron commented 2 years ago

OK, thanks a lot!