yuzhimanhua / Multi-BioNER

Cross-type Biomedical Named Entity Recognition with Deep Multi-task Learning (Bioinformatics'19)
https://arxiv.org/abs/1801.09851
Apache License 2.0

How to output only one result file during prediction? #22

Closed zhongxiangboy closed 2 years ago

zhongxiangboy commented 2 years ago

When I use the provided pre-trained model for prediction, it outputs five result files (which seem to correspond to the five datasets used for training).

How can I output only one result file?

Do I need to merge all five datasets into one and then predict with a model trained on the merged data?

yuzhimanhua commented 2 years ago

Yes, when you have N training datasets, there will be N output files corresponding to the N datasets. This is because we are doing multi-task learning with each dataset as a task. Note that these N output files may have conflicts (e.g., the same token may be predicted as S-GENE in output 1 but S-CHEMICAL in output 2). Outputting only 1 file (with conflicts resolved) is beyond the scope of this project.
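To make the kind of conflict described above concrete, here is a minimal Python sketch that flags tokens tagged with different entity types by different outputs. It only detects conflicts and does not resolve them. The function names, the in-memory data layout, and the example sentence are hypothetical and are not part of this project; the sketch assumes each output has already been parsed into one tag per token, with tags written as `PREFIX-TYPE` (as in S-GENE) or "O".

```python
from typing import Dict, List, Tuple


def entity_type(tag: str) -> str:
    """Return the entity type of a tag, e.g. 'S-GENE' -> 'GENE', 'O' -> 'O'."""
    return tag.split("-", 1)[1] if "-" in tag else tag


def find_conflicts(
    tokens: List[str], predictions: Dict[str, List[str]]
) -> List[Tuple[int, str, Dict[str, str]]]:
    """List tokens that different outputs tag with different entity types.

    `predictions` maps a (hypothetical) output name to one tag per token.
    """
    conflicts = []
    for i, token in enumerate(tokens):
        # Collect the entity type from every output that did not predict "O".
        types = {
            name: entity_type(tags[i])
            for name, tags in predictions.items()
            if tags[i] != "O"
        }
        if len(set(types.values())) > 1:
            conflicts.append((i, token, types))
    return conflicts


# Made-up example: two outputs disagree on the third token.
tokens = ["Aspirin", "inhibits", "COX-2", "."]
predictions = {
    "output_1": ["O", "O", "S-GENE", "O"],
    "output_2": ["S-CHEMICAL", "O", "S-CHEMICAL", "O"],
}
conflicts = find_conflicts(tokens, predictions)
print(conflicts)
```

A token tagged by only one output (like "Aspirin" here) is not a conflict; only tokens with two or more different non-"O" types are reported.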

Merging all training sets into one will not work because it introduces many false-negative training samples. For example, if the first training set annotates only GENE entities, then every CHEMICAL mention in that set is labeled "O", and the merged data would teach the model that those mentions are not entities.
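The false-negative problem can be illustrated with a small Python sketch. Everything in it is made up for illustration: the two toy datasets, the `merge_naively` helper, and the token-level check are assumptions, not code or data from this project. The check is also deliberately crude, since it ignores context and only looks at whether the same token string receives different labels.

```python
from typing import Dict, List, Set, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs


def merge_naively(datasets: Dict[str, List[Sentence]]) -> List[Sentence]:
    """Concatenate all datasets, ignoring which entity types each one annotates."""
    merged: List[Sentence] = []
    for sentences in datasets.values():
        merged.extend(sentences)
    return merged


# Toy data: the first set annotates only GENE, the second only CHEMICAL.
gene_only = [[("Aspirin", "O"), ("inhibits", "O"), ("COX-2", "S-GENE")]]
chem_only = [[("Aspirin", "S-CHEMICAL"), ("is", "O"), ("cheap", "O")]]

merged = merge_naively({"gene": gene_only, "chem": chem_only})

# Collect every label each token string receives in the merged data.
labels: Dict[str, Set[str]] = {}
for sentence in merged:
    for token, tag in sentence:
        labels.setdefault(token, set()).add(tag)

# Tokens with more than one label expose the contradictory supervision.
contradictory = {tok: tags for tok, tags in labels.items() if len(tags) > 1}
print(contradictory)
```

In this toy case "Aspirin" ends up labeled both "O" and "S-CHEMICAL": the "O" comes from the GENE-only set, where chemicals were simply never annotated.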

To achieve the goal you are expecting, as far as I know, you may refer to the following paper:

Marginal Likelihood Training of BiLSTM-CRF for Biomedical Named Entity Recognition from Disjoint Label Sets.
Paper: https://aclanthology.org/D18-1306.pdf
Code: https://github.com/ngreenberg/em-crf