This repo provides the source code & data for our paper "NerCo: A Contrastive Learning based Two-stage Chinese NER Method".
NerCo is our proposed two-stage learning approach for tackling Entity Representation Segmentation in Label-semantics. Unlike traditional sequence labeling methods, which suffer from this problem, our approach takes a two-stage NER strategy: in the first stage, we perform contrastive learning to obtain label-semantics based representations; in the second stage, we finetune the learned model, equipping it with inner-entity position discrimination for chunk tags and a linear mapping to type tags for each token. Our code is modified from the baseline Flat, so we recommend reading their code first to better understand ours. All experiments were run on an NVIDIA A100 80GB GPU.
Figure 1: A comparison between traditional sequence labeling methods and our proposed method NerCo.
Figure 2: Contrastive representation learning as the first stage of NerCo.
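To make the two stages concrete, here is a minimal PyTorch sketch of the training objectives. The names and sizes below (`supcon_loss`, `chunk_head`, `type_head`, the tag inventories) are illustrative assumptions, not the exact implementation in this repo:

```python
import torch
import torch.nn.functional as F

def supcon_loss(reps, labels, temperature=0.1):
    """Supervised contrastive loss over token representations (stage one).

    Tokens sharing a gold label act as positives for each other;
    all other tokens in the batch act as negatives.
    reps:   (N, d) token representations from the encoder
    labels: (N,)   gold label ids
    """
    reps = F.normalize(reps, dim=-1)                 # cosine-similarity space
    sim = reps @ reps.t() / temperature              # (N, N) similarity logits
    n = reps.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=reps.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask

    sim = sim.masked_fill(self_mask, float('-inf'))            # drop self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_count = pos_mask.sum(1).clamp(min=1)                   # avoid div by zero
    return -(log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_count).mean()

# Stage two (sketch): finetune the encoder with two token-level heads, one
# discriminating inner-entity positions (chunk tags) and one mapping to type
# tags. The dimensions and tag inventory sizes below are placeholders.
d_model, n_chunk_tags, n_type_tags = 768, 5, 9
chunk_head = torch.nn.Linear(d_model, n_chunk_tags)
type_head = torch.nn.Linear(d_model, n_type_tags)

def second_stage_loss(token_reps, chunk_gold, type_gold):
    return (F.cross_entropy(chunk_head(token_reps), chunk_gold)
            + F.cross_entropy(type_head(token_reps), type_gold))
```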
source activate # use conda
conda create --name nerco python=3.7.3 # create a virtual environment named nerco
conda activate nerco # activate
pip3 install torch==1.8.2+cu111 torchvision==0.9.2+cu111 torchaudio==0.8.2 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html # torch
pip3 install -r requirements.txt # requirement file
cp fastnlp_src/* ~/.conda/envs/nerco/lib/python3.7/site-packages/fastNLP/core/. # overwrite the fastNLP sources (adjust the path if your conda environment lives elsewhere)
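After the overwrite step, a quick sanity check that the pinned torch build sees the GPU and that fastNLP imports cleanly:

```python
import torch
import fastNLP  # should import cleanly after overwriting its core files above

print("torch:", torch.__version__)             # expect 1.8.2+cu111
print("cuda available:", torch.cuda.is_available())
```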
1. Embeddings.
Download the character embeddings and word embeddings (provided by Flat) and put them into the data/word subdirectory.
- Character and bigram embeddings (gigaword_chn.all.a2b.{'uni' or 'bi'}.ite50.vec): Google Drive or Baidu Pan
- Word (lattice) embeddings:
  - yj (ctb.50d.vec): Google Drive or Baidu Pan
  - ls (sgns.merge.word.bz2): Baidu Pan
cd data
python preprocess.py
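To verify the downloads, you can peek at one of the embedding files. This sketch assumes the usual plain-text format for these embeddings (one token followed by its vector per line; if your file starts with a vocab-size/dimension header line, skip that line first):

```python
# Inspect the first line of the unigram character embedding file.
path = "data/word/gigaword_chn.all.a2b.uni.ite50.vec"
with open(path, encoding="utf-8") as f:
    fields = f.readline().rstrip("\n").split()
print("token:", fields[0], "| dim:", len(fields) - 1)  # expect dim 50
```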
2. Datasets.
Download the datasets here (with the MSRA train/test splits already preprocessed). For Ontonotes, you can download the dataset and preprocess the train split yourself; due to copyright and permission restrictions, we are unable to release our processed Ontonotes dataset. See Flat for more details on preprocessing MSRA and Ontonotes.
Put each dataset into data/datasets/dataName (e.g. data/datasets/weibo for the Weibo NER dataset).
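A quick check that a dataset landed in the right place (the directory name follows the convention above; the files inside depend on which dataset you downloaded):

```python
import os

dataset_dir = "data/datasets/weibo"  # adjust dataName as needed
assert os.path.isdir(dataset_dir), f"missing {dataset_dir}"
print(os.listdir(dataset_dir))       # should list the train/dev/test files
```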
You can evaluate our trained checkpoints (download here). Put each dataset's checkpoint into checkpoints/dataName and directly execute the following commands:
cd evaluate
python weibo.py # taking the weibo dataset as an evaluation example
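If you want to inspect a downloaded checkpoint before evaluating, and assuming the checkpoints are standard torch-serialized files, something like the following works (the glob is just to avoid guessing the exact file name; loading a fully pickled model requires the repo's classes to be importable, so run it from the repo root):

```python
import glob
import torch

# Pick whatever file sits under checkpoints/weibo (file names may vary).
path = sorted(glob.glob("checkpoints/weibo/*"))[0]
ckpt = torch.load(path, map_location="cpu")
print(type(ckpt))  # a full model object or a state dict, depending on how it was saved
```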
Or you can train the models from scratch (see the next section) and modify the corresponding model path in the evaluation Python scripts.
cd train
python weibo.py # taking weibo as a training example
Model Performance:

Datasets | Resume | Weibo | MSRA | Ontonotes
---|---|---|---|---
Test F1 | 0.968196 | 0.727924 | 0.962927 | 0.836158
Our code is based on the code of Flat. Thanks a lot!