
Chinese-Punctuation-Restoration-with-Bert-CNN-RNN


This repository was developed from the backbone of the BertPunc repo. On top of that, we implemented our original idea of a word-level BERT-CNN-RNN model for Chinese punctuation restoration.


Requirements

torch==1.1.0
numpy==1.19.4
scikit_learn==0.23.2
tqdm==4.54.1
transformers==4.0.1
pip install -r requirements.txt

1. Difference from My Previous Repo

Our previous work used a simple BiLSTM network for punctuation restoration. We then tried integrating a CNN and attention with the BiLSTM, but neither showed any improvement for Chinese punctuation. A sequence-to-sequence approach also performed poorly on the Chinese punctuation restoration task.

In this work, we bring in BERT. Since BERT has already been widely used in many works, we make our contribution more meaningful by introducing a word-level concept into the model.

BERT and its variants rely on a character-level tokenizer for Chinese. Unlike an English word tokenizer, which mostly preserves the semantics of each word, a Chinese tokenizer simply splits a sentence into characters, which do not always carry a complete meaning on their own. This greatly limits the model's capability. As you can easily imagine, when a pretrained model is fine-tuned on a task with a character tokenizer, it concentrates more on character-level information, and some word-level relations may even be forgotten.
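To see the difference, compare the character-level view produced by BERT's Chinese tokenizer with a word-level segmentation of the same sentence. This is a minimal illustration only; it assumes the standard `bert-base-chinese` checkpoint and uses the `jieba` segmenter purely as an example of word-level splitting, neither of which is necessarily what this repository uses internally.

```python
# Character-level vs. word-level views of the same Chinese sentence.
# Assumptions: "bert-base-chinese" checkpoint, jieba for word segmentation.
from transformers import BertTokenizer
import jieba

sentence = "今天天气很好我们去公园散步"

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
print(bert_tokenizer.tokenize(sentence))
# -> ['今', '天', '天', '气', ...]  each character becomes a separate token

print(list(jieba.cut(sentence)))
# -> ['今天', '天气', '很', '好', ...]  words keep their full semantics
```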

2. Methods Details

Our model uses two types of features for the final punctuation prediction:

  1. Word-level features: well-designed CNN layers, as shown in Figure 1.
  2. Character-level features: BERT outputs, as shown in Figure 2.
Figure 1. Word-level features (image: word_level_features)
Figure 2. Character-level features (image: char_level_features)

3. Code
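The full implementation lives in this repository. Purely as an illustration of how the two feature streams described above can be fused, here is a minimal PyTorch sketch; the class name, layer sizes, and the assumption that word embeddings are already aligned to character positions are ours, not the repository's actual code.

```python
# Hypothetical sketch: fuse character-level BERT outputs with word-level
# CNN features, then classify punctuation per character position.
import torch
import torch.nn as nn
from transformers import BertModel

class BertCnnRnnPunc(nn.Module):
    def __init__(self, num_classes=4, word_emb_dim=200, cnn_channels=128):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")  # char-level features
        self.word_cnn = nn.Conv1d(word_emb_dim, cnn_channels, kernel_size=3, padding=1)
        self.rnn = nn.LSTM(self.bert.config.hidden_size + cnn_channels,
                           256, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 256, num_classes)

    def forward(self, input_ids, attention_mask, word_embeddings):
        # (batch, seq_len, hidden): character-level contextual features from BERT
        char_feats = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        # (batch, seq_len, cnn_channels): word-level features from a 1D CNN,
        # assuming word embeddings are pre-aligned to character positions
        word_feats = self.word_cnn(word_embeddings.transpose(1, 2)).transpose(1, 2)
        # BiLSTM over the concatenated features, then per-position logits
        fused, _ = self.rnn(torch.cat([char_feats, word_feats], dim=-1))
        return self.classifier(fused)
```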

4. Experiment Results

We conducted experiments on the Chinese TED-talk transcripts from IWSLT2012. We also ran the model on other datasets, such as People's Daily news text and written books. The results vary: well-edited, grammatical text achieves better scores, while speech transcripts score lower. There is still more work to do.

Figure 3. Experiment results (image: experiments_results)