Data and source code for the paper "Utilizing Cognitive Signals Generated during Human Reading to Enhance Keyphrase Extraction from Microblogs".
Automatic Keyphrase Extraction (AKE) that relies on a single eye-tracking source is constrained by physiological mechanisms, signal processing techniques and other factors. In this paper, we propose to utilize EEG and eye-tracking signals to enhance AKE from Microblogs. Our work includes the following aspects:
The results verified that cognitive signals generated during human reading enhance AKE. EEG signals exhibit the most significant improvement, while combining EEG and eye-tracking signals showed no further enhancement. The T5-Large model maximizes model performance without weakening the weights of the cognitive signals.
    AKE                                  Root directory
    ├── dataset                          Experimental datasets
    │   ├── ZUCO                         Cognitive datasets
    │   │   ├── test
    │   │   └── train
    │   └── Microblogs                   Microblogs based AKE datasets
    │       ├── Election-Trec            Election-Trec AKE Dataset
    │       │   ├── test
    │       │   └── train
    │       └── General-Twitter          General-Twitter AKE Dataset
    │           ├── test
    │           └── train
    ├── models                           Module of the deep learning models and pre-trained models
    │   ├── pretrain_pt                  Path to store pre-trained model parameters
    │   │   ├── bert.pt
    │   │   └── t5.pt
    │   ├── BILSTM.py                    Baseline model
    │   ├── ATT-BILSTM.py                Soft attention based Bi-LSTM
    │   ├── SATT-BILSTM.py               Self-attention based Bi-LSTM
    │   ├── ATT-BILSTM+CRF.py            Soft attention based Bi-LSTM+CRF
    │   ├── SATT-BILSTM+CRF.py           Self-attention based Bi-LSTM+CRF
    │   ├── SATT-BILSTM+CRF+GloVe.py     Improved model with GloVe embeddings
    │   ├── BERT.py                      Improved model based on BERT model
    │   └── T5.py                        Improved model based on T5 model
    ├── result                           Path to store the results
    │   ├── Election-Trec
    │   └── General-Twitter
    ├── config.py                        Path configuration file
    ├── utils.py                         Some auxiliary functions
    ├── evaluate.py                      Source code for result evaluation
    ├── processing.py                    Source code of preprocessing function
    ├── main.py                          Source code for main function
    └── README.md
In our study, two kinds of data are used: cognitive signal data from human reading behavior and Microblog data for AKE.
In this study, we choose the Zurich Cognitive Language Processing Corpus (ZUCO), which captures eye-tracking signals and EEG signals of 12 adult native speakers reading approximately 1100 English sentences in normal and task reading modes. The raw data is available at: https://osf.io/2urht/.
Only data from the normal reading mode were utilized, to align with natural human reading habits. The reading corpus includes two datasets: 400 movie reviews from the Stanford Sentiment Treebank and 300 paragraphs about celebrities from the Wikipedia Relation Extraction Corpus. We release all our train and test data in the "dataset" directory. In the ZUCO dataset, the cognitive features have been spliced between each word and its corresponding label.
Specifically, 17 eye-tracking features and 8 EEG features were extracted from the dataset.
Election-Trec Dataset
The Election-Trec dataset is derived from the open-source TREC2011 track dataset. The raw data is available at: https://trec.nist.gov/data/tweets/. After removing all "#" symbols, it contains 24,210 training tweets and 6,054 testing tweets.
General-Twitter Dataset
The General-Twitter dataset, developed by Zhang et al. (2016), uses hashtags as the keyphrases of each tweet. The raw data is available at: http://qizhang.info/paper/data/keyphrase_dataset.tar.gz. It consists of 78,760 training tweets and 33,755 testing tweets, with an average sentence length of about 13 words. In the data files, an empty line marks a sentence break, and each consecutive block of lines represents one sentence.
System environment is set up according to the following configuration:
Processing: Run the processing.py file to convert the data into JSON format:
python processing.py
The preprocessed data has a format like: {['word','Value_et1',... ,'Value_et17','Value_eeg1',... ,'Value_eeg8','tag']}
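As a minimal sketch of how one such record could be unpacked, assume each token is a flat list of 27 values in the order shown above (word, 17 eye-tracking values, 8 EEG values, tag). The names `TokenRecord` and `split_record`, the sample values, and the tag "B" are hypothetical and are not taken from processing.py.

```python
from typing import List, NamedTuple

N_ET = 17   # eye-tracking features per word (stated in this README)
N_EEG = 8   # EEG features per word (stated in this README)


class TokenRecord(NamedTuple):
    word: str
    et: List[float]
    eeg: List[float]
    tag: str


def split_record(record: list) -> TokenRecord:
    """Split one flat record into word, ET features, EEG features and tag.

    Assumes the layout [word, et1..et17, eeg1..eeg8, tag], which is
    inferred from the format string above, not from the repository code.
    """
    expected = 1 + N_ET + N_EEG + 1
    if len(record) != expected:
        raise ValueError(f"expected {expected} fields, got {len(record)}")
    word = record[0]
    et = [float(v) for v in record[1:1 + N_ET]]
    eeg = [float(v) for v in record[1 + N_ET:1 + N_ET + N_EEG]]
    tag = record[-1]
    return TokenRecord(word, et, eeg, tag)


# Hypothetical example values, for illustration only.
sample = ["election"] + [0.1] * N_ET + [0.5] * N_EEG + ["B"]
token = split_record(sample)
print(token.word, len(token.et), len(token.eeg), token.tag)
```

Keeping the two feature groups separate at this stage makes it easier to switch individual signal types on or off later.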
Configuration: Configure hyperparameters in the config.py file. There are roughly the following parameters to set:
- modeltype: selects which model to use for training and testing.
- train_path, test_path, vocab_path, save_path: paths of the train data, test data, vocab data and results.
- fs_name, fs_num: name and number of cognitive features.
- run_times: number of repetitions of training and testing.
- epochs: number of times the entire training dataset is passed through the model during training.
- lr: learning rate.
- vocab_size: the size of the vocabulary. 37347 for the Election-Trec dataset, 85535 for General-Twitter.
- embed_dim, hidden_dim: dimensions of the embedding layer and the hidden layer.
- batch_size: number of examples processed together in a single forward/backward pass during training or inference.
- max_length: maximum length (number of tokens) allowed for an input text sequence.
Modeling: Modify the combination of cognitive features added to the model.
For example, the code below adds all 25 features to the model:
input = torch.cat([input, inputs['et'], inputs['eeg']], dim=-1)
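To make the shapes concrete, here is a small sketch. It assumes that `inputs['et']` and `inputs['eeg']` are tensors of shape (batch, seq_len, 17) and (batch, seq_len, 8), and that the tensor being extended holds word embeddings of shape (batch, seq_len, embed_dim). The helper `add_cognitive_features`, its `use_et` / `use_eeg` flags, and all tensor sizes are hypothetical; the repository itself edits the `torch.cat` call directly.

```python
import torch


def add_cognitive_features(embedded, inputs, use_et=True, use_eeg=True):
    """Concatenate the selected cognitive features onto the word embeddings.

    Hypothetical helper written for illustration only.
    """
    parts = [embedded]
    if use_et:
        parts.append(inputs["et"])    # (batch, seq_len, 17)
    if use_eeg:
        parts.append(inputs["eeg"])   # (batch, seq_len, 8)
    # Features are appended along the last (feature) dimension.
    return torch.cat(parts, dim=-1)


batch, seq_len, embed_dim = 2, 5, 64   # made-up sizes
embedded = torch.randn(batch, seq_len, embed_dim)
inputs = {
    "et": torch.randn(batch, seq_len, 17),
    "eeg": torch.randn(batch, seq_len, 8),
}

all_feats = add_cognitive_features(embedded, inputs)                # ET and EEG
eeg_only = add_cognitive_features(embedded, inputs, use_et=False)   # EEG only
no_feats = add_cognitive_features(embedded, inputs, False, False)   # no signals
print(all_feats.shape, eeg_only.shape, no_feats.shape)
```

Any layer that consumes the result has to be sized for the widened last dimension, that is, embed_dim plus the number of added features.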
Training and testing: based on your system, open the terminal in the root directory 'AKE' and type this command:
python main.py
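To use the pre-trained models:
- BERT: set model_type = 8 to call for the BERT model. The cognitive features are combined in the model construction part: outputs = torch.concat((bert_outputs,extra_features[:,:,:]),-1). The pre-trained parameters (bert.pt) are stored under models/pretrain_pt.
- T5: set model_type = 9 to call for the T5 model. The cognitive features are combined in the model construction part: outputs = torch.concat((T5_outputs,extra_features[:,:,:]),-1). The pre-trained parameters (t5.pt) are stored under models/pretrain_pt.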
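The output-level concatenation used for the BERT and T5 models can be sketched as follows. This is an illustration under stated assumptions, not the code in BERT.py or T5.py: the class `CognitiveTaggingHead`, the linear classifier, the number of tags, and all tensor sizes are hypothetical, and random tensors stand in for the encoder outputs. It also assumes the 17 eye-tracking columns precede the 8 EEG columns in `extra_features`.

```python
import torch
from torch import nn


class CognitiveTaggingHead(nn.Module):
    """Hypothetical head: concatenate encoder outputs with cognitive
    features, then project each token to tag scores."""

    def __init__(self, hidden_dim: int, fs_num: int, num_tags: int):
        super().__init__()
        # The linear layer must accept the widened feature dimension.
        self.classifier = nn.Linear(hidden_dim + fs_num, num_tags)

    def forward(self, encoder_outputs, extra_features):
        # Same kind of concatenation as above, along the last dimension.
        outputs = torch.concat((encoder_outputs, extra_features), -1)
        return self.classifier(outputs)


# Made-up sizes: 2 sentences, 6 tokens, 768-dim encoder states, 3 tags.
encoder_outputs = torch.randn(2, 6, 768)
extra_features = torch.randn(2, 6, 25)   # 17 ET + 8 EEG values per token

head_all = CognitiveTaggingHead(hidden_dim=768, fs_num=25, num_tags=3)
logits_all = head_all(encoder_outputs, extra_features)

# Assumption: ET columns come first, so slicing from 17 keeps only EEG.
head_eeg = CognitiveTaggingHead(hidden_dim=768, fs_num=8, num_tags=3)
logits_eeg = head_eeg(encoder_outputs, extra_features[:, :, 17:])
print(logits_all.shape, logits_eeg.shape)
```

Changing the slice on `extra_features` changes the width of the concatenated tensor, so the layer that follows must be resized to match.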
We randomly selected five instances from the Election-Trec dataset and the General-Twitter dataset to visually illustrate the impact of cognitive signals generated during human reading on AKE from Microblogs (refer to Table 3 for details).
In this study, we compared the performance of AKE under four feature combinations: "-", "EEG", "ET", and "ET&EEG". "-" indicates the model without any cognitive processing signals. "EEG" and "ET" represent the model with only EEG signals and only eye-tracking signals, respectively. "ET&EEG" indicates the model that combines eye-tracking and EEG signals.
Note: Bold italics mark the manually annotated correct hashtags of the microblog, blue marks correctly predicted keyphrases, green marks incorrect predictions, and yellow marks words that only partially match the target answers.
To compare the evaluation results more intuitively, we used the following scoring criteria: 10 points for a correct prediction, 3 points for a partially correct prediction, and 0 points for an incorrect prediction. The scores for each feature combination were as follows: "-": 12 points, "EEG": 86 points, "ET": 29 points, and "ET&EEG": 53 points. These results indicate that cognitive signals generated during human reading have a positive impact on AKE from Microblogs. Among them, EEG signals enhance AKE performance more strongly, while eye-tracking signals provide a relatively weaker enhancement.
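The scoring rule above is simple enough to express in a few lines. The sketch below is illustrative only: the function name `score_predictions`, the outcome labels, and the example list are hypothetical and do not reproduce the instances from the paper.

```python
# Points per outcome, as stated in the scoring criteria above.
POINTS = {"correct": 10, "partial": 3, "incorrect": 0}


def score_predictions(outcomes):
    """Sum the points over a list of per-instance outcomes."""
    return sum(POINTS[outcome] for outcome in outcomes)


# Made-up outcomes for four instances: 10 + 3 + 0 + 10.
example = ["correct", "partial", "incorrect", "correct"]
print(score_predictions(example))  # 23
```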
Please cite the following paper if you use this code and dataset in your work.
Xinyi Yan, Yingyi Zhang, Chengzhi Zhang. Utilizing Cognitive Signals Generated during Human Reading to Enhance Keyphrase Extraction from Microblogs. Information Processing and Management, 2024, 61(2): 103614.