yan-xinyi / AKE

Utilizing Cognitive Signals Generated during Human Reading to Enhance Keyphrase Extraction from Microblogs

Overview

Data and source code for the paper "Utilizing Cognitive Signals Generated during Human Reading to Enhance Keyphrase Extraction from Microblogs".

Automatic Keyphrase Extraction (AKE) that relies on eye-tracking as its only cognitive signal source is constrained by physiological mechanisms, signal processing techniques, and other factors. In this paper, we propose using both EEG and eye-tracking signals to enhance AKE from microblogs. Our work includes the following aspects:

The results verify that cognitive signals generated during human reading enhance AKE. EEG signals yield the most significant improvement, while combining EEG and eye-tracking signals showed no further enhancement. The T5-Large model achieves the best performance without weakening the weights of the cognitive signals.

Directory Structure

AKE                                          Root directory
├── dataset                                  Experimental datasets
│   ├── ZUCO                                 Cognitive datasets
│   │    ├── test
│   │    └── train
│   └── Microblogs                           Microblogs based AKE datasets
│        ├── Election-Trec                   Election-Trec AKE Dataset
│        │     ├── test
│        │     └── train
│        └── General-Twitter                 General-Twitter AKE Dataset                 
│              ├── test
│              └── train
├── models                                   Module of the deep learning models and pre-trained models
│   ├── pretrain_pt                          Path to store pre-trained model parameters
│   │    ├── bert.pt
│   │    └── t5.pt
│   ├── BILSTM.py                            Baseline model
│   ├── ATT-BILSTM.py                        soft attention based Bi-LSTM
│   ├── SATT-BILSTM.py                       self-attention based Bi-LSTM
│   ├── ATT-BILSTM+CRF.py                    soft attention based Bi-LSTM+CRF
│   ├── SATT-BILSTM+CRF.py                   self-attention based Bi-LSTM+CRF
│   ├── SATT-BILSTM+CRF+GloVe.py             Improved model with GloVe Embeddings
│   ├── BERT.py                              Improved model based on BERT model
│   └── T5.py                                Improved model based on T5 model 
├── result                                   Path to store the results
│   ├── Election-Trec
│   └── General-Twitter
├── config.py                                Path configuration file
├── utils.py                                 Some auxiliary functions
├── evaluate.py                              Source code for result evaluation
├── processing.py                            Source code of preprocessing function
├── main.py                                  Source code for main function
└── README.md

Dataset Description

In our study, two kinds of data are used: cognitive signal data recorded during human reading, and microblog AKE data.

1. Cognitive Signal Data -- ZUCO Dataset

In this study, we use the Zurich Cognitive Language Processing Corpus (ZUCO), which captures eye-tracking and EEG signals of 12 adult native speakers reading approximately 1100 English sentences in normal and task reading modes. The raw data are available at: https://osf.io/2urht/.

Only data from the normal reading mode were used, to align with natural human reading habits. The reading corpus includes two datasets: 400 movie reviews from the Stanford Sentiment Treebank and 300 paragraphs about celebrities from the Wikipedia Relation Extraction Corpus. We release all of our train and test data in the “dataset” directory. In the ZUCO dataset, the cognitive features are placed between each word and its corresponding label.

Specifically, 17 eye-tracking features and 8 EEG features were extracted from the dataset:

Table 1. Summary of Eye-Tracking Features
Table 2. Summary of EEG Features

2. AKE Dataset

Requirements

System environment is set up according to the following configuration:

Quick Start

Implementation Steps for Bi-LSTM-based AKE

  1. Processing: Run the processing.py file to convert the data into JSON format: python processing.py

    After preprocessing, the data has the following format: {['word','Value_et1',... ,'Value_et17','Value_eeg1',... ,'Value_eeg8','tag']}

  2. Configuration: Configure hyperparameters in the config.py file. The main parameters are:

    • modeltype: selects which model to use for training and testing.
    • train_path, test_path, vocab_path, save_path: paths of the training data, test data, vocabulary and results.
    • fs_name, fs_num: names and number of cognitive features.
    • run_times: number of times training and testing are repeated.
    • epochs: number of passes over the entire training dataset during training.
    • lr: learning rate.
    • vocab_size: size of the vocabulary: 37347 for the Election-Trec dataset, 85535 for General-Twitter.
    • embed_dim, hidden_dim: dimensions of the embedding layer and the hidden layer.
    • batch_size: number of examples processed together in a single forward/backward pass.
    • max_length: maximum length (number of tokens) allowed for an input text sequence.
  3. Modeling: Modify the combination of cognitive features added to the model. For example, the following line adds all 25 features to the model: input = torch.cat([input, inputs['et'], inputs['eeg']], dim=-1)

  4. Training and testing: open a terminal in the root directory 'AKE' and run: python main.py
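
The record layout from step 1 and the feature combination from step 3 can be illustrated with a small, dependency-free sketch. It assumes that each token is stored as a flat list of 27 fields (word, 17 eye-tracking values, 8 EEG values, tag); the helper names `split_record` and `combine`, the tag `"B"`, and all numeric values are hypothetical and are not taken from processing.py. Plain Python lists stand in for tensors, so the list concatenation plays the role of torch.cat along the last dimension.

```python
from typing import List, Sequence, Tuple

N_ET = 17   # eye-tracking features per token (as stated in this README)
N_EEG = 8   # EEG features per token (as stated in this README)


def split_record(record: Sequence) -> Tuple[str, List[float], List[float], str]:
    """Split one token record into (word, et, eeg, tag).

    Assumed layout: [word, et1..et17, eeg1..eeg8, tag].
    """
    expected = 1 + N_ET + N_EEG + 1
    if len(record) != expected:
        raise ValueError(f"expected {expected} fields, got {len(record)}")
    word = record[0]
    et = [float(v) for v in record[1:1 + N_ET]]
    eeg = [float(v) for v in record[1 + N_ET:1 + N_ET + N_EEG]]
    tag = record[-1]
    return word, et, eeg, tag


def combine(embedding: List[float], et: List[float], eeg: List[float],
            combo: str = "ET&EEG") -> List[float]:
    """Append the selected cognitive features to a word embedding."""
    if combo == "-":
        return list(embedding)
    if combo == "ET":
        return list(embedding) + et
    if combo == "EEG":
        return list(embedding) + eeg
    if combo == "ET&EEG":
        return list(embedding) + et + eeg
    raise ValueError(f"unknown feature combination: {combo}")


# Hypothetical token record with made-up feature values.
record = ["vote"] + [0.1] * N_ET + [0.2] * N_EEG + ["B"]
word, et, eeg, tag = split_record(record)
embedding = [0.0] * 4  # toy 4-dimensional embedding
vector = combine(embedding, et, eeg, "ET&EEG")
print(word, tag, len(vector))  # vote B 29
```

Whatever combination is chosen, the input size of the layer that follows has to grow by the number of appended features (17, 8 or 25).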

Implementation Steps for Large Language Model (LLM)-based AKE

  1. BERT: set model_type = 8 to select the BERT model:
    • Cognitive signals are added in the model construction part: outputs = torch.concat((bert_outputs,extra_features[:,:,:]),-1).
    • Set the number of epochs to 5 and train the model. Save the model parameters with the best F1 value under models/pretrain_pt.
    • When testing, the model parameters are read from models/pretrain_pt.
  2. T5-Base: set model_type = 9 to select the T5 model:
    • Set parameter weight = 't5-base'.
    • Cognitive signals are added in the model construction part: outputs = torch.concat((T5_outputs,extra_features[:,:,:]),-1).
    • Set the number of epochs to 5 and train the model. Save the model parameters with the best F1 value under models/pretrain_pt.
    • When testing, the model parameters are read from models/pretrain_pt.
  3. T5-Large: set model_type = 9 to select the T5 model:
    • Unlike T5-Base, set parameter weight = 't5-large'.
    • The other steps are the same as for T5-Base.
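
The "train for 5 epochs and keep the parameters with the best F1" step above can be sketched as a small loop. This is only an illustration of the selection logic, not the code in main.py: `train_keep_best` and its three callbacks are hypothetical names, and the F1 values in the demo are made up. In a real PyTorch run, the snapshot callback would typically return the model's state_dict, which is then written under models/pretrain_pt.

```python
from typing import Callable, Dict, Optional, Tuple


def train_keep_best(train_one_epoch: Callable[[int], None],
                    evaluate_f1: Callable[[], float],
                    snapshot: Callable[[], Dict],
                    epochs: int = 5) -> Tuple[Optional[Dict], float]:
    """Train for `epochs` epochs and keep the snapshot with the highest F1."""
    best_state: Optional[Dict] = None
    best_f1 = float("-inf")
    for epoch in range(epochs):
        train_one_epoch(epoch)
        f1 = evaluate_f1()
        if f1 > best_f1:          # strictly better: keep the earliest best epoch
            best_f1 = f1
            best_state = snapshot()
    return best_state, best_f1


# Demo with stand-in callbacks and made-up F1 scores.
f1_values = iter([0.10, 0.30, 0.25, 0.40, 0.35])
state = {"epoch": -1}


def fake_train(epoch: int) -> None:
    state["epoch"] = epoch        # pretend the parameters changed


def fake_eval() -> float:
    return next(f1_values)


def fake_snapshot() -> Dict:
    return dict(state)            # copy, so later epochs do not overwrite it


best_state, best_f1 = train_keep_best(fake_train, fake_eval, fake_snapshot, epochs=5)
print(best_state, best_f1)  # {'epoch': 3} 0.4
```

Taking a copy in the snapshot matters: if the live parameters were stored by reference, the "best" checkpoint would silently track the last epoch instead.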

Case Study

We randomly selected five instances from the Election-Trec dataset and the General-Twitter dataset to visually illustrate the impact of cognitive signals generated during human reading on AKE from Microblogs (refer to Table 3 for details).

In this study, we compared AKE performance under four feature combinations: "-", "EEG", "ET", and "ET&EEG". "-" indicates the model without any cognitive signals. "EEG" and "ET" represent the model with only EEG signals and only eye-tracking signals, respectively. "ET&EEG" indicates the model that combines eye-tracking and EEG signals.

Table 3. Example of AKE incorporating Cognitive Signals Generated during Human Reading

Note: Bold italic text marks the manually annotated correct hashtags in the microblog, blue marks correctly predicted keyphrases, green marks incorrect predictions, and yellow marks words that only partially match the target answers.

To compare the evaluation results more intuitively, we used the following scoring criteria: 10 points for a correct prediction, 3 points for a partially correct prediction, and 0 points for an incorrect prediction. The scores for each feature combination were as follows: "-": 12 points, EEG: 86 points, ET: 29 points, and ET&EEG: 53 points. These results indicate that cognitive signals generated during human reading have a positive impact on AKE from microblogs. Among them, EEG signals show a stronger enhancement of AKE performance, while eye-tracking signals show a relatively weaker one.
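
The scoring rule above is a simple weighted count, which can be written as a short sketch. The outcome labels and the example list below are hypothetical; they are not the per-instance judgments behind Table 3 and do not reproduce the reported totals.

```python
from typing import Iterable

# Points per prediction outcome, as described in the scoring criteria.
POINTS = {"correct": 10, "partial": 3, "incorrect": 0}


def score(outcomes: Iterable[str]) -> int:
    """Sum the points for a sequence of prediction outcomes."""
    return sum(POINTS[outcome] for outcome in outcomes)


# Made-up outcomes for one feature combination.
example = ["correct", "partial", "incorrect", "correct"]
print(score(example))  # 23
```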

Citation

Please cite the following paper if you use this code and dataset in your work.

Xinyi Yan, Yingyi Zhang, Chengzhi Zhang. Utilizing Cognitive Signals Generated during Human Reading to Enhance Keyphrase Extraction from Microblogs. Information Processing and Management, 2024, 61(2): 103614.