This repository implements a prompt tuning model for hierarchical text classification. This work has been accepted as the long paper "HPT: Hierarchy-aware Prompt Tuning for Hierarchical Text Classification" at EMNLP 2022.
Please download the original datasets and then use the following scripts.

The original Web of Science dataset can be acquired from the repository of HDLTex. The preprocessing code follows the repository of HiAGM, and we provide a copy of it here. Please save the Excel data file `Data.xlsx` in `WebOfScience/Meta-data` as `Data.txt`.
```shell
cd data/WebOfScience
python preprocess_wos.py
python data_wos.py
```
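If you prefer to script the Excel-to-text step above, here is a minimal sketch using pandas. The tab-separated output is an assumption; check `preprocess_wos.py` for the exact layout it expects before relying on it.

```python
# Sketch: export WebOfScience/Meta-data/Data.xlsx as Data.txt.
# Assumption: a tab-separated dump of the sheet matches what
# preprocess_wos.py expects; verify against the script before use.
import pandas as pd  # reading .xlsx requires openpyxl

df = pd.read_excel("WebOfScience/Meta-data/Data.xlsx")
df.to_csv("WebOfScience/Meta-data/Data.txt", sep="\t", index=False)
```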
The original NYT dataset (LDC2008T19) can be acquired here. Place the unzipped folder `nyt_corpus` inside `data/nyt` (or unzip `nyt_corpus_LDC2008T19.tgz` inside `data/nyt`).
```shell
cd data/nyt
# unzip if necessary
# tar -zxvf nyt_corpus_LDC2008T19.tgz -C ./
python data_nyt.py
```
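As a cross-platform alternative to the commented `tar` command above, a minimal Python sketch (run from the repository root) that unpacks the archive so that `data/nyt/nyt_corpus` exists:

```python
# Sketch: extract nyt_corpus_LDC2008T19.tgz inside data/nyt,
# equivalent to the tar command shown above.
import tarfile

with tarfile.open("data/nyt/nyt_corpus_LDC2008T19.tgz", "r:gz") as archive:
    archive.extractall(path="data/nyt")
```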
For RCV1-V2, the preprocessing code refers to the repository of reuters_loader, and we provide a copy here. The original dataset can be acquired here by signing an agreement. Place `rcv1.tar.xz` and `lyrl2004_tokens_train.dat` (which can be downloaded here) inside `data/rcv1`.
```shell
cd data/rcv1
python preprocess_rcv1.py ./
python data_rcv1.py
```
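A quick sanity check before running the two scripts above, confirming the required inputs are in place (file names as listed above):

```python
# Sketch: verify that the RCV1-V2 inputs are present in data/rcv1
# before running preprocess_rcv1.py and data_rcv1.py.
from pathlib import Path

required = ["rcv1.tar.xz", "lyrl2004_tokens_train.dat"]
missing = [name for name in required if not (Path("data/rcv1") / name).exists()]
if missing:
    raise FileNotFoundError(f"Missing files in data/rcv1: {missing}")
print("All RCV1-V2 inputs found.")
```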
```
usage: train.py [-h] [--lr LR] [--data DATA] [--batch BATCH] [--early-stop EARLY_STOP] [--device DEVICE] --name NAME [--update UPDATE]
                [--model MODEL] [--wandb] [--arch ARCH] [--layer LAYER] [--graph GRAPH] [--prompt-loss] [--low-res] [--seed SEED]

optional arguments:
  -h, --help            show this help message and exit
  --lr LR               Learning rate. Default: 3e-5.
  --data {WebOfScience,nyt,rcv1}
                        Dataset.
  --batch BATCH         Batch size.
  --early-stop EARLY_STOP
                        Epoch before early stop.
  --device DEVICE       cuda or cpu. Default: cuda.
  --name NAME           A name for different runs.
  --update UPDATE       Gradient accumulate steps.
  --wandb               Use wandb for logging.
  --seed SEED           Random seed.
```
Checkpoints are saved in `./checkpoints/DATA-NAME`. Two checkpoints are kept, based on macro-F1 and micro-F1 respectively (`checkpoint_best_macro.pt` and `checkpoint_best_micro.pt`).
Example:
```shell
python train.py --name test --batch 16 --data WebOfScience
```
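If a batch of 16 does not fit in GPU memory, `--update` (gradient accumulation steps) can be combined with a smaller `--batch`; assuming the effective batch size is `--batch` × `--update`, the following keeps it at 16:

```shell
python train.py --name test --batch 8 --update 2 --data WebOfScience
```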
We experiment on a single GeForce RTX 3090 GPU (24 GB) with CUDA 11.2. We use a batch size of 16 to make full use of one GPU.
The model trains for around 20 epochs before early stopping, at roughly 10 minutes per epoch.
Our model introduces no extra hyperparameters: all hyperparameters follow previous work and have not been tuned.
Checkpoints for each dataset can be downloaded here. Place the `checkpoints` folder inside the main folder (`HPT/checkpoints`). The results below are those reported in the main experiment.
Dataset | Macro-F1 | Micro-F1 |
---|---|---|
NYT | 0.7041934624814794 | 0.8041512855978236 |
RCV1-V2 | 0.6953327068021089 | 0.8726110320904367 |
Web Of Science | 0.8192644031945633 | 0.8715855067014047 |
```
usage: test.py [-h] [--device DEVICE] [--batch BATCH] --name NAME [--extra {_macro,_micro}]

optional arguments:
  -h, --help            show this help message and exit
  --device DEVICE
  --batch BATCH         Batch size.
  --name NAME           Name of checkpoint. Commonly as DATA-NAME.
  --extra {_macro,_micro}
                        An extra string in the name of checkpoint. Default: _macro
```
Use `--extra _macro` or `--extra _micro` to choose between `checkpoint_best_macro.pt` and `checkpoint_best_micro.pt`, respectively.
For example, to test the model trained in the previous example:

```shell
python test.py --name WebOfScience-test --batch 64
```
Test on provided checkpoints:

```shell
python test.py --name WebOfScience-HPT --batch 64
python test.py --name rcv1-HPT --batch 64
python test.py --name nyt-HPT --batch 64
```
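To evaluate all provided checkpoints in one pass, a minimal Python sketch using `subprocess` (assuming both the `_macro` and `_micro` checkpoints exist for each dataset):

```python
# Sketch: run test.py for every provided checkpoint and for both
# model selections (best macro-F1 and best micro-F1).
import subprocess

for name in ["WebOfScience-HPT", "rcv1-HPT", "nyt-HPT"]:
    for extra in ["_macro", "_micro"]:
        subprocess.run(
            ["python", "test.py", "--name", name, "--batch", "64", "--extra", extra],
            check=True,
        )
```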
```bibtex
@inproceedings{wang-etal-2022-hpt,
    title = "{HPT}: Hierarchy-aware Prompt Tuning for Hierarchical Text Classification",
    author = "Wang, Zihan and
      Wang, Peiyi and
      Liu, Tianyu and
      Lin, Binghuai and
      Cao, Yunbo and
      Sui, Zhifang and
      Wang, Houfeng",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.246",
    doi = "10.18653/v1/2022.emnlp-main.246",
    pages = "3740--3751",
}
```