Released code for the paper "Learning to Watermark LLM-generated Text via Reinforcement Learning".
Cite:
@article{xu2024learning,
title={Learning to Watermark LLM-generated Text via Reinforcement Learning},
author={Xu, Xiaojun and Yao, Yuanshun and Liu, Yang},
journal={arXiv preprint arXiv:2403.10553},
year={2024}
}
Install the dependencies:
pip install -r requirements.txt
We can watermark two types of models: completion models and Q&A (instruction-finetuned) models.
Train a watermarked OPT-1.3b model with a paired OPT-350m detector on the c4 dataset:
python pretrain_detector.py --model opt-350m --dataset c4 --gen_dataset # Pretraining step for the detector
deepspeed --num_gpus 1 main.py --actor_model opt-1.3b --reward_model opt-350m --do_sample --use_lora --with_tensorboard
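At a high level, main.py co-trains the two models: the actor (the watermarked LLM) is rewarded when the paired detector flags its output as watermarked, while the detector learns to separate model text from human text. The toy sketch below illustrates only this reward signal; all names in it (detector_score, actor_reward, the "wm" cue) are illustrative stand-ins, not the repo's API.

```python
# Toy illustration of the co-training signal (not the repo's implementation).
def detector_score(text, cue="wm"):
    # Stand-in detector: fraction of tokens carrying a hypothetical watermark cue.
    tokens = text.split()
    return sum(tok == cue for tok in tokens) / max(len(tokens), 1)

def actor_reward(text):
    # RL reward for the actor: the detector's confidence the text is watermarked.
    return detector_score(text)

watermarked_sample = "wm the quick wm brown fox wm"
human_sample = "the quick brown fox jumps over"
print(actor_reward(watermarked_sample) > actor_reward(human_sample))  # True
```

In the actual pipeline the detector is a trained OPT-350m classifier and the actor is updated with PPO via DeepSpeed; the sketch only conveys the direction of the gradient signal.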
Other settings:
- To train Llama-2 models instead, replace opt-1.3b with llama2-7b and replace opt-350m with llama2-1.1b.
- To freeze the LLM and train only the detector, set --lr 0 --lora_lr 0.
- To apply random word substitution (e.g., at a 0.2 ratio), add --substitute_ratio 0.2.
- To apply a paraphrasing perturbation, add --paraphraser pegasus1.5.
- To co-train with another LLM (the H+L setting in Table 2), set --other_llm llama2-7b.
There are two extra steps when adding a watermark during alignment tasks (the experiments using PKU alignment data in the paper). First, we SFT the model, following the first step of the conventional alignment pipeline:
python pretrain_sft.py --model opt-1.3b --learn_steps 10000 --use_lora
Next, we pretrain the detector as before:
python pretrain_detector.py --model opt-350m --dataset PKU --gen_dataset # Pretraining step for the detector
Then we need a reward model in order to RLHF the model and embed the watermark while training the detector. You can follow the script in the DeepSpeed examples to train a reward model, or write a script similar to pretrain_detector.py to obtain a reward-model checkpoint on the PKU dataset. Either way, the next step assumes you have put the reward model checkpoint under ./deepspeed_ckpt/opt-350m. Then we run the co-training:
deepspeed --num_gpus 1 main_in_alignment.py --actor_model opt-1.3b --reward_model opt-350m --do_sample --use_lora --with_tensorboard --rlhf_wtm_lamda 0.5
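Presumably --rlhf_wtm_lamda balances the RLHF reward against the watermark-detection reward during co-training. A hedged sketch of such a weighted combination follows; the linear form and the function name combined_reward are assumptions for illustration, not the repo's exact objective.

```python
def combined_reward(rlhf_reward, wtm_reward, lamda=0.5):
    # Weighted trade-off between alignment quality and watermark detectability.
    # This linear form is an assumption, not necessarily the repo's objective.
    return (1 - lamda) * rlhf_reward + lamda * wtm_reward

# With lamda=0.5 the two signals contribute equally.
print(combined_reward(1.0, 0.0))  # 0.5
```

Larger lamda values would push the policy toward detectability at the cost of the alignment reward, which is the trade-off the paper's co-training manages.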
Other settings are the same as before.
The training script above reports the detection AUC without perturbation. To evaluate model performance under different perturbations:
python evaluate.py --dataset {dataset} --model_path {path_to_model}
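Detection quality is reported as an AUC over detector scores. Independent of the repo's evaluation code, a minimal rank-based AUC (probability that a watermarked sample outscores an unwatermarked one, ties counted half) can be computed as below; the scores shown are hypothetical.

```python
def auc(pos_scores, neg_scores):
    """ROC AUC via pairwise comparison: P(pos > neg), ties count 0.5."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical detector scores: higher = more likely watermarked.
watermarked = [0.9, 0.8, 0.7]
plain = [0.6, 0.8, 0.1]
print(round(auc(watermarked, plain), 4))  # 0.8333
```

An AUC of 1.0 means the detector perfectly separates watermarked from plain text; 0.5 means it is no better than chance.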