
Learning to Watermark LLM

Released code for the paper "Learning to Watermark LLM-generated Text via Reinforcement Learning" (arXiv:2403.10553).

Cite:

@article{xu2024learning,
  title={Learning to Watermark LLM-generated Text via Reinforcement Learning},
  author={Xu, Xiaojun and Yao, Yuanshun and Liu, Yang},
  journal={arXiv preprint arXiv:2403.10553},
  year={2024}
}

Prerequisites

We can watermark two types of models: completion models and Q&A (instruction-finetuned) models.

Watermarking Prompt Completion Models

Train a watermarked OPT-1.3b model with a paired OPT-350m detector on the c4 dataset:

python pretrain_detector.py --model opt-350m --dataset c4 --gen_dataset  # Pretraining step for the detector
deepspeed --num_gpus 1 main.py --actor_model opt-1.3b --reward_model opt-350m --do_sample --use_lora --with_tensorboard

Other settings can be adjusted through the remaining command-line flags of main.py.
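
For intuition, the detector pretraining step boils down to binary classification: the OPT-350m detector learns to separate watermarked text from non-watermarked text. Below is a minimal sketch of that objective; the data and training loop are illustrative stand-ins, not the repo's actual API.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
detector = AutoModelForSequenceClassification.from_pretrained(
    "facebook/opt-350m", num_labels=2)  # class 1 = watermarked, class 0 = not

# Toy stand-in for the generated pretraining data.
texts = ["a watermarked model generation ...", "a human-written reference ..."]
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(detector.parameters(), lr=1e-5)
batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
loss = detector(**batch, labels=labels).loss  # cross-entropy over the 2 classes
loss.backward()
optimizer.step()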

Watermarking Q&A Models together with Alignment

There are two extra steps when adding a watermark during alignment (the experiments in the paper use the PKU alignment data). First, we SFT the model, following the first step of the conventional alignment pipeline:

python pretrain_sft.py --model opt-1.3b --learn_steps 10000 --use_lora
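
Under the hood, SFT here is the standard causal-LM objective on demonstration data. A minimal sketch using LoRA adapters via Hugging Face peft follows; the demonstration text and LoRA settings are illustrative assumptions, not necessarily what pretrain_sft.py does.

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
# LoRA adapters on the attention projections, in the spirit of --use_lora.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"))

# One hypothetical demonstration pair formatted as a single sequence.
text = "Human: How do I stay safe online?\nAssistant: Use strong passwords ..."
batch = tokenizer(text, return_tensors="pt")
# Standard SFT objective: next-token prediction on the demonstration.
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()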

Next, we pretrain the detector as before:

python pretrain_detector.py --model opt-350m --dataset PKU --gen_dataset  # Pretraining step for the detector

Then we need a reward model so that we can run RLHF on the actor model while embedding the watermark and training the detector. You can follow the script in the DeepSpeed examples to train a reward model, or write a script similar to pretrain_detector.py to produce a reward-model checkpoint on the PKU dataset.
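
If you take the second route, the standard recipe for a reward model on preference data like PKU's is a pairwise ranking loss over (chosen, rejected) responses. A minimal sketch, with illustrative data and a scalar reward head (not the repo's code):

import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
# num_labels=1 gives a scalar score head, i.e. a reward model.
rm = AutoModelForSequenceClassification.from_pretrained(
    "facebook/opt-350m", num_labels=1)

prompt = "Human: How can I handle this safely?\nAssistant: "
chosen = prompt + "a helpful, harmless answer"
rejected = prompt + "an unhelpful or unsafe answer"
r_chosen = rm(**tokenizer(chosen, return_tensors="pt")).logits
r_rejected = rm(**tokenizer(rejected, return_tensors="pt")).logits
# Bradley-Terry pairwise loss: score the chosen response above the rejected one.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()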

Either way, the next step assumes you have put the reward model checkpoint under ./deepspeed_ckpt/opt-350m. Then we run the co-training:

deepspeed --num_gpus 1 main_in_alignment.py --actor_model opt-1.3b --reward_model opt-350m --do_sample --use_lora --with_tensorboard --rlhf_wtm_lamda 0.5

Other settings are the same as before.
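
Our reading of --rlhf_wtm_lamda (unverified against main_in_alignment.py) is that it weights the watermark-detection reward relative to the alignment reward when the two signals are combined during training, roughly:

def combined_reward(rlhf_reward, wtm_reward, lam=0.5):
    # Hypothetical shape only; see main_in_alignment.py for the exact form.
    return rlhf_reward + lam * wtm_reward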

Evaluation

The training script above should report the detection AUC without perturbation. To evaluate model performance under different perturbations:

python evaluate.py --dataset {dataset} --model_path {path_to_model}
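
If you want to try a perturbation outside evaluate.py, the following self-contained sketch scores perturbed text with a trained detector and computes the detection AUC. The checkpoint path, the word-deletion perturbation, and the toy data are all hypothetical; the perturbations used in the paper may differ.

import random
import torch
from sklearn.metrics import roc_auc_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification

detector_path = "path_to_detector"  # hypothetical trained-detector checkpoint
tokenizer = AutoTokenizer.from_pretrained(detector_path)
detector = AutoModelForSequenceClassification.from_pretrained(detector_path)

def perturb(text, p=0.1):
    # Toy perturbation: delete a random fraction p of the words.
    return " ".join(w for w in text.split() if random.random() > p)

texts = ["a watermarked generation ...", "a human-written text ..."]
labels = [1, 0]
scores = []
for t in texts:
    batch = tokenizer(perturb(t), return_tensors="pt")
    with torch.no_grad():
        probs = detector(**batch).logits.softmax(-1)
    scores.append(probs[0, 1].item())  # P(watermarked)
print("AUC:", roc_auc_score(labels, scores))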