DOC: Improving Long Story Coherence With Detailed Outline Control

Update 10/26/23: See https://github.com/facebookresearch/doc-storygen-v2 for a version of the code with prompts that support newer chat models (e.g., LLaMA-2, ChatGPT). It follows the same high-level structure, but doesn't behave identically in all places (e.g., some pieces are removed for simplicity, and some heuristic checks are no longer necessary); our main goal with the rewrite was to make the code easier to work with / modify.

This repo contains code for DOC: Improving Long Story Coherence With Detailed Outline Control (https://arxiv.org/abs/2212.10077, ACL 2023) by Kevin Yang, Dan Klein, Nanyun Peng, and Yuandong Tian. In this codebase we provide instructions for automatically generating longer stories (avg 3500+ words in our paper experiments). DOC's stories are judged by human annotators as substantially more coherent, relevant, and interesting compared to those written by our previous system, Re3 (https://github.com/yangkevin2/emnlp22-re3-story-generation).

Installation / Data

(1) Install Python 3.8.15 and PyTorch 1.13.1 (slightly older/newer versions are probably also fine for both).
(2) Install the remaining requirements via pip install -r requirements.txt. You may also need to run pip install -U sentence-transformers if you get crashes related to huggingface_hub.snapshot_download later. If you run into issues with numpy versions, try version 1.22.4.
(3) Install this repo with pip install -e . (note the trailing dot, which refers to the current directory).

Also run export OPENAI_API_KEY=$YOUR_API_KEY in your terminal so that the code can call the GPT3 API with your key.

Meanwhile, run wget https://doc-story-generation-data.s3.amazonaws.com/doc_data.zip and unzip the folder to the top level of this repo. This folder contains pretrained controller/reranker ckpts and data used to train controllers/rerankers.
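Before launching any of the commands below, it can help to confirm that the API key and the downloaded checkpoints are where the commands expect them. The following is a minimal sketch, not part of the repo: the function name preflight is hypothetical, and the directory list is simply collected from the checkpoint paths used in the commands later in this README.

```python
import os
from pathlib import Path

# Checkpoint directories referenced by the commands in this README.
EXPECTED_CKPTS = [
    "doc_data/ckpt/outline_order_reranker",
    "doc_data/ckpt/relevance_reranker",
    "doc_data/ckpt/coherence_reranker",
    "doc_data/ckpt/detailed_controller",
]


def preflight(repo_root=".", env=None):
    """Return a list of setup problems; an empty list means nothing obvious is missing.

    Hypothetical helper for illustration only (not provided by the repo).
    """
    env = os.environ if env is None else env
    problems = []
    # The code calls the GPT3 API using the key from this environment variable.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    # doc_data.zip should have been unzipped at the top level of the repo.
    for rel in EXPECTED_CKPTS:
        if not (Path(repo_root) / rel).is_dir():
            problems.append(f"missing checkpoint directory: {rel}")
    return problems
```

Calling preflight() from the top level of the repo and printing the returned list gives a quick overview of what still needs to be set up.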

To get the final generated stories and Surge AI annotation results from our main experiments, run wget https://doc-story-generation-data.s3.amazonaws.com/doc_outputs.zip (note: some generated stories may contain sensitive/NSFW content, since we didn't attempt to filter these).

Plan + Outline Generation

We first generate the plan/outline before moving on to the main story.

Plan + Outline Generation Command

Example plan generation command matching the settings used for our main paper experiments:

mkdir output
CUDA_VISIBLE_DEVICES=0 python -u scripts/main.py --controller none none none longformer_classifier --loader none none none order --controller-load-dir none none none doc_data/ckpt/outline_order_reranker --controller-model-string none none none roberta-large --no-editor --setup-only --outline-levels 3 --save-outline-file output/plan.pkl --log-file output/plan.log

The code assumes the outline order reranker is in the 4th position of each multi-valued argument (--controller, --loader, --controller-load-dir, --controller-model-string), so don't change those parts of the command. Don't worry if you see some errors being printed, as long as the program doesn't terminate early; some parts might need multiple tries.

This command uses our existing reranker ckpts included in the download. If you want to use your own ckpts, see the instructions further down for training, and change the paths in this command to point to the correct ckpts.

Generating a plan with these settings costs a couple of dollars on GPT3.

Other Arguments

Plan generation arguments are compiled in scripts/main.py; follow the links there to see a complete list. Some particular arguments of interest:

Use --premise to supply your own story premise instead of having one autogenerated by GPT3.
Change --outline-levels to change the maximum depth of the outline.
Set --outline-char-model-string to a different InstructGPT3 model (e.g., text-curie-001) to save a sizable chunk (if not most) of the GPT3 cost in exchange for slightly worse performance when detecting characters for the outline.
Use --outline-restart-pkl to continue generation from a previously-generated lower-depth pkl file. (We use this functionality for our human-interactive experiments.)
Set --log-level to be something between 21 and 25 to vary the verbosity of logging (higher = less verbose; defaults to 25).
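If you script plan generation, building the command as an argument list makes the "4th position" convention harder to break by accident. The sketch below is a hypothetical helper (build_plan_command is not part of the repo); it only uses flags that appear in this README, and it assumes --premise takes a single string.

```python
import shlex


def build_plan_command(save_file="output/plan.pkl", log_file="output/plan.log",
                       outline_levels=3, premise=None, char_model=None):
    """Assemble the plan generation command as an argv list (illustrative helper)."""
    # The outline order reranker must stay in the 4th slot of each
    # multi-valued argument; the first three slots are padded with "none".
    pad = ["none", "none", "none"]
    cmd = [
        "python", "-u", "scripts/main.py",
        "--controller", *pad, "longformer_classifier",
        "--loader", *pad, "order",
        "--controller-load-dir", *pad, "doc_data/ckpt/outline_order_reranker",
        "--controller-model-string", *pad, "roberta-large",
        "--no-editor", "--setup-only",
        "--outline-levels", str(outline_levels),
        "--save-outline-file", save_file,
        "--log-file", log_file,
    ]
    # Optional arguments described in the list above.
    if premise is not None:
        cmd += ["--premise", premise]  # assumption: the premise is one string
    if char_model is not None:
        cmd += ["--outline-char-model-string", char_model]
    return cmd


# Print a shell-quoted version that can be pasted into a terminal.
print(shlex.join(build_plan_command(premise="A lighthouse keeper finds a map.")))
```

The returned list can also be passed directly to subprocess.run, with CUDA_VISIBLE_DEVICES set in the environment as in the example command.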

Main Story Generation

After generating a plan according to the previous instructions, we can generate the story.

OPT-175B Setup

Our main story generation uses OPT-175B served using Alpa (https://alpa.ai/), since it allows token-level logit modification to run controlled generation approaches such as DOC's detailed controller as described in the paper. You have a few options here.
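To illustrate why token-level access matters, here is a minimal sketch of FUDGE-style logit modification (the controller in the commands below is named fudge_controller). It is an illustration under stated assumptions, not the repo's implementation: the function names are hypothetical, and it assumes the controller returns, for each candidate next token, a log-probability that the continuation stays on-outline.

```python
import math


def log_softmax(scores):
    """Normalize a token -> score dict into log-probabilities."""
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return {tok: s - log_z for tok, s in scores.items()}


def controlled_next_token_logprobs(base_logits, controller_logprobs, strength):
    """Shift each base-LM logit by strength * controller log-probability.

    base_logits: token -> raw logit from the base LM (e.g., OPT-175B).
    controller_logprobs: token -> log-probability from a smaller controller
        (assumed to score every candidate token in base_logits).
    strength: control strength; 0 leaves the base distribution unchanged.
    """
    combined = {
        tok: logit + strength * controller_logprobs[tok]
        for tok, logit in base_logits.items()
    }
    return log_softmax(combined)
```

With strength 1 this amounts to multiplying the base model's next-token probability by the controller's probability and renormalizing; larger strengths push harder toward tokens the controller prefers. A completions-only API that returns sampled text cannot support this, since the adjustment has to happen before each token is sampled.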

1. Free Public Alpa OPT-175B API (easy to get started; high quality, may be slower)

You can ask the Alpa folks for a key to call their free public API at https://opt.alpa.ai/ (Slack link at the bottom). They're really nice. Note that we need access to the logprobs endpoint, not the default completions one. This option may be slower (in runtime) depending on your physical location, since their servers are in the Middle East.

Once you have a key, you can specify --alpa-url https://opt.alpa.ai --alpa-key YOUR_KEY in the main story command below.

2. Self-Serve (faster + high-quality; need a lot of compute)

If you have the compute, you can request the weights from Meta (https://forms.gle/BDB2i44QwCr2mCJN6) and serve it yourself using Alpa. This is the best (high-quality and reasonable speed) option if you can do it.

Follow the installation and serving instructions at https://alpa.ai/install.html and https://alpa.ai/tutorials/opt_serving.html respectively. The newest version of Alpa should work, but we also froze the version we used at https://github.com/yangkevin2/doc-alpa in case it's useful.

Once you have it set up, specify --alpa-url YOUR_SERVER_URL in the main story command below (e.g., in the format http://0.0.0.0:8001).

Alternatively you can use a smaller OPT model, though this will result in noticeably worse quality.

3. GPT3-175B (easiest to get started and fastest; worse quality)

Just use GPT3-175B instead, which means turning off our detailed controller. You will on average get noticeably worse faithfulness to the plan/outline, but it'll be quite a bit faster.

To do this, set --extension-method gpt3 in the main story command below. This will use the base davinci model (i.e., not one of the instruction-tuned GPT3.5/GPT4 models, which use a different prompting interface and aren't currently supported; these instruction-tuned models also often write in a somewhat different style).

It's not too expensive as far as the GPT3 API is concerned; you'll probably spend less than a dollar over the course of the story.
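The three options differ only in which extra arguments get appended to the main story command. As a rough summary, the sketch below maps each option to its arguments; backend_args and the option names are hypothetical, while the flags and the public URL are the ones given above.

```python
def backend_args(option, url=None, key=None):
    """Return the extra command-line arguments for one of the three setups.

    Illustrative helper only; the option names are made up for this sketch.
    """
    if option == "public_alpa":
        # Option 1: free public Alpa API, requires a key from the Alpa team.
        if key is None:
            raise ValueError("the public Alpa API needs a key")
        return ["--alpa-url", "https://opt.alpa.ai", "--alpa-key", key]
    if option == "self_serve":
        # Option 2: your own Alpa server, e.g. http://0.0.0.0:8001.
        if url is None:
            raise ValueError("self-serve needs the URL of your Alpa server")
        return ["--alpa-url", url]
    if option == "gpt3":
        # Option 3: base davinci via the GPT3 API; detailed controller is off.
        return ["--extension-method", "gpt3"]
    raise ValueError(f"unknown option: {option}")
```

For example, backend_args("self_serve", url="http://0.0.0.0:8001") returns the two arguments to substitute for {{{ALPA_ARGS}}} in the command below.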

Main Story Generation Command

After setting up your OPT-175B (or other) server, run the following to draft the story using the same settings as in our main paper experiments, making sure to append the extra Alpa-related (or other) arguments described above.

CUDA_VISIBLE_DEVICES=0 python -u scripts/main.py {{{ALPA_ARGS}}} --controller longformer_classifier longformer_classifier fudge_controller --loader alignment coherence fine_coherence --controller-load-dir doc_data/ckpt/relevance_reranker doc_data/ckpt/coherence_reranker doc_data/ckpt/detailed_controller --controller-model-string allenai/longformer-base-4096 allenai/longformer-base-4096 facebook/opt-350m --load-outline-file output/plan.pkl --no-editor --include-future-context --control-strength 1 1 0 --control-strength-substep-increment 3 --save-complete-file output/story.pkl --log-file output/story.log

The command assumes all 3 rerankers/controllers are present in the specified order, so don't change those arguments.

Although this command still uses GPT3 to write some summaries for prompting, the costs are on the order of a few cents.

Other Arguments

Main story generation arguments are also compiled in scripts/main.py; follow the links there to see a complete list. Some particular arguments of interest:

Change --max-continuation-substeps (defaults to 8) and --max-tokens (defaults to 64) to change how much maximum story text to write for each numbered item of the outline. With the default settings, it will write up to eight 64-token passages for each.
Change --early-stop-threshold and --skip-threshold to mess with the early stopping heuristics for moving drafting to the next outline item. Smaller (more negative) values of --early-stop-threshold will result in more aggressive early stopping. Larger (less negative) values of --skip-threshold will result in more frequently skipping directly to the next outline item when all generated passage candidates aren't very good.
--control-strength has three numbers corresponding to the relevance reranker, coherence reranker, and detailed controller respectively. The detailed controller's control strength increases over time according to --control-strength-substep-increment up to --max-control-strength while drafting for a given outline item, resetting when we move to the next outline item. We think the current settings are a reasonable balance of control vs. letting the model be creative, but feel free to tweak. To turn off the detailed controller just use --control-strength-substep-increment 0.
The frequency and prompt repetition penalties during generation are set to 1 (with 0.98 exponential decay per token). You can change these via --summarizer-frequency-penalty, --summarizer-prompt-penalty, and --summarizer-frequency-penalty-decay respectively. Other arguments related to the base generator are in story-generation/common/summarizer/summarizer_util.py.
If you have a high-depth outline and you want to generate using lower depth (e.g. convert a depth 3 outline to a depth 2 outline), specify --generation-outline-levels.
Increase --max-beam-size (defaults to 1) to turn on a passage-level variable-size beam search procedure based on the rerankers. This is off for the paper experiments (makes the system several times slower).
If you run out of GPU memory you can try decreasing --fudge-batch-size to e.g. 32 (or less), or retrain smaller rerankers/controllers according to the instructions at the bottom of the README.
Remove --no-editor to turn on the Edit module inherited from Re3 (not heavily tested; DOC doesn't use it in our main experiments).
Set --log-level to be something between 21 and 25 to vary the verbosity of logging (higher = less verbose; defaults to 24).
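To make the length and control-strength arguments more concrete, here is a small sketch of the arithmetic they imply. It assumes the detailed controller's strength grows linearly with a 0-indexed substep count and is then capped; the exact schedule in the code may differ, and the cap of 10 used in the example is a made-up value, not the default of --max-control-strength.

```python
def max_tokens_per_outline_item(max_continuation_substeps=8, max_tokens=64):
    """Upper bound on story tokens drafted for one numbered outline item."""
    return max_continuation_substeps * max_tokens


def detailed_control_strength(substep, initial=0.0, increment=3.0, max_strength=None):
    """Assumed linear schedule for the detailed controller within one outline item.

    substep: 0-indexed passage number for the current outline item (assumption).
    initial: third value of --control-strength (0 in the example command).
    increment: value of --control-strength-substep-increment (3 in the example).
    max_strength: value of --max-control-strength, or None for no cap.
    """
    strength = initial + increment * substep
    return strength if max_strength is None else min(strength, max_strength)


# Default drafting budget: eight 64-token passages per outline item.
print(max_tokens_per_outline_item())
# Strength over the first five substeps with a hypothetical cap of 10.
print([detailed_control_strength(k, max_strength=10) for k in range(5)])
```

The schedule restarts from the initial value whenever drafting moves to the next outline item, which matches the resetting behavior described above.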

Note About Crashes

Using very small OPT models could lead to crashes since we didn't extensively test the edge cases where all the generated continuations get rejected by our filters (you can set --skip-threshold -10000 to avoid this happening). This may sometimes happen with GPT3 as well when the detailed controller is off. This crash never happened in our main experiments using OPT-175B.

Baselines

Baselines assume you already have a plan generated by our code according to the command described earlier.

Re3 (OPT-175B, matching length with DOC)

Take the plan generated by our code (output/plan.pkl in the commands below) and save just the setting/characters and top-level outline for use in Re3:

python scripts/data/save_re3_plan.py -i output/plan.pkl -o output/re3_plan.pkl

Then follow the instructions in https://github.com/yangkevin2/emnlp22-re3-story-generation. Run with OPT-175B for fair comparison, using --extension-method opt; the Alpa arguments are the same as in this repo. Specify the already-generated plan file using --load-outline-file. You'll want to also set --max-candidates 8 --summarizer-frequency-penalty 1 --summarizer-prompt-penalty 1 as well as --max-continuation-substeps 5 to roughly match the length of stories we generate in our main experiments.

Rolling Window OPT-175B

python -u scripts/rolling_baselines.py {{{ALPA_ARGS}}} --load-outline-file output/plan.pkl --extension-method opt --save-complete-file output/rolling_opt_story.pkl > output/rolling_opt_story.log

Rolling Window GPT3-175B

python -u scripts/rolling_baselines.py --load-outline-file output/plan.pkl --extension-method gpt3 --save-complete-file output/rolling_gpt3_story.pkl > output/rolling_gpt3_story.log

Controller / Reranker Training

Detailed Controller Training

Set the checkpoint save directory and run the command below. The training data (derived from InstructGPT-13B summaries of passages from WritingPrompts (Fan et al., 2018)) is provided in the data download.

CUDA_VISIBLE_DEVICES=0 python scripts/training/train_controller.py --controller-save-dir {{{SAVE_DIRECTORY}}} --controller fudge_controller --controller-model-string facebook/opt-350m --data-dir doc_data/training_data/detailed_controller_training_data.csv --dataset alignment --loader fine_coherence --batch-size 2 --lower-length-limit 1000 --controller-epochs 20 --num-workers 8 --controller-num-negatives 3 --controller-lr 1e-6 --coherence-negative-categories other shuffle repeat --limit 100000

Outline Order Reranker Training

Set the checkpoint save directory and run the command below. The training data (some very brief, outline-like stories generated by InstructGPT3-175B) is provided in the data download.

CUDA_VISIBLE_DEVICES=0 python scripts/training/train_controller.py --controller-save-dir {{{SAVE_DIRECTORY}}} --controller longformer_classifier --controller-model-string roberta-large --data-dir doc_data/training_data/order_training_data.csv --dataset csv --csv-column story --loader order --batch-size 64 --controller-epochs 20 --controller-lr 1e-5 --limit 100000 --num-workers 8

Relevance and Coherence Reranker Training

If you want to retrain the relevance and coherence rerankers yourself, follow the instructions in https://github.com/yangkevin2/emnlp22-re3-story-generation, since ours are unchanged from theirs.
