GPU Memory Requirement on Retraining PairRM

harshyadav17 commented 6 months ago

I am trying to train PairRM (PairRanker) on the Unified Feedback dataset (884,528 samples). Here is the modified config:

dataset="unified_feedback"
backbone_type="deberta" 
backbone_name="microsoft/deberta-v3-large"
n_gpu=4
ranker="PairRanker" 

source_maxlength=1224
candidate_maxlength=412
per_device_train_batch_size=4
per_device_eval_batch_size=1
gradient_accumulation_steps=8
using_metrics="human_preference"

The rest of the parameters are unchanged (I disabled Adafactor, since the ZeRO config specifies Adam). Could you share what GPU memory consumption you saw during the initial training? At present I am training on 4 A100 GPUs, and it occupies more than 50 GB on each device (I assumed that a 400M-parameter model should not take more than 16 GB per GPU). Training with this setup also seems slow, given the size of the backbone model.

@jdf-prog @yuchenlin Your insights/suggestions on this would be very helpful. Thanks

jdf-prog commented 6 months ago

Are you training a model with a DeBERTa backbone? If so, you should decrease per_device_train_batch_size to 1 and change gradient_accumulation_steps accordingly. The global batch size should be 64, which is the product of n_gpu, per_device_train_batch_size, and gradient_accumulation_steps.
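The batch-size arithmetic above can be checked with a short sketch. The helper names are hypothetical and not part of the LLM-Blender training script; the numbers come from the config posted in this thread, and the steps-per-epoch figure assumes all 884,528 samples are used for training.

```python
import math


def global_batch_size(n_gpu, per_device_train_batch_size, gradient_accumulation_steps):
    """Number of samples contributing to one optimizer step."""
    return n_gpu * per_device_train_batch_size * gradient_accumulation_steps


def accumulation_steps_for(target_global_batch, n_gpu, per_device_train_batch_size):
    """Accumulation steps needed to reach a target global batch size."""
    per_forward = n_gpu * per_device_train_batch_size
    if target_global_batch % per_forward != 0:
        raise ValueError("target global batch is not divisible by n_gpu * per-device batch")
    return target_global_batch // per_forward


# Config from the original post: 4 GPUs, per-device batch 4, accumulation 8.
original_global_batch = global_batch_size(4, 4, 8)

# Suggested setup: per-device batch 1 on 4 GPUs, global batch 64.
suggested_accumulation = accumulation_steps_for(64, 4, 1)

# Optimizer steps per epoch, assuming every sample is used for training.
steps_per_epoch = math.ceil(884_528 / 64)

print(original_global_batch, suggested_accumulation, steps_per_epoch)
```

With these values the posted config gives a global batch size of 128, twice the suggested 64, and the suggested setup needs 16 accumulation steps.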

Also, please note that you don't need the ZeRO-3 config if you are training on DeBERTa; that is for larger models. You can comment out this line in the training script: https://github.com/yuchenlin/LLM-Blender/blob/main/train_ranker.sh#L185

harshyadav17 commented 6 months ago

Thanks for the prompt response. I am training PairRanker with DeBERTa-v3-large as the backbone. I will share the results of the next training round, which uses the settings you suggested:

per_device_train_batch_size = 1
ngpus = 4
gradient_accumulation_steps = 16
deepspeed disabled (zero3 config)

I still have one question: with around 400M parameters, why is the GPU memory consumption so high? Is it because the dataset is big (~800,000 data points) and takes up some memory?

Thanks for your insights!

jdf-prog commented 6 months ago

Well, the DeBERTa model should be able to run on a single GPU without OOM. However, since there is so much data, you need more GPUs to train it; otherwise the tqdm bar may indicate that training will take hundreds of hours.

It might also be because of DeBERTa's model architecture. In my experience, it consumes about twice the memory that RoBERTa does.
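To see why a ~400M-parameter model can still occupy tens of GB, here is a rough back-of-the-envelope sketch. It rests on several assumptions that are not confirmed in this thread: fp32 training, Adam with two moment buffers per parameter, the published DeBERTa-v3-large shape (24 layers, 16 attention heads), PairRanker concatenating the source and both candidates into one sequence (1224 + 2 × 412 = 2048 tokens), and one fp32 attention-score matrix per head per layer kept for the backward pass. It ignores all other activations and the extra terms of DeBERTa's disentangled attention, so treat it as a lower bound rather than a measurement.

```python
GIB = 1024 ** 3


def static_training_bytes(n_params, bytes_per_value=4):
    """Weights + gradients + two Adam moment buffers, all in fp32 (assumed)."""
    weights = n_params * bytes_per_value
    grads = n_params * bytes_per_value
    adam_states = 2 * n_params * bytes_per_value
    return weights + grads + adam_states


def attention_score_bytes(batch, seq_len, n_heads, n_layers, bytes_per_value=4):
    """One (seq_len x seq_len) score matrix per head, per layer, per sample."""
    return batch * n_heads * seq_len * seq_len * bytes_per_value * n_layers


n_params = 400_000_000        # figure quoted in this thread
seq_len = 1224 + 2 * 412      # assumed: source plus two candidates in one input

static_gib = static_training_bytes(n_params) / GIB
attn_gib_bs4 = attention_score_bytes(4, seq_len, n_heads=16, n_layers=24) / GIB
attn_gib_bs1 = attention_score_bytes(1, seq_len, n_heads=16, n_layers=24) / GIB

print(f"static (weights, grads, Adam): {static_gib:.2f} GiB")
print(f"attention scores, batch 4:     {attn_gib_bs4:.2f} GiB")
print(f"attention scores, batch 1:     {attn_gib_bs1:.2f} GiB")
```

Under these assumptions the parameter-related memory is about 6 GiB regardless of dataset size, while the attention scores alone grow from about 6 GiB at a per-device batch size of 1 to about 24 GiB at batch size 4. The dataset size affects training time, not per-step GPU memory; the long sequence length and the per-device batch size are the more likely drivers.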

harshyadav17 commented 6 months ago

Thanks for the help. Can you give more insight into the datasets used to train the PairRM model? According to the model card, it was trained on the following datasets:


[openai/summarize_from_feedback](https://huggingface.co/datasets/openai/summarize_from_feedback)
[openai/webgpt_comparisons](https://huggingface.co/datasets/openai/webgpt_comparisons)
[Dahoas/synthetic-instruct-gptj-pairwise](https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise)
[Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)
[lmsys/chatbot_arena_conversations](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations)
[openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback)

But that is only ~55% of the Unified Feedback dataset. Is there any reason for not including the entire Unified Feedback dataset (berkeley-nest/Nectar is excluded, and it alone makes up 40% of Unified Feedback)? I would love to know your thoughts and reasons.

harshyadav17 commented 6 months ago

Hey @jdf-prog, can you please share how you evaluated closed-source models like GPT-4 on the Auto-J pairwise dataset? https://huggingface.co/llm-blender/PairRM#auto-j-pairwise-test-data-performance

It would help us re-evaluate the models. Thanks!

jdf-prog commented 6 months ago

> Thanks for the help. Can you give more insight into the datasets used to train the PairRM model? According to the model card, it was trained on the following datasets:
> [openai/summarize_from_feedback](https://huggingface.co/datasets/openai/summarize_from_feedback)
> [openai/webgpt_comparisons](https://huggingface.co/datasets/openai/webgpt_comparisons)
> [Dahoas/synthetic-instruct-gptj-pairwise](https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise)
> [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)
> [lmsys/chatbot_arena_conversations](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations)
> [openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback)
> But that is only ~55% of the Unified Feedback dataset. Is there any reason for not including the entire Unified Feedback dataset (berkeley-nest/Nectar is excluded, and it alone makes up 40% of Unified Feedback)? I would love to know your thoughts and reasons.

Sorry about the late reply. PairRM was trained before Unified Feedback was curated. The datasets that were not used for training PairRM were added later for other purposes.

The two added datasets are argilla/ultrafeedback-binarized-preferences-cleaned and berkeley-nest/Nectar. The first is just a better version of UltraFeedback. As for Nectar, we also tried training on it, and performance improved only slightly. We guess that's because the model's capacity is limited by its size.

Further experiments on them are welcome.

jdf-prog commented 6 months ago

> Hey @jdf-prog, can you please share how you evaluated closed-source models like GPT-4 on the Auto-J pairwise dataset? https://huggingface.co/llm-blender/PairRM#auto-j-pairwise-test-data-performance
>
> It would help us re-evaluate the models. Thanks!

The script we used to evaluate PairRM on Auto-J is as follows:

import json
import torch

import llm_blender
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")

with open("./testdata_pairwise.jsonl", 'r') as f:
    autoJ_pairwise_test_data = [json.loads(line) for line in f]
instructions = None
inputs = [d["prompt"] for d in autoJ_pairwise_test_data]
cand1_texts = [d["response 1"] for d in autoJ_pairwise_test_data]
cand2_texts = [d["response 2"] for d in autoJ_pairwise_test_data]
labels = [d["label"] for d in autoJ_pairwise_test_data]

cmp_results1 = blender.compare(inputs, cand1_texts, cand2_texts, instructions, mode="[A,B]", return_logits=True)
cmp_results2 = blender.compare(inputs, cand1_texts, cand2_texts, instructions, mode="[B,A]", return_logits=True)
cmp_results = torch.tensor(cmp_results1) - torch.tensor(cmp_results2)

# scenario
total = 0
agree = 0
output_labels = []
output_labels_ex = []
output_labels_single = []
for i in range(len(autoJ_pairwise_test_data)):
    if cmp_results[i] > 0 and labels[i] == 0:
        agree += 1
    elif cmp_results[i] < 0 and labels[i] == 1:
        agree += 1
    else:
        # tie, not agree by default
        pass

    if cmp_results[i] > 0:
        output_labels_single.append({"output": 0})
    elif cmp_results[i] < 0:
        output_labels_single.append({"output": 1})
    else:
        output_labels_single.append({"output": 2})
    if cmp_results1[i] > 0:
        output_labels.append({"output": 0})
    elif cmp_results1[i] < 0:
        output_labels.append({"output": 1})
    else:
        output_labels.append({"output": 2})
    if cmp_results2[i] > 0:
        output_labels_ex.append({"output": 0})
    elif cmp_results2[i] < 0:
        output_labels_ex.append({"output": 1})
    else:
        output_labels_ex.append({"output": 2})

    total += 1
print("scenario. #total: {}, #agree: {}, ratio: {:.4f}".format(total, agree, agree / total))

with open("./PairRM_results.jsonl", 'w') as f:
    for label in output_labels:
        f.write(json.dumps(label) + "\n")
with open("./PairRM_ex_results.jsonl", 'w') as f:
    for label in output_labels_ex:
        f.write(json.dumps(label) + "\n")
with open("./PairRM_single_results.jsonl", 'w') as f:
    for label in output_labels_single:
        f.write(json.dumps(label) + "\n")

The bash script that calls the Auto-J leaderboard Python script:

python auto-j/codes/leaderboard/pairwise_eval.py \
    --source_file_path ./testdata_pairwise.jsonl \
    --pred_file_path ./PairRM_results.jsonl \
    --exchange_pred_file_path ./PairRM_ex_results.jsonl \
    --type "pairwise" # if "single" you do not need to provide `exchange_pred_file_path`

# Group Name              Agreement   Consistency
# -----------------------------------------------
# Summarization           56.94       97.22
# Exam Questions          52.78       94.44
# Code                    57.5        94.17
# Rewriting               52.5        91.67
# Creative Writing        60.65       95.83
# Functional Writing      56.25       90.83
# General Communication   55.21       93.75
# NLP Tasks               59.09       90.15
# -----------------------------------------------
# Overall                 56.9        92.96

python auto-j/codes/leaderboard/pairwise_eval.py \
    --source_file_path ./testdata_pairwise.jsonl \
    --pred_file_path ./PairRM_single_results.jsonl \
    --type "single" # if "single" you do not need to provide `exchange_pred_file_path`

# Group Name              Agreement   Consistency
# -----------------------------------------------
# Summarization           56.94       -
# Exam Questions          52.78       -
# Code                    58.33       -
# Rewriting               55.83       -
# Creative Writing        61.57       -
# Functional Writing      59.17       -
# General Communication   57.64       -
# NLP Tasks               62.5        -
# -----------------------------------------------
# Overall                 59.05       -

Hope that helps.

harshyadav17 commented 6 months ago

Thanks @jdf-prog. I am actually interested in the closed-source model evaluations, i.e. how you used GPT-4 and ChatGPT as evaluators. Could you share the input prompt for pairwise/single evaluation? Thanks for the prompt response!

jdf-prog commented 6 months ago

Oh, I just copied the numbers from the Auto-J paper. You may refer to their code for more details.

yuchenlin / LLM-Blender

GPU Memory Requirement on Retraining PairRM #20