Closed harshyadav17 closed 5 months ago
Are you training a base model based on DeBERTa? If so, decrease per_device_train_batch_size to 1 and adjust gradient_accumulation_steps accordingly. The global batch size should be 64, which is the product of n_gpu, per_device_train_batch_size, and gradient_accumulation_steps.
Also note that you don't need the zero3 config when training DeBERTa; that's for the larger models. You can comment out this line in the training script: https://github.com/yuchenlin/LLM-Blender/blob/main/train_ranker.sh#L185
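As a quick sanity check of the batch-size arithmetic described here, the following minimal sketch derives the accumulation steps from a target global batch size. The helper name is made up for illustration and is not part of the LLM-Blender training script.

```python
def gradient_accumulation_steps(global_batch_size: int,
                                n_gpu: int,
                                per_device_train_batch_size: int) -> int:
    """Hypothetical helper: accumulation steps needed to reach the global batch size."""
    samples_per_forward = n_gpu * per_device_train_batch_size
    if global_batch_size % samples_per_forward != 0:
        raise ValueError(
            "global batch size must be divisible by n_gpu * per_device_train_batch_size"
        )
    return global_batch_size // samples_per_forward


# 4 GPUs with one sample per device need 16 accumulation steps to reach 64.
print(gradient_accumulation_steps(64, n_gpu=4, per_device_train_batch_size=1))  # 16
```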
Thanks for the prompt response. I am training the PairRanker with DeBERTa-v3-large as the backbone. I will share the results of the next training round, which uses the settings you suggested:
per_device_train_batch_size = 1
ngpus = 4
gradient_accumulation_steps = 16
deepspeed disabled (zero3 config)
I still have one question: with only around 400M parameters, why is GPU memory consumption so high? Is it because the dataset is large (~8 lakh, i.e. about 800,000, data points) and takes up some memory?
Thanks for your insights!
Well, the DeBERTa model should be able to run on a single GPU without OOM. However, because there is so much data, you need more GPUs to train it; otherwise the tqdm bar may show that training will take hundreds of hours to finish.
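To make the "hundreds of hours" point concrete, here is a rough sketch of the arithmetic. The sample count (884,528) and the global batch size (64) come from this thread; the 0.5 seconds per micro-batch is a made-up placeholder, not a measured number, and the helper names are hypothetical.

```python
import math


def optimizer_steps_per_epoch(n_samples: int, global_batch_size: int) -> int:
    """Number of optimizer updates needed to see every sample once."""
    return math.ceil(n_samples / global_batch_size)


def epoch_hours(n_samples: int, n_gpu: int, per_device_bs: int,
                sec_per_micro_batch: float) -> float:
    """Wall-clock hours per epoch, assuming perfect data-parallel scaling."""
    micro_batches_per_gpu = math.ceil(n_samples / (n_gpu * per_device_bs))
    return micro_batches_per_gpu * sec_per_micro_batch / 3600


N_SAMPLES = 884_528          # Unified Feedback size quoted in this thread
SEC_PER_MICRO_BATCH = 0.5    # ASSUMPTION: placeholder, measure on your own hardware

print(optimizer_steps_per_epoch(N_SAMPLES, 64))
for gpus in (1, 4):
    hours = epoch_hours(N_SAMPLES, gpus, 1, SEC_PER_MICRO_BATCH)
    print(f"{gpus} GPU(s): ~{hours:.0f} hours per epoch")
```

Under that placeholder speed, a single GPU needs four times as long per epoch as four GPUs; the real numbers depend on sequence length and hardware.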
It might also be because of DeBERTa's model architecture. In my experience, it consumes about twice the memory that RoBERTa does.
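For a back-of-the-envelope view of where the memory goes, the sketch below estimates only the model states. It assumes fp32 weights, fp32 gradients, and two fp32 Adam moment buffers (16 bytes per parameter), and uses the round 400M figure from this thread. Activations, which grow with batch size and sequence length, are not modeled, so anything beyond this estimate is likely activation memory and framework overhead.

```python
def model_state_gib(n_params: float,
                    bytes_weights: int = 4,
                    bytes_grads: int = 4,
                    bytes_adam_states: int = 8) -> float:
    """Hypothetical helper: memory for weights + gradients + Adam moments, in GiB."""
    total_bytes = n_params * (bytes_weights + bytes_grads + bytes_adam_states)
    return total_bytes / 2**30


# ASSUMPTION: 400M parameters, plain fp32 training with Adam, no ZeRO sharding.
estimate = model_state_gib(400e6)
print(f"model states: ~{estimate:.1f} GiB (activations not included)")
```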
Thanks for the help. Can you share more details on the datasets used to train the PairRM model? According to the model card, it was trained on the following datasets:
[openai/summarize_from_feedback](https://huggingface.co/datasets/openai/summarize_from_feedback)
[openai/webgpt_comparisons](https://huggingface.co/datasets/openai/webgpt_comparisons)
[Dahoas/synthetic-instruct-gptj-pairwise](https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise)
[Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)
[lmsys/chatbot_arena_conversations](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations)
[openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback)
But that is only ~55% of the Unified Feedback dataset. Is there a reason for not including the entire Unified Feedback dataset? (berkeley-nest/Nectar is excluded, and it alone makes up 40% of Unified Feedback.) Would love to know your thoughts and reasons.
Hey @jdf-prog, can you please share how you evaluated closed-source models like GPT4 on the Auto-J pairwise dataset? https://huggingface.co/llm-blender/PairRM#auto-j-pairwise-test-data-performance
It would help us re-evaluate the models. Thanks!
Sorry about the late reply. PairRM was trained before Unified Feedback was curated. The datasets that were not used to train PairRM were added later for other purposes.
The 2 added datasets are argilla/ultrafeedback-binarized-preferences-cleaned and berkeley-nest/Nectar. The first is just a better version of UltraFeedback. As for Nectar, we also tried training on it and saw little improvement in performance. We guess that's because the model's capacity is limited by its size.
Further experiments investigating these datasets are welcome.
The script we used to evaluate PairRM on Auto-J is as follows:
import json
import torch
import llm_blender

# Load the PairRM ranker
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")

# Auto-J pairwise test set, one JSON object per line
with open("./testdata_pairwise.jsonl", 'r') as f:
    autoJ_pairwise_test_data = [json.loads(line) for line in f]

instructions = None
inputs = [d["prompt"] for d in autoJ_pairwise_test_data]
cand1_texts = [d["response 1"] for d in autoJ_pairwise_test_data]
cand2_texts = [d["response 2"] for d in autoJ_pairwise_test_data]
labels = [d["label"] for d in autoJ_pairwise_test_data]

# Score every pair in both candidate orders
cmp_results1 = blender.compare(inputs, cand1_texts, cand2_texts, instructions, mode="[A,B]", return_logits=True)
cmp_results2 = blender.compare(inputs, cand1_texts, cand2_texts, instructions, mode="[B,A]", return_logits=True)
# Combined score; the loop below treats a positive value as a vote for label 0
cmp_results = torch.tensor(cmp_results1) - torch.tensor(cmp_results2)

# scenario
total = 0
agree = 0
output_labels = []         # predictions from the [A,B] order
output_labels_ex = []      # predictions from the exchanged [B,A] order
output_labels_single = []  # predictions from the combined score
for i in range(len(autoJ_pairwise_test_data)):
    if cmp_results[i] > 0 and labels[i] == 0:
        agree += 1
    elif cmp_results[i] < 0 and labels[i] == 1:
        agree += 1
    else:
        # tie, not agree by default
        pass

    # Output labels: 0 and 1 follow the sign of the score, 2 marks a tie
    if cmp_results[i] > 0:
        output_labels_single.append({"output": 0})
    elif cmp_results[i] < 0:
        output_labels_single.append({"output": 1})
    else:
        output_labels_single.append({"output": 2})

    if cmp_results1[i] > 0:
        output_labels.append({"output": 0})
    elif cmp_results1[i] < 0:
        output_labels.append({"output": 1})
    else:
        output_labels.append({"output": 2})

    if cmp_results2[i] > 0:
        output_labels_ex.append({"output": 0})
    elif cmp_results2[i] < 0:
        output_labels_ex.append({"output": 1})
    else:
        output_labels_ex.append({"output": 2})
    total += 1

print("scenario. #total: {}, #agree: {}, ratio: {:.4f}".format(total, agree, agree / total))

# Write one prediction file per setting for the Auto-J leaderboard script
with open("./PairRM_results.jsonl", 'w') as f:
    for label in output_labels:
        f.write(json.dumps(label) + "\n")
with open("./PairRM_ex_results.jsonl", 'w') as f:
    for label in output_labels_ex:
        f.write(json.dumps(label) + "\n")
with open("./PairRM_single_results.jsonl", 'w') as f:
    for label in output_labels_single:
        f.write(json.dumps(label) + "\n")
The bash commands that call the Auto-J leaderboard Python script:
python auto-j/codes/leaderboard/pairwise_eval.py \
--source_file_path ./testdata_pairwise.jsonl \
--pred_file_path ./PairRM_results.jsonl \
--exchange_pred_file_path ./PairRM_ex_results.jsonl \
--type "pairwise" # if "single" you do not need to provide `exchange_pred_file_path`
# Group Name               Agreement    Consistency
# --------------------------------------------------
# Summarization            56.94        97.22
# Exam Questions           52.78        94.44
# Code                     57.5         94.17
# Rewriting                52.5         91.67
# Creative Writing         60.65        95.83
# Functional Writing       56.25        90.83
# General Communication    55.21        93.75
# NLP Tasks                59.09        90.15
# --------------------------------------------------
# Overall                  56.9         92.96
python auto-j/codes/leaderboard/pairwise_eval.py \
--source_file_path ./testdata_pairwise.jsonl \
--pred_file_path ./PairRM_single_results.jsonl \
--type "single" # if "single" you do not need to provide `exchange_pred_file_path`
# Group Name               Agreement    Consistency
# --------------------------------------------------
# Summarization            56.94        -
# Exam Questions           52.78        -
# Code                     58.33        -
# Rewriting                55.83        -
# Creative Writing         61.57        -
# Functional Writing       59.17        -
# General Communication    57.64        -
# NLP Tasks                62.5         -
# --------------------------------------------------
# Overall                  59.05        -
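For readers who want to sanity-check the Agreement and Consistency columns by hand, here is a small illustrative sketch. It is not the Auto-J leaderboard code. It assumes that the exchanged-order predictions are expressed in the swapped frame (so 0 and 1 must be flipped back), that a pair is consistent when both orders pick the same winner, and that agreement in the pairwise setting is only counted on consistent pairs. The function names are hypothetical.

```python
def flip(label: int) -> int:
    """Map a label from the exchanged order back to the original order (ties stay 2)."""
    return {0: 1, 1: 0}.get(label, label)


def agreement_and_consistency(gold, pred, pred_exchanged):
    """Return (agreement, consistency) as fractions of all pairs.

    ASSUMPTION: agreement requires the two orders to be consistent AND correct.
    """
    agree = consistent = 0
    for g, p, q in zip(gold, pred, pred_exchanged):
        if p == flip(q):          # both orders pick the same winner
            consistent += 1
            if p == g:            # and that winner matches the human label
                agree += 1
    n = len(gold)
    return agree / n, consistent / n


# Toy data, invented purely for illustration.
gold = [0, 1, 0, 1]
pred = [0, 1, 1, 0]
pred_exchanged = [1, 0, 1, 1]
print(agreement_and_consistency(gold, pred, pred_exchanged))  # (0.5, 0.75)
```

If the leaderboard script defines the metrics differently, its own code remains the reference.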
Hope that will be helpful.
Thanks @jdf-prog. I am actually interested in the closed-source model evaluations, i.e. how you used GPT4 and ChatGPT as evaluators. Could you share the input prompt for pairwise/single evaluation? Thanks for the prompt response!
Oh, I just copied the numbers from the Auto-J paper. You may refer to their code for more details.
I am trying to train PairRM (Pair Ranker) on the Unified Feedback dataset (884,528 samples). Following is the modified config:
The rest of the parameters are unchanged (I disabled Adafactor because the zero config specifies Adam). Can you please share what GPU memory consumption you saw in your initial training? At present I am training on 4 A100 GPUs, and it occupies more than 50 GB on each device (I assumed that a 400M-parameter model should not take more than 16 GB per GPU). Training with this setup also seems slow given the size of the backbone model.
@jdf-prog @yuchenlin Your insights/suggestions on this would be very helpful. Thanks