Closed. Cascol-Chen closed this issue 1 year ago.
Thank you for your issue! I will check whether this is a mistake and respond to you as soon as possible.
Hi @Cascol-SCUT, thank you again for your interest in LLM-Blender and for your script! Yeah, we do not count the cases where the selected model is the same as the compared model, as that would be too unfair to the fuser evaluation: the fuser does not have a Vic or OA output to select directly! Instead, I would say the ">=" here better represents the case where two different models' outputs are judged "same good" by ChatGPT. In that case, the comparison between "direct selection" (the ranker) and generating a new output (the fuser) is fairer. Otherwise, the beat percentage would favor "direct selection" (the ranker) too much.
Either way, the return value should not be zero, because zero means the selected output is not as good. Even returning a specific value to indicate that the case is not counted would be more reasonable.
Maybe returning 0.5 when the answer is "same good" and 1.0 when it is better would be the fairest.
That's good advice. We will consider it in our future work and keep improving our evaluation strategy.
Moreover, in the case where the generated output is judged "same good" compared with Vicuna, it gets a positive value while the direct selection gets 0, which is unfair because there may be no actual improvement.
That's right. This is a potential bad case for the evaluation strategy we used. That said, I think it only affects the Beat OA and Beat Vic metrics; the main conclusion should still be fine.
Could you kindly provide the performance of LLM-Blender using the calculation above? The code should be:
def getRankCompare(gpt_cmps:dict, selected_model, base_model):
    if selected_model == base_model: return 0.5
    cmp_str, st = selected_model+','+base_model, True
    if cmp_str not in gpt_cmps.keys():
        cmp_str, st = base_model+','+selected_model, False
    cmp_result = gpt_cmps.get(cmp_str)
    if 'good' in cmp_result: return 0.5
    elif 'bad' in cmp_result: return 0
    elif (st and 'A' in cmp_result) or (not st and 'B' in cmp_result): return 1
    else: return 0
That is, the performance computed with the above script.
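To be explicit about what I mean by "performance" here: below is a minimal sketch of how the per-example scores from getRankCompare could be aggregated into a beat percentage. The examples structure and the compute_beat_percentage name are only illustrative assumptions on my side, not the repository's actual evaluation code.

def compute_beat_percentage(examples, base_model):
    # examples: list of (gpt_cmps, selected_model) pairs, one per test question,
    # where gpt_cmps maps "modelA,modelB" to ChatGPT's comparison string.
    scores = [getRankCompare(gpt_cmps, selected, base_model)
              for gpt_cmps, selected in examples]
    return 100.0 * sum(scores) / len(scores)

# e.g. compute_beat_percentage(test_examples, 'vicuna-13b-1.1') would give the "Beat Vic" number.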
@Cascol-SCUT Yeah, of course. It's worth noting that, since you are asking for the performance under the changed logic, the numbers for the base LLMs, the rankers, and LLM-Blender all change for the two metrics Beat OA and Beat Vic. Therefore, I provide all of the results after the change here.
With this script logic,
def getRankCompare(gpt_cmps:dict, selected_model, base_model):
    if selected_model == base_model: return 0.5
    cmp_str, st = selected_model+','+base_model, True
    if cmp_str not in gpt_cmps.keys():
        cmp_str, st = base_model+','+selected_model, False
    cmp_result = gpt_cmps.get(cmp_str)
    if 'good' in cmp_result: return 0.5
    elif 'bad' in cmp_result: return 0
    elif (st and 'A' in cmp_result) or (not st and 'B' in cmp_result): return 1
    else: return 0
all of the results should be as follows:
======================================
Percentage of times that different LLMs and rankers beat Vicuna
LLM:
oasst-sft-4-pythia-12b-epoch-3.5 : 47.96
koala-7B-HF : 30.67
alpaca-native : 43.67
llama-7b-hf-baize-lora-bf16 : 41.01
flan-t5-xxl : 17.87
stablelm-tuned-alpha-7b : 16.91
vicuna-13b-1.1 : 50.00
dolly-v2-12b : 27.08
moss-moon-003-sft : 40.65
chatglm-6b : 35.35
mpt-7b-instruct : 25.86
Oracle:
BERTScore : 48.92
BLEURT : 49.85
BARTScore : 50.73
Ranker:
Random : 33.64
MLM-Scoring : 29.10
SimCLS : 55.41
SummaReranker : 54.79
PairRanker : 58.29
LLM-Blender:
Gen-Fuser : 53.05
======================================
Percentage of times that different LLMs and rankers beat OpenAssistant
LLM:
oasst-sft-4-pythia-12b-epoch-3.5 : 50.00
koala-7B-HF : 29.18
alpaca-native : 44.49
llama-7b-hf-baize-lora-bf16 : 42.38
flan-t5-xxl : 13.73
stablelm-tuned-alpha-7b : 14.54
vicuna-13b-1.1 : 49.95
dolly-v2-12b : 25.23
moss-moon-003-sft : 39.46
chatglm-6b : 33.87
mpt-7b-instruct : 24.12
Oracle:
BERTScore : 50.43
BLEURT : 50.56
BARTScore : 51.97
Ranker:
Random : 32.53
MLM-Scoring : 27.37
SimCLS : 57.16
SummaReranker : 57.15
PairRanker : 60.58
LLM-Blender:
Gen-Fuser : 53.90
As you can see, LLM-Blender's results are still better than all of the oracles (though they don't have to be). However, this logic still favors the rankers a lot: the rankers get much better numbers on these two metrics. That's because Vic and OA are both strong LLMs, so there are many cases where the rankers can simply select these two models' outputs, which causes the preference I talked about above.
Instead, I recommend using the following script logic:
def getRankCompare(gpt_cmps:dict, selected_model, base_model):
    if selected_model == base_model: return 0.5 # return 0.5 when same model outputs are compared.
    cmp_str, st = selected_model+','+base_model, True
    if cmp_str not in gpt_cmps.keys():
        cmp_str, st = base_model+','+selected_model, False
    cmp_result = gpt_cmps.get(cmp_str)
    if 'good' in cmp_result: return 1.0 # return 1 when different models are viewed as same good
    elif 'bad' in cmp_result: return 0
    elif (st and 'A' in cmp_result) or (not st and 'B' in cmp_result): return 1
    else: return 0
Compared with the former one, this version only changes line 7, from return 0.5 to return 1.0. In this case the comparison is fairer, as we only give half a win for the same-model comparison. The LLM-Blender results using this script logic are:
======================================
Percentage of times that different LLMs and rankers beat Vicuna
LLM:
oasst-sft-4-pythia-12b-epoch-3.5 : 62.78
koala-7B-HF : 39.93
alpaca-native : 56.70
llama-7b-hf-baize-lora-bf16 : 52.76
flan-t5-xxl : 23.89
stablelm-tuned-alpha-7b : 21.55
vicuna-13b-1.1 : 50.00
dolly-v2-12b : 33.33
moss-moon-003-sft : 51.62
chatglm-6b : 44.04
mpt-7b : 23.31
mpt-7b-instruct : 30.87
Oracle:
BERTScore : 61.05
BLEURT : 61.81
BARTScore : 61.01
Ranker:
Random : 41.67
MLM-Scoring : 37.79
SimCLS : 66.68
SummaReranker : 67.23
PairRanker : 69.75
LLM-Blender:
Gen-Fuser : 70.73
======================================
Percentage of times that different LLMs and rankers beat OpenAssistant
LLM:
oasst-sft-4-pythia-12b-epoch-3.5 : 50.00
koala-7B-HF : 39.01
alpaca-native : 61.35
llama-7b-hf-baize-lora-bf16 : 56.40
flan-t5-xxl : 19.93
stablelm-tuned-alpha-7b : 19.87
vicuna-13b-1.1 : 64.77
dolly-v2-12b : 31.44
moss-moon-003-sft : 51.79
chatglm-6b : 45.67
mpt-7b : 21.04
mpt-7b-instruct : 30.16
Oracle:
BERTScore : 60.38
BLEURT : 61.96
BARTScore : 65.27
Ranker:
Random : 41.39
MLM-Scoring : 36.44
SimCLS : 68.48
SummaReranker : 67.85
PairRanker : 73.44
LLM-Blender:
Gen-Fuser : 77.72
Again, this version's results are fully consistent with the paper's results.
I would say this issue indeed points to an aspect of the performance evaluation that can be improved: different script logic gives different results. However, given all of the re-evaluation results above, I would say the main conclusions of the paper do not change.
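To make the effect of that one-line difference concrete, here is a tiny toy example with made-up comparison strings (not data from our evaluation); score is a simplified stand-in for lines 7-10 of the scripts above, assuming the selected model is always model A:

def score(cmp_result, same_good_value):
    # same_good_value is 0.5 in the former script and 1.0 in the recommended one.
    if 'good' in cmp_result: return same_good_value
    elif 'bad' in cmp_result: return 0
    elif 'A' in cmp_result: return 1
    else: return 0

verdicts = ['A is better', 'The two responses are same good', 'Both responses are bad']
rule1 = sum(score(v, 0.5) for v in verdicts) / len(verdicts)  # former script -> 0.5
rule2 = sum(score(v, 1.0) for v in verdicts) / len(verdicts)  # recommended script -> ~0.67
# Only the "same good" verdict changes the average; "better" and "bad" are scored the same way.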
@Cascol-SCUT Thank you again for your careful observation and your interest in LLM-Blender. I hope the above response helps. Would you mind if I close this issue if you don't have further questions?
Thank you for your helpful reply! I have no more questions and will close this issue.
Based on my reimplementation, I believe there is a mistake in the calculation of the ">= Vic" and ">= OA" metrics.
In brief, it seems that when Oracle (BERTScore) selects Vicuna as its best model, ">= Vic" is set to false, which violates its definition of "better than or same good as". The same happens with ">= OA". After correctly implementing both metrics:
My detailed code is as follows:
By the way, changing line 4 to
if selected_model == base_model: return 0
reproduces exactly the performance announced in the paper.
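For completeness, here is a worked toy call of the edge case described above, reusing the getRankCompare function shown earlier in this thread (an illustrative call, not repository code): the oracle picks Vicuna's own output while computing ">= Vic".

selected = base = 'vicuna-13b-1.1'
# If the same-model check returns 0, this case is counted as "worse than Vic",
# even though the selected output IS the Vicuna output.
# If it returns 0.5, as in the scripts above, it is counted as a tie, which matches
# the "better than or same good as" definition.
print(getRankCompare({}, selected, base))  # prints 0.5 with the scripts shown above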