yuchenlin / LLM-Blender

[ACL2023] We introduce LLM-Blender, an innovative ensembling framework to attain consistently superior performance by leveraging the diverse strengths of multiple open-source LLMs. LLM-Blender cuts the weaknesses through ranking and integrates the strengths through fusing generation to enhance the capability of LLMs.
https://yuchenlin.xyz/LLM-Blender/
Apache License 2.0

Issue with calculating >= Vic and OA #6

Closed Cascol-Chen closed 1 year ago

Cascol-Chen commented 1 year ago

I believe there is a mistake in the calculation of the metrics ">= Vic" and ">= OA"; I found it while reimplementing them.

In brief, it seems that when Oracle(BERTScore) selects Vicuna as its best model, ">= Vic" is set to false, which violates its definition of "better than or as good as". The same applies to ">= OA". After correctly implementing both metrics, I get numbers that differ from those reported.

My detailed code is as follows:

import json
import jsonlines

metrics = ['bertscore', 'bartscore', 'bleurt', 'gVic', 'gOA']

# Return 1 if the selected model is judged better than or as good as the base model, else 0.
def getRankCompare(gpt_cmps:dict, selected_model, base_model):
    if selected_model == base_model: return 1
    cmp_str, st = selected_model+','+base_model, True
    if cmp_str not in gpt_cmps.keys():
        cmp_str, st = base_model+','+selected_model, False
    cmp_result = gpt_cmps.get(cmp_str)
    if 'good' in cmp_result: return 1
    elif 'bad' in cmp_result: return 0
    elif (st and 'A' in cmp_result) or (not st and 'B' in cmp_result): return 1
    else: return 0

# Collect the automatic metric scores and the ">= Vic" / ">= OA" flags for one candidate.
def getMetrics(data, idx):
    selected_model = data['candidates'][idx]
    model_scores = selected_model['scores']
    model_name = selected_model['model']
    result = {}
    result['bertscore'] = model_scores['bertscore']
    result['bartscore'] = model_scores['bartscore']
    result['bleurt'] = model_scores['bleurt']
    cmp_results = json.loads(data['cmp_results'])
    if cmp_results is None:
        return result
    result['gVic'] = getRankCompare(cmp_results, model_name, 'vicuna-13b-1.1')
    result['gOA'] = getRankCompare(cmp_results, model_name, 'oasst-sft-4-pythia-12b-epoch-3.5')
    return result

# Return (True, new_best) if this candidate's BLEURT score beats the current best value.
def custom_compare(value, current_scores):
    current_value = current_scores['bleurt']
    if value < current_value: return True, current_value
    else: return False, -1

# Aggregate metrics over the test set, selecting the oracle (BLEURT-best) candidate per example.
metrics_gater = {metric:[0, 0] for metric in metrics}
with jsonlines.open('./test_data_prepared.jsonl', 'r') as f:
    for data in f:
        candidates = data['candidates']
        value, idx = -1e8, -1
        for i, model_outputs in enumerate(candidates):
            scores = model_outputs['scores']
            result = custom_compare(value, scores)
            if result[0]:
                value = result[1]
                idx = i
        metric_results = getMetrics(data, idx)
        for key, val in metric_results.items():
            metrics_gater[key][0] += val
            metrics_gater[key][1] += 1
for key, val in metrics_gater.items():
    print(f'{key}: {val[0]/val[1]:.4f}')

By the way, changing "if selected_model == base_model: return 1" to "return 0" in getRankCompare reproduces exactly the performance announced in the paper.
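
For reference, here is a small usage sketch of how getRankCompare interprets cmp_results entries. The dictionary below is purely hypothetical (the exact comparison strings in test_data_prepared.jsonl may differ), and it assumes the getRankCompare defined in my script above is in scope:

# Hypothetical cmp_results entries: keys are "modelA,modelB", values are ChatGPT verdicts.
cmp_results = {
    'alpaca-native,vicuna-13b-1.1': 'B is better',                   # the base model is judged better
    'vicuna-13b-1.1,alpaca-native': 'A is better',                   # the selected model is judged better
    'oasst-sft-4-pythia-12b-epoch-3.5,vicuna-13b-1.1': 'Same good',  # judged equally good
}

print(getRankCompare(cmp_results, 'alpaca-native', 'vicuna-13b-1.1'))  # 0: judged worse than Vicuna
print(getRankCompare(cmp_results, 'vicuna-13b-1.1', 'alpaca-native'))  # 1: judged better than the base
print(getRankCompare(cmp_results, 'oasst-sft-4-pythia-12b-epoch-3.5', 'vicuna-13b-1.1'))  # 1: "same good"
print(getRankCompare(cmp_results, 'vicuna-13b-1.1', 'vicuna-13b-1.1'))  # 1: the same-model case; returning 0 here instead reproduces the paper's numbers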

jdf-prog commented 1 year ago

Thank you for your issue! I will check whether this is a mistake and respond to you ASAP.

jdf-prog commented 1 year ago

Hi @Cascol-SCUT, thank you again for your interest in LLM-Blender and for your script! Yeah, we do not consider the cases where the selected model is the same as the compared model, as it might be too unfair for the fuser evaluation: the fuser does not have a Vic or OA output to select directly! Instead, I would say the ">=" here better represents the case where two different models are viewed as "same good" by ChatGPT. In this case, the comparison of quality between "direct selection" (ranker) and generating a new output (fuser) is fairer. Otherwise, the beat percentage would favor "direct selection" (ranker) too much.

Cascol-Chen commented 1 year ago

Either way, the return value should not be zero, because zero means the selected output is not as good. Even returning a special value to indicate that the case is not counted would be more reasonable.

Cascol-Chen commented 1 year ago

Maybe returning 0.5 when the answers are judged "same good" and 1.0 when the selected one is better would be the fairest.

jdf-prog commented 1 year ago

Maybe returning 0.5 when the answers are judged "same good" and 1.0 when the selected one is better would be the fairest.

That's good advice. We will consider this in future work and keep improving our evaluation strategy.

Cascol-Chen commented 1 year ago

Moreover, when the generated output is judged "same good" compared with Vicuna, it gets a positive value while the direct selection gets 0, which is unfair because there may be no actual improvement.

jdf-prog commented 1 year ago

Moreover, when the generated output is judged "same good" compared with Vicuna, it gets a positive value while the direct selection gets 0, which is unfair because there may be no actual improvement.

That's right. This is a potential weakness of the evaluation strategy we used. Still, I think this only affects the Beat OA and Beat Vic metrics; the main conclusion should still hold.

Cascol-Chen commented 1 year ago

Could you kindly provide the performance of LLM-Blender using the calculation above? The code would be:

def getRankCompare(gpt_cmps:dict, selected_model, base_model):
    if selected_model == base_model: return 0.5  # half credit when the selected model is the base model itself
    cmp_str, st = selected_model+','+base_model, True
    if cmp_str not in gpt_cmps.keys():
        cmp_str, st = base_model+','+selected_model, False
    cmp_result = gpt_cmps.get(cmp_str)
    if 'good' in cmp_result: return 0.5  # half credit when two different models are judged "same good"
    elif 'bad' in cmp_result: return 0
    elif (st and 'A' in cmp_result) or (not st and 'B' in cmp_result): return 1
    else: return 0

with the above script:

jdf-prog commented 1 year ago

@Cascol-SCUT Yeah, of course. It's worth noting that, since you ask for the performance under the changed logic, the performance of the base LLMs, the rankers, and LLM-Blender would all change for the two metrics Beat OA and Beat Vic. Therefore, I provide all of the performance numbers after the change below.

jdf-prog commented 1 year ago

With this script logic,

def getRankCompare(gpt_cmps:dict, selected_model, base_model):
    if selected_model == base_model: return 0.5
    cmp_str, st = selected_model+','+base_model, True
    if cmp_str not in gpt_cmps.keys():
        cmp_str, st = base_model+','+selected_model, False
    cmp_result = gpt_cmps.get(cmp_str)
    if 'good' in cmp_result: return 0.5
    elif 'bad' in cmp_result: return 0
    elif (st and 'A' in cmp_result) or (not st and 'B' in cmp_result): return 1
    else: return 0

all the performance numbers should be:

======================================
Percentage of times that different llms and rankers beats vicuna
LLM: 
oasst-sft-4-pythia-12b-epoch-3.5 : 47.96
koala-7B-HF : 30.67
alpaca-native : 43.67
llama-7b-hf-baize-lora-bf16 : 41.01
flan-t5-xxl : 17.87
stablelm-tuned-alpha-7b : 16.91
vicuna-13b-1.1 : 50.00
dolly-v2-12b : 27.08
moss-moon-003-sft : 40.65
chatglm-6b : 35.35
mpt-7b-instruct : 25.86

Oracle:
BERTScore : 48.92
BLEURT : 49.85
BARTScore : 50.73

Ranker: 
Random : 33.64
MLM-Scoring : 29.10
SimCLS : 55.41
SummaReranker : 54.79
PairRanker : 58.29

LLM-Blender:
Gen-Fuser: 53.05

======================================
Percentage of times that different llms and rankers beats openassistant
LLM: 
oasst-sft-4-pythia-12b-epoch-3.5 : 50.00
koala-7B-HF : 29.18
alpaca-native : 44.49
llama-7b-hf-baize-lora-bf16 : 42.38
flan-t5-xxl : 13.73
stablelm-tuned-alpha-7b : 14.54
vicuna-13b-1.1 : 49.95
dolly-v2-12b : 25.23
moss-moon-003-sft : 39.46
chatglm-6b : 33.87
mpt-7b-instruct : 24.12

Oracle:
BERTScore : 50.43
BLEURT : 50.56
BARTScore : 51.97

Ranker: 
Random : 32.53
MLM-Scoring : 27.37
SimCLS : 57.16
SummaReranker : 57.15
PairRanker : 60.58

LLM-Blender:
Gen-Fuser: 53.90

As you can see, LLM-Blender's results are still better than all the Oracles (though they don't have to be). However, this logic still favors rankers a lot: rankers can get much better performance on these two metrics. That's because Vic and OA are both strong LLMs, so there are many cases where rankers select these two models' outputs, which causes the preference I talked about above.

jdf-prog commented 1 year ago

Instead, I recommend using the following script logic:

def getRankCompare(gpt_cmps:dict, selected_model, base_model):
    if selected_model == base_model: return 0.5 # return 0.5 when same model outputs are compared.
    cmp_str, st = selected_model+','+base_model, True
    if cmp_str not in gpt_cmps.keys():
        cmp_str, st = base_model+','+selected_model, False
    cmp_result = gpt_cmps.get(cmp_str)
    if 'good' in cmp_result: return 1.0 # return 1 when different models are viewed as same good
    elif 'bad' in cmp_result: return 0
    elif (st and 'A' in cmp_result) or (not st and 'B' in cmp_result): return 1
    else: return 0

Compared with the former one, this one only changes line 7: two different models judged "same good" now return 1.0 instead of 0.5, while the same-model case still gets only half a win (0.5). In this case, the comparison is fairer, as we only give half wins for the same-model comparison. The LLM-Blender performance results using this script logic are:

======================================
Percentage of times that different llms and rankers beats vicuna
LLM: 
oasst-sft-4-pythia-12b-epoch-3.5 : 62.78
koala-7B-HF : 39.93
alpaca-native : 56.70
llama-7b-hf-baize-lora-bf16 : 52.76
flan-t5-xxl : 23.89
stablelm-tuned-alpha-7b : 21.55
vicuna-13b-1.1 : 50.00
dolly-v2-12b : 33.33
moss-moon-003-sft : 51.62
chatglm-6b : 44.04
mpt-7b : 23.31
mpt-7b-instruct : 30.87

Oracle:
BERTScore : 61.05
BLEURT : 61.81
BARTScore : 61.01

Ranker: 
Random : 41.67
MLM-Scoring : 37.79
SimCLS : 66.68
SummaReranker : 67.23
PairRanker : 69.75

LLM-Blender:
Gen-Fuser : 70.73

======================================
Percentage of times that different llms and rankers beats openassistant
LLM: 
oasst-sft-4-pythia-12b-epoch-3.5 : 50.00
koala-7B-HF : 39.01
alpaca-native : 61.35
llama-7b-hf-baize-lora-bf16 : 56.40
flan-t5-xxl : 19.93
stablelm-tuned-alpha-7b : 19.87
vicuna-13b-1.1 : 64.77
dolly-v2-12b : 31.44
moss-moon-003-sft : 51.79
chatglm-6b : 45.67
mpt-7b : 21.04
mpt-7b-instruct : 30.16

Oracle:
BERTScore : 60.38
BLEURT : 61.96
BARTScore : 65.27

Ranker: 
Random : 41.39
MLM-Scoring : 36.44
SimCLS : 68.48
SummaReranker : 67.85
PairRanker : 73.44

LLM-Blender:
Gen-Fuser : 77.72

Again, this version's results are fully consistent with the results in the paper.

jdf-prog commented 1 year ago

I would say this issue indeed points to something that could be improved in the performance evaluation. Different script logic gives different results. However, given all the re-evaluation and results above, I would say the main conclusion of the paper doesn't change.

jdf-prog commented 1 year ago

@Cascol-SCUT Thank you again for your careful observation and your interest in LLM-Blender. I hope the above response helps. Would you mind if I close this issue, provided you have no further questions?

Cascol-Chen commented 1 year ago

Thank you for your helpful reply! I have no more questions and will close this issue.