zhaoxlpku / HKU-DASC7606-A2


Double check this repo flow #3

Closed tengwang0318 closed 6 months ago

tengwang0318 commented 6 months ago

If you think what I am asking about here in public is unfair for this assignment, feel free to delete this issue. Sorry for any inconvenience caused.

I want to double-check the assignment flow. After reading and reviewing the code, I am not sure whether my understanding is correct:

  1. Use the BGE model to compute the similarity between each test question and the training examples, then select the K most similar training samples as in-context demonstrations.
  2. Use phi-1.5 with the prompt generated in step 1 to produce the answer.
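If I understand it right, the retrieval in step 1 amounts to a cosine-similarity top-K over embeddings. A minimal sketch with toy vectors standing in for BGE outputs (the function name and dummy data are my own, not from the repo):

```python
import numpy as np

def top_k_demonstrations(test_emb, train_embs, k):
    """Rank training examples by cosine similarity to the test embedding
    and return the indices of the k most similar ones, most similar first."""
    train_norm = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    test_norm = test_emb / np.linalg.norm(test_emb)
    sims = train_norm @ test_norm          # cosine similarity to each train example
    return np.argsort(-sims)[:k].tolist()  # indices sorted by descending similarity

# Toy 2-d embeddings; in the assignment these would come from the BGE embedder.
train_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
test_emb = np.array([0.9, 0.1])
print(top_k_demonstrations(test_emb, train_embs, 2))  # → [0, 2]
```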

What's more, the README is confusing, especially these commands:

python eval_fewshot.py --data_path "data/ARC-Easy-validation.jsonl" --device_id "0,1" --model path_of_llm --embedder path_of_embedder --start_index 0 --end_index 9999 --max_len 1024 --output_path path_to_save_model_predictions --overwrite False --prompt_type "v2.0" --N 8 --top_k True --top_k_reverse False
python acc.py --prediction_path path_to_save_model_predictions

I think when we submit the assignment, we should change --data_path to ARC-Easy-test.jsonl. Is that right? If so, there is a problem when we want to test that the code runs after finishing the "write your code" part but before the test dataset is released: the validation file path does not contain "test", yet line 235 of eval_fewshot.py is

demonstrations = load_all_demonstrations(args.data_path.replace("test", "train"))

When we validate the model the path contains "validation", and when we test it the path contains "test", so the replacement should handle both cases.
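A quick sketch of the chained replacement, checking that both split names resolve to the training file (the helper name is mine, for illustration only):

```python
def demonstrations_path(data_path):
    # Map either the validation or the test split path back to the
    # corresponding training file via two chained str.replace calls.
    return data_path.replace("test", "train").replace("validation", "train")

print(demonstrations_path("data/ARC-Easy-validation.jsonl"))  # data/ARC-Easy-train.jsonl
print(demonstrations_path("data/ARC-Easy-test.jsonl"))        # data/ARC-Easy-train.jsonl
```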

Simply changing demonstrations = load_all_demonstrations(args.data_path.replace("test", "train")) to demonstrations = load_all_demonstrations(args.data_path.replace("test", "train").replace("validation", "train")) changes the performance as follows:

Before changing:
python eval_fewshot.py --data_path "data/ARC-Easy-validation.jsonl" --device_id "0,1" --model "models/phi-1_5" --embedder models/bge-small-en-v1.5 --start_index 0 --end_index 9999 --max_len 1024 --output_path "outputs/" --overwrite True --prompt_type "v2.0" --N 8 --top_k True --top_k_reverse False
performance: 0.980701754385965

python eval_fewshot.py --data_path "data/ARC-Challenge-validation.jsonl" --device_id "0,1" --model "models/phi-1_5" --embedder models/bge-small-en-v1.5 --start_index 0 --end_index 9999 --max_len 1024 --output_path "outputs/" --overwrite True --prompt_type "v2.0" --N 8 --top_k True --top_k_reverse False
performance: 0.9649737302977233

After changing:
python eval_fewshot.py --data_path "data/ARC-Easy-validation.jsonl" --device_id "0,1" --model "models/phi-1_5" --embedder models/bge-small-en-v1.5 --start_index 0 --end_index 9999 --max_len 1024 --output_path "outputs/" --overwrite True --prompt_type "v2.0" --N 8 --top_k True --top_k_reverse False
performance: 0.8017543859649123

python eval_fewshot.py --data_path "data/ARC-Challenge-validation.jsonl" --device_id "0,1" --model "models/phi-1_5" --embedder models/bge-small-en-v1.5 --start_index 0 --end_index 9999 --max_len 1024 --output_path "outputs/" --overwrite True --prompt_type "v2.0" --N 8 --top_k True --top_k_reverse False
performance: 0.6654991243432574
zhaoxlpku commented 6 months ago

Thank you for highlighting this issue. Your grasp of the submission pipeline is accurate. I appreciate your advice and will address the bug accordingly.