whaleloops / KEPT

auto icd coding with prompt
MIT License

F1-micro drops a lot for MIMIC-50 when using the HF model #10

Closed rianrajagede closed 4 months ago

rianrajagede commented 4 months ago

Hi, I tried to compare the F1-micro results between the model published on HF and the one downloaded from GDrive.

I ran the eval step on the MIMIC-III 50 dataset with the same parameters as in the README. The result from the GDrive model is: f1_micro:0.728503038114521

Then I changed the model path to the HF model, and the result dropped a lot: f1_micro:0.2161833132150996

Are there any settings I need to change to get a fair comparison between the two models?
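
For reference, the metric being compared here is plain micro-F1 over the multi-label code predictions. A minimal sketch with sklearn (toy arrays are hypothetical, and this is not the repo's exact evaluation code):

    import numpy as np
    from sklearn.metrics import f1_score

    # Hypothetical 0/1 label matrices of shape (num_notes, num_codes),
    # one column per ICD code.
    y_true = np.array([[1, 0, 1],
                       [0, 1, 0]])
    y_pred = np.array([[1, 0, 0],
                       [0, 1, 1]])

    # Micro-F1 pools true positives, false positives, and false negatives
    # across all codes before computing precision and recall.
    print(f1_score(y_true, y_pred, average="micro"))  # 2 TP, 1 FP, 1 FN -> ~0.667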

whaleloops commented 4 months ago

After we trained the Longformer with the knowledge graph (keptlongformer), we further finetuned keptlongformer on the MIMIC-III 50 dataset.

To finetune and eval on MIMIC-III 50 (2 A100 GPUs):

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node 2 --master_port 57666 run_coder.py \
                --ddp_find_unused_parameters False \
                --disable_tqdm True \
                --version mimic3-50 --model_name_or_path whaleloops/keptlongformer \
                --do_train --do_eval --max_seq_length 8192 \
                --per_device_train_batch_size 1 --per_device_eval_batch_size 2 \
                --learning_rate 1.5e-5 --weight_decay 1e-3 --adam_epsilon 1e-7 --num_train_epochs 8 \
                --evaluation_strategy epoch --save_strategy epoch \
                --logging_first_step --global_attention_strides 1 \
                --output_dir ./saved_models/longformer-original-clinical-prompt2alpha
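
For anyone inspecting the published checkpoint directly, here is a minimal loading sketch, assuming the Hub checkpoint works with transformers' Auto classes. Note that run_coder.py adds its own prompt-based classification setup on top, so this alone will not reproduce the finetuned MIMIC-III 50 numbers:

    from transformers import AutoTokenizer, AutoModel

    # Load the pretrained keptlongformer released on the HF Hub.
    tokenizer = AutoTokenizer.from_pretrained("whaleloops/keptlongformer")
    model = AutoModel.from_pretrained("whaleloops/keptlongformer")

    # Longformer-style encoders accept long inputs (--max_seq_length 8192
    # in the command above); a short input is enough for a smoke test.
    inputs = tokenizer("Example discharge summary text", return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)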
rianrajagede commented 4 months ago

I see. After reading your comment and re-reading the paper, I realized I had missed that: the published keptlongformer is not finetuned for MIMIC-III 50, so I need to finetune it first before evaluating it.

Thank you for the response!