whaleloops / KEPT

Auto ICD coding with prompt
MIT License
46 stars 17 forks

data leak risk #4

Open DopamineLcy opened 1 year ago

DopamineLcy commented 1 year ago

Hi, thank you for your nice work! I have a question, though: the model is initialized with Clinical-Longformer (https://huggingface.co/yikuan8/Clinical-Longformer), which is pretrained on MIMIC-III note data. Is there any risk of data leakage, i.e., that the test set appeared during the pre-training of Clinical-Longformer? Thank you very much.
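For reference, one direct way to check would be to compare admission IDs between the two corpora. A rough sketch (the file names and the `HADM_ID` column are my assumptions about how the corpora are stored, not part of this repo):

```python
import pandas as pd

# Admission IDs of discharge summaries in the ICD-coding test split
# (hypothetical CSV; adjust path/column to the actual split files)
test_ids = set(pd.read_csv("test_full.csv")["HADM_ID"])

# Admission IDs of the notes used to pretrain the language model
# (hypothetical file; the pretraining corpus may not expose this directly)
pretrain_ids = set(pd.read_csv("pretrain_notes.csv")["HADM_ID"])

overlap = test_ids & pretrain_ids
print(f"{len(overlap)} of {len(test_ids)} test admissions appear in the pretraining corpus")
```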

Best,

whaleloops commented 1 year ago

Hi, thanks for your question.

Yes, data leakage might occur. One could try fine-tuning on MIMIC-IV to see the effect of the leakage. However, the pretraining task (masked token prediction) is very different from the ICD coding task (multi-label classification), which is why the pretrained Clinical-Longformer (the w/o HSAP & Prompt row in Table 1) does not outperform models that are not pretrained but are engineered for the task, such as JointLAAT.
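To make the contrast concrete, here is a minimal sketch of the two task formats using the standard HuggingFace API (the label count and classification head are illustrative, not KEPT's actual architecture):

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoModelForSequenceClassification,
)

name = "yikuan8/Clinical-Longformer"

# Pretraining objective: masked token prediction,
# which is what Clinical-Longformer was trained on
mlm_model = AutoModelForMaskedLM.from_pretrained(name)

# Downstream objective: multi-label ICD classification, one logit per code.
# num_labels=50 is illustrative (e.g., a MIMIC-III top-50 setting); the
# classification head is newly initialized, not taken from pretraining.
clf_model = AutoModelForSequenceClassification.from_pretrained(
    name,
    num_labels=50,
    problem_type="multi_label_classification",  # uses BCEWithLogitsLoss
)
```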

P.S. We also explored pretraining on another, private corpus with phrase and sentence masking and then fine-tuned on MIMIC-III, so there is no leakage in that setup. Results are shown here.
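For concreteness, a rough sketch of what sentence-level masking could look like (illustrative only, not the exact masking scheme we used):

```python
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yikuan8/Clinical-Longformer")

def mask_sentences(text: str, p: float = 0.15) -> str:
    """Replace a random fraction of sentences with mask tokens of equal length."""
    sentences = text.split(". ")  # naive sentence split, for illustration only
    masked = []
    for sent in sentences:
        if random.random() < p:
            n = len(tokenizer.tokenize(sent))
            masked.append(" ".join([tokenizer.mask_token] * n))
        else:
            masked.append(sent)
    return ". ".join(masked)
```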