whaleloops / KEPT

Auto ICD coding with prompt
MIT License
46 stars 17 forks

data leak risk #4

Open DopamineLcy opened 1 year ago

DopamineLcy commented 1 year ago

Hi, thank you for your nice work! I have a question, though: the model is initialized with Clinical-Longformer (https://huggingface.co/yikuan8/Clinical-Longformer), which is pretrained on MIMIC-III note data. Is there any risk of data leakage, i.e., that the test set appeared during the pre-training of Clinical-Longformer? Thank you very much.
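For reference, one direct way to check would be to compare admission IDs between the two corpora. A rough sketch (the file names and the `HADM_ID` column are my assumptions about how the corpora are stored, not part of this repo):

```python
import pandas as pd

# Admission IDs of discharge summaries in the ICD-coding test split
# (hypothetical CSV; adjust path/column to the actual split files)
test_ids = set(pd.read_csv("test_full.csv")["HADM_ID"])

# Admission IDs of the notes used to pretrain the language model
# (hypothetical file; the pretraining corpus may not expose this directly)
pretrain_ids = set(pd.read_csv("pretrain_notes.csv")["HADM_ID"])

overlap = test_ids & pretrain_ids
print(f"{len(overlap)} of {len(test_ids)} test admissions appear in the pretraining corpus")
```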

Best,

whaleloops commented 1 year ago

Hi, thanks for your question.

Yes, data leakage might occur. One could try fine-tuning on MIMIC-IV to see the effect of the leakage. However, the pretraining task (masked token prediction) is very different from the ICD coding task (multi-label classification), which is why the pretrained Clinical-Longformer (the w/o HSAP & Prompt row in Table 1) does not outperform models that are not pretrained but are engineered for the task, such as JointLAAT.
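To make the contrast concrete, here is a minimal sketch of the two task formats using the standard HuggingFace API (the label count and classification head are illustrative, not KEPT's actual architecture):

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoModelForSequenceClassification,
)

name = "yikuan8/Clinical-Longformer"

# Pretraining objective: masked token prediction,
# which is what Clinical-Longformer was trained on
mlm_model = AutoModelForMaskedLM.from_pretrained(name)

# Downstream objective: multi-label ICD classification, one logit per code.
# num_labels=50 is illustrative (e.g., a MIMIC-III top-50 setting); the
# classification head is newly initialized, not taken from pretraining.
clf_model = AutoModelForSequenceClassification.from_pretrained(
    name,
    num_labels=50,
    problem_type="multi_label_classification",  # uses BCEWithLogitsLoss
)
```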

P.S. We also explored pretraining on another, private corpus with phrase and sentence masking and then fine-tuned on MIMIC-III, so there is no leakage in that setup. Results are shown here.
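For concreteness, a rough sketch of what sentence-level masking could look like (illustrative only, not the exact masking scheme we used):

```python
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yikuan8/Clinical-Longformer")

def mask_sentences(text: str, p: float = 0.15) -> str:
    """Replace a random fraction of sentences with mask tokens of equal length."""
    sentences = text.split(". ")  # naive sentence split, for illustration only
    masked = []
    for sent in sentences:
        if random.random() < p:
            n = len(tokenizer.tokenize(sent))
            masked.append(" ".join([tokenizer.mask_token] * n))
        else:
            masked.append(sent)
    return ". ".join(masked)
```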