yzhHoward / SMART

Official implementation of SMART: Towards Pre-trained Missing-Aware Model for Patient Health Status Prediction

Question on Dataset Usage During Pretraining and Fine-Tuning #1

Closed gilyoungCoder closed 1 month ago

gilyoungCoder commented 1 month ago

Hello,

I found your paper very interesting. I have a question regarding the datasets used during the model training process. Specifically, I would like to know whether the same dataset was used for both the pretraining and fine-tuning phases. For example, was the Cardiology dataset used during both pretraining and fine-tuning? Or was the foundation model designed to work robustly across all clinical datasets?

Thank you for your clarification.

yzhHoward commented 1 month ago

Thank you for your valuable question. In the paper, our goal is to make the model aware of missingness and improve its performance on these datasets, rather than to propose a universal foundation model. Thus, we use the same datasets for pre-training and fine-tuning.

Nevertheless, we have also explored this direction by building a foundation model that is pre-trained on multiple datasets and fine-tuned on different ones (the main change is to reduce the [CLS] vector from the existing N×d in the code to dimension d and broadcast it to all variables). Under this setting, we found that SMART can indeed bring improvements, but because the dependence on N is removed, its performance is slightly lower than the results reported in the paper.
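For readers wondering what the N×d → d change looks like in practice, here is a minimal PyTorch sketch. It illustrates the idea described above and is not the repository's actual code: the module name, the shapes, and the choice to add the broadcast vector to the per-variable embeddings are all assumptions.

```python
import torch
import torch.nn as nn

class SharedCLS(nn.Module):
    """Sketch of a shared [CLS]: one learnable d-dim vector broadcast to all
    N variables, replacing a dataset-specific (N, d) parameter so that N can
    vary across datasets. Names and shapes are illustrative assumptions."""

    def __init__(self, d: int):
        super().__init__()
        # Single shared vector instead of nn.Parameter(torch.zeros(N, d))
        self.cls = nn.Parameter(torch.zeros(d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, d) per-variable embeddings; N may differ between datasets
        b, n, d = x.shape
        cls = self.cls.expand(b, n, d)  # broadcast the shared vector to every variable
        return x + cls  # hypothetical: inject the CLS signal into each variable


# The same module then works for datasets with different variable counts N:
model = SharedCLS(d=64)
out_a = model(torch.randn(8, 17, 64))  # dataset A with N = 17 variables
out_b = model(torch.randn(8, 34, 64))  # dataset B with N = 34 variables
```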

In addition, we did not make the foundation model the focus of the paper at the time because we could not find a suitable baseline for comparison. We are considering a universal foundation model as the theme of follow-up work.

gilyoungCoder commented 1 month ago

Thank you for your detailed reply. I now have a clear understanding of the goals and approach presented in your paper.

yzhHoward commented 1 month ago

We are glad to have addressed your concerns. If you have any other questions, we would be delighted to answer them.