wjhou / ORGan

Code for the paper "ORGAN: Observation-Guided Radiology Report Generation via Tree Reasoning" (ACL'23).
https://wjhou.github.io/ORGan/
Apache License 2.0

Training guideline for MIMIC-CXR #9

Closed Markin-Wang closed 4 months ago

Markin-Wang commented 4 months ago

Thank you for your work and code. The training guideline for the IU X-ray dataset is well detailed.

However, the guideline for the MIMIC-CXR dataset is missing, e.g., the running scripts for training on MIMIC-CXR and the graph construction.

I would be grateful if you could provide any information for reproducing the results on the MIMIC-CXR dataset.

wjhou commented 4 months ago

Hi @Markin-Wang,

Thank you for your attention. I have uploaded the graph construction script for the MIMIC-CXR dataset, which is similar to the one for IU X-ray; you can find it here (https://github.com/wjhou/ORGan/blob/main/src/graph_construction/run_mimic_cxr.sh).

Regarding the training scripts, they are similar to those for the IU X-ray dataset, and you may simply modify the relevant paths in them. Unfortunately, I do not have access to the hard drive where I store all my source code until Monday, so these scripts will be uploaded later.

If you have any other questions, please let me know.

Best, Ethan

Markin-Wang commented 4 months ago

Hi Ethan,

Thank you for your quick reply and the further information; it is very helpful for reproducing the results on the MIMIC-CXR dataset.

No worries about the delay with the training scripts; I am looking forward to your update.

Best Regards, Jun

wjhou commented 4 months ago

Hi @Markin-Wang,

The training scripts for the MIMIC-CXR dataset have been uploaded; please check them out.

Feel free to reopen this issue if you have any other relevant questions.

Best, Ethan

Markin-Wang commented 4 months ago

Thank you for your update and sorry for bothering you again. Could you also kindly provide the script file for Observation Planning?

Best Regards, Jun

wjhou commented 4 months ago

Hi,

I have updated the code for extracting observation plans; you can find it here and either add it to your code or pull again. Basically, I just removed the mentions mined with PMI for the MIMIC-CXR dataset.

Now you can simply run the following command to get the planning annotations for both IU X-ray and MIMIC-CXR: python ./src/plan_extraction.py

Note that the observation annotations (i.e., id2tag.csv) should be placed in the appropriate folder as stated here.

Best, Ethan

Markin-Wang commented 4 months ago

Hi, thank you for your further update. May I ask the rough training time for observation planning (Stage 1)? When I try to reproduce the training on an RTX A6000 card, one epoch takes around 5 hours, and the GPU utilization stays at zero with only occasional spikes (the GPU memory is occupied, though). Is this a normal case, or have I done something wrong?

Best Regards, Jun

wjhou commented 4 months ago

Hi,

It sounds quite abnormal. 🤣

We conducted the experiments on an RTX 3090 GPU, and Stage 1 takes about 1.5 hours to complete. An RTX A6000 should be more efficient than an RTX 3090, i.e., it should take less time for training.

Have you finished training, and what are the results? How much memory does the model take during training?

I would suggest you check: (1) the version of the Transformers library, since different versions of the library may behave differently; (2) the data loading module, since a slow data loader could lead to longer training time.
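For example, here is a minimal sketch for checking point (2); it assumes a standard PyTorch DataLoader setup and is not code from this repo:

```python
import time
from torch.utils.data import DataLoader

def time_loader(dataset, collate_fn=None, batch_size=16, num_workers=4, n_batches=50):
    """Iterate the data pipeline alone (no model forward pass) and report throughput.

    If this alone is slow, data loading/decoding is the bottleneck, not the model.
    """
    loader = DataLoader(dataset, batch_size=batch_size, collate_fn=collate_fn,
                        num_workers=num_workers, pin_memory=True)
    start = time.time()
    for i, _batch in enumerate(loader):
        if i + 1 >= n_batches:
            break
    elapsed = time.time() - start
    print(f"{n_batches} batches in {elapsed:.1f}s "
          f"({n_batches * batch_size / elapsed:.1f} samples/s)")
```

Trying a few values of num_workers here also shows quickly whether parallel workers help.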

Best, Ethan

Markin-Wang commented 4 months ago

Hi Ethan, thank you for your reply. Sorry, I didn't finish the training since, as you mentioned, this seems to be an abnormal case and the results could therefore be wrong.

The memory usage is around 5 GB with the default batch size. The command I use is ./script_plan/run_mimic_cxr.sh 1 None.

I used the same Transformers version, so I guess the problem is the second point you mentioned, the data loading, given the GPU utilization pattern. Does this part contain a heavy data preprocessing procedure, or do you have any suggestions for debugging it?

By the way, if I run the code in debug mode with the command ./script_plan/run_mimic_cxr.sh 0 None, it gives me the following error message:

    Traceback (most recent call last):
      File "./src_plan/run_ende.py", line 298, in <module>
        main()
      File "./src_plan/run_ende.py", line 260, in main
        train(
      File "/home/jun/Documents/projects/phd/baselines/ORGan/src_plan/train_eval_ende_full.py", line 19, in train
        train_result = trainer.train(resume_from_checkpoint=checkpoint)
      File "/home/jun/anaconda3/envs/tc110/lib/python3.8/site-packages/transformers/trainer.py", line 1414, in train
        self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
      File "/home/jun/anaconda3/envs/tc110/lib/python3.8/site-packages/transformers/trainer.py", line 1521, in _maybe_log_save_evaluate
        metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
      File "/home/jun/anaconda3/envs/tc110/lib/python3.8/site-packages/transformers/trainer_seq2seq.py", line 70, in evaluate
        return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
      File "/home/jun/anaconda3/envs/tc110/lib/python3.8/site-packages/transformers/trainer.py", line 2158, in evaluate
        output = eval_loop(
      File "/home/jun/Documents/projects/phd/baselines/ORGan/src_plan/seq2seqtrainer_metrics_ende.py", line 60, in evaluation_loop
        metrics = eval_text(
      File "/home/jun/Documents/projects/phd/baselines/ORGan/src_plan/train_eval_ende_full.py", line 236, in eval_text
        bleu_scores = compute_scores(gts=gts, res=res)
      File "/home/jun/Documents/projects/phd/baselines/ORGan/src_plan/metrics.py", line 33, in compute_scores
        score, scores = scorer.compute_score(gts, res)
      File "/home/jun/Documents/projects/phd/baselines/ORGan/src_plan/../pycocoevalcap/cider/cider.py", line 50, in compute_score
        (score, scores) = cider_scorer.compute_score()
      File "/home/jun/Documents/projects/phd/baselines/ORGan/src_plan/../pycocoevalcap/cider/cider_scorer.py", line 192, in compute_score
        assert(len(self.ctest) >= max(self.document_frequency.values()))
    ValueError: max() arg is an empty sequence

Is this normal or not?

Best Regards, Jun

Markin-Wang commented 4 months ago

Sorry, just to update: this error also occurs in normal training mode after a while.

wjhou commented 4 months ago

It seems the model can generate something, and this is a pycocoevalcap issue. Can you look into the details and check the model's outputs for debugging?

Markin-Wang commented 4 months ago

Hi, could you provide a snippet of the annotation file (annotation.json) you used for MIMIC-CXR (or IU X-ray, if the annotation format is the same)? I directly use the annotation file from R2GenCMN and just want to make sure the problem is not caused by a wrong annotation file.
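(For reference, the R2Gen/R2GenCMN annotation file I am using roughly follows the layout sketched below; the field names are written from memory as an assumption to double-check, not a confirmed specification.)

```python
# Assumed structure of annotation.json (please verify against the real file):
annotation = {
    "train": [
        {
            "id": "...",            # study identifier
            "image_path": ["..."],  # one or more image paths relative to the image root
            "report": "...",        # reference report text
            "split": "train",
        },
        # ... more examples
    ],
    "val": [],   # entries with the same fields
    "test": [],  # entries with the same fields
}
```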

wjhou commented 4 months ago

We use the same annotation file as you mentioned. It's probably caused by the CIDEr script; I remember I did not turn on the CIDEr calculation.

Markin-Wang commented 4 months ago

Hi, the code for enabling the CIDEr evaluation can be seen below: https://github.com/wjhou/ORGan/blob/5921b56cbc504b0172872e52ab19a0daa48d9ebb/src_plan/metrics.py#L25
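For context, a typical pycocoevalcap-style setup where the CIDEr entry can be toggled looks roughly like the sketch below; this is illustrative only and not the actual contents of metrics.py. The point is that skipping the Cider scorer avoids the document-frequency computation that fails on empty references:

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

def compute_plan_scores(gts, res, with_cider=False):
    """Hypothetical scorer wrapper: BLEU always, CIDEr only when explicitly enabled."""
    scorers = [(Bleu(4), ["BLEU_1", "BLEU_2", "BLEU_3", "BLEU_4"])]
    if with_cider:
        scorers.append((Cider(), "CIDEr"))
    results = {}
    for scorer, names in scorers:
        score, _ = scorer.compute_score(gts, res)
        if isinstance(names, list):
            results.update(dict(zip(names, score)))
        else:
            results[names] = score
    return results
```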

wjhou commented 4 months ago

Hmmm, I am currently conducting some follow-up experiments. Let me try some runs and I'll get back to you later.

Markin-Wang commented 4 months ago

Thank you for your reply. I tried to debug the code and printed the output just before this line: https://github.com/wjhou/ORGan/blob/5921b56cbc504b0172872e52ab19a0daa48d9ebb/src_plan/train_eval_ende_full.py#L236

The output is:

    gts: {0: [''], 1: [''], 2: [''], 3: [''], 4: [''], 5: [''], 6: [''], 7: [''], 8: [''], 9: [''], 10: [''], 11: [''], 12: [''], 13: [''], 14: [''], 15: ['']}

    res: {0: ['i v q g r { j a'], 1: ['i v q g r b a'], 2: ['i v q g r { j a'], 3: ['i v q g r b a'], 4: ['i v q g r'], 5: ['i v q g r { j a'], 6: ['i v q g r { j a'], 7: ['i v q g r'], 8: ['i v q g r b a'], 9: ['i v q g r b a'], 10: ['i v q g r b a'], 11: ['i v q g r { j a'], 12: ['i v q g r b a'], 13: ['i v q g r n a'], 14: ['i v q g r n a'], 15: ['i v q g r b a']}
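The all-empty gts entries would also explain the traceback above: if no reference n-grams are collected, the document-frequency table the CIDEr scorer builds stays empty, and max() over it raises exactly this error. A minimal illustration in plain Python (not the pycocoevalcap code itself):

```python
# With gts like {0: [''], 1: [''], ...}, no n-grams are counted, so the
# scorer's document-frequency mapping stays empty.
document_frequency = {}

try:
    max(document_frequency.values())  # what the assert in cider_scorer.py evaluates
except ValueError as err:
    print(err)  # -> max() arg is an empty sequence
```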

I hope this information helps you debug the code.

wjhou commented 4 months ago

Well, it seems the model is running properly, and the issue might stem from the data preparation.

Markin-Wang commented 4 months ago

Yes, maybe this is also the reason for the low training speed and GPU utilization rate. I would be grateful for any ideas about the possible problems. I will also try to debug the code.

wjhou commented 4 months ago

Hi,

After reading my messy code again, I figured out why it takes so long to finish training.

In this code, the images are loaded inside the collator instead of the dataset, although I fixed this in my later projects.

To fix it, the image loading should be moved into the dataset, similar to R2Gen. You may try it first; I will update the code later, which could take some time.
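A rough sketch of that change, assuming a PyTorch-style dataset as in R2Gen (the class and field names below are illustrative, not the actual ORGan code):

```python
from PIL import Image
from torch.utils.data import Dataset

class ReportDataset(Dataset):
    """Hypothetical dataset that decodes images in __getitem__, so DataLoader
    workers (num_workers > 0) load images in parallel, instead of the collator
    doing it serially inside the training loop."""

    def __init__(self, examples, transform):
        self.examples = examples    # e.g. entries parsed from annotation.json
        self.transform = transform  # torchvision-style image transforms

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        example = self.examples[idx]
        image = Image.open(example["image_path"][0]).convert("RGB")
        return {
            "pixel_values": self.transform(image),
            "report": example["report"],
        }

# The collator then only batches the already-loaded tensors and tokenizes the text.
```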

Best, Ethan

Markin-Wang commented 4 months ago

Hi Ethan,

Thank you for your time and the update; it is very helpful. I will also try to follow your suggestions myself first and look forward to your later update.

Best Regards, Jun

wjhou commented 4 months ago

Hi @Markin-Wang

I have updated the code for better efficiency, i.e., moved the image loading into the dataset. I tested it on my server and it works fine; you may take it as a reference.

Stage 2 code will be updated later.

Note that this code does not use the Accelerate library, which is deeply integrated into later versions of the Transformers library.

Best, Ethan