xmed-lab / DHC

MICCAI 2023: DHC: Dual-debiased Heterogeneous Co-training Framework for Class-imbalanced Semi-supervised Medical Image Segmentation

How to reproduce similar results to those in your paper #7

Closed Chenzhgit closed 5 months ago

Chenzhgit commented 6 months ago

Thank you for your excellent work, which has greatly helped my own. However, I have a few questions regarding the implementation of the DHC project. In October 2023 I downloaded your complete code to reproduce the results, working from my own computer connected via VS Code to my university's server, which provides NVIDIA P100 16GB GPUs. Upon downloading the DHC project code, I encountered some issues: I needed to uncomment certain lines in the main function of preprocess_amos.py for AMOS data preprocessing, and I needed to change a few '0's to tensors in the 'cls_learn' function of the 'dhc.py' script. After resolving these errors I made no further modifications to your code, and I used the dataset split uploaded to GitHub in October to avoid generating random splits myself. However, during reproduction I found that the results are very unstable and do not match those presented in your paper. Even with the latest updates to your GitHub project and dataset split, the reproduced results are worse than before.
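For concreteness, the kind of edit I made in 'cls_learn' looks like the sketch below (illustrative, with hypothetical variable names, not the repo's exact lines): a value initialized as the Python int 0 breaks operations that require tensor arguments.

```python
import torch

# Sketch of the scalar-vs-tensor fix (hypothetical names):
# acc = 0                             # int 0 breaks the tensor-only op below
acc = torch.zeros(1)                  # fix: start from a zero tensor instead
for w in (torch.rand(1) for _ in range(3)):
    acc = torch.maximum(acc, w)       # torch.maximum(0, w) raises a TypeError
print(acc)
```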

  1. How did you obtain the results presented in your paper? Did you select the best result or take the average of multiple experiments? What should I do to achieve the results presented in your paper, or to meet the standard you describe for reporting them? I ask because I would like to present results that adhere to the standards in your paper and cite your work in the future.

  2. Regardless of how I attempt to reproduce the results of the DHC project, such as with the Synapse dataset, the weight curves for 'distdw' and 'diffdw' that I obtain are completely different from those presented in your paper. How did you adjust them for your paper, and what modifications should I make?

  3. I have another small question. In the 'Dataset and Implementation Details' section of the '3. Experiment' chapter of your paper, are the organs listed in order along with their corresponding labels (i.e., 0, 1, 2, 3, ...)? And are they consistent with the labels provided on the dataset's official website? In the 'dhc.py' script, there is an 'np.histogram' call in 'distdw' that counts the number of voxels per label, starting from 0 (I sketch my reading of it just below this list). During reproduction, I assumed these are all arranged in the same order as the organs listed in your paper.
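To make point 3 concrete, this is roughly how I read that counting step (a sketch of my understanding, not the repo's exact code; the class count of 14 is my assumption for Synapse, i.e., background + 13 organs):

```python
import numpy as np

# Count voxels per label index: 0 is background, 1..N are the organs
# in the order listed in the paper (my assumption).
num_cls = 14
label_volume = np.random.randint(0, num_cls, size=(64, 128, 128))

counts, _ = np.histogram(label_volume, bins=num_cls, range=(0, num_cls))
print(counts)   # counts[0] = background voxels, counts[1] = first organ, ...
```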

Thank you very much for your assistance.

Taking the Synapse dataset as an example, below are my reproduction results: [screenshots: results table; weight_A and weight_B trend curves over 160 epochs]

yinguanchun commented 6 months ago

I want to know the size of the Synapse data you use. The .npy files the author uploaded are 512×512×d and the .nii.gz files are 72×144×144, but in processed.py both of them seem wrong. Besides, the data samples have some replication between the training, eval, and test splits.

Chenzhgit commented 6 months ago

@yinguanchun Hi, the original size of one preprocessed Synapse volume is 80×160×160, and the size of one Synapse sample in the dataloader is 64×128×128. The current version of the DHC code leads to lower segmentation accuracy, so I ran the old version released more than half a year ago.
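For reference, getting a 64×128×128 sample from an 80×160×160 preprocessed volume can be sketched as a random crop like this (my own illustration; the repo's actual transform may differ):

```python
import numpy as np

def random_crop_3d(image, label, patch=(64, 128, 128)):
    """Randomly crop an aligned (D, H, W) image/label pair to `patch`."""
    d, h, w = image.shape
    pd, ph, pw = patch
    z = np.random.randint(0, d - pd + 1)   # 0..16 for depth 80 -> 64
    y = np.random.randint(0, h - ph + 1)   # 0..32 for height 160 -> 128
    x = np.random.randint(0, w - pw + 1)
    sl = (slice(z, z + pd), slice(y, y + ph), slice(x, x + pw))
    return image[sl], label[sl]

img = np.zeros((80, 160, 160), dtype=np.float32)
lab = np.zeros((80, 160, 160), dtype=np.uint8)
patch_img, patch_lab = random_crop_3d(img, lab)
assert patch_img.shape == (64, 128, 128)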

yinguanchun commented 6 months ago

Could you give me the old version of the code and the splits of AMOS and Synapse? Thank you!

McGregorWwww commented 6 months ago

Hi, thanks for your attention.

  1. As can be seen in the training logs, we set three different seeds and report the average and standard deviation over the three experiments. As for the poor reproduced results, may I ask what your environment is, i.e., CUDA version, PyTorch version, etc.? One possible reason may be the environment. Furthermore, at which epoch did you get these results? Since we use early stopping, training may stop too early due to randomness. (A typical seeding setup is sketched after this list.)

  2. For the curves, I just downloaded the weights from TensorBoard and plotted them; the difference may also be due to the inferior results.

  3. Yes, the organs are listed in order along with their corresponding labels (i.e., 0, 1, 2, 3, ...); 0 denotes the background.
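As referenced in point 1, a typical seeding setup for one run looks roughly like the sketch below (an illustrative sketch, not the exact code from our scripts); the reported numbers are then the mean and std over the per-seed results:

```python
import random
import numpy as np
import torch

def set_seed(seed):
    """Fix the common sources of randomness for a single training run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True   # trades speed for determinism
    torch.backends.cudnn.benchmark = False

dice_per_seed = []
for seed in (0, 1, 666):        # the seeds visible in the training logs
    set_seed(seed)
    # dice_per_seed.append(train_and_evaluate())   # placeholder for a full run

# reported value: f"{np.mean(dice_per_seed):.2f} ± {np.std(dice_per_seed):.2f}"
```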

McGregorWwww commented 6 months ago

Hi, thanks for pointing this out, and sorry for our negligence. The size should be 80×160×160. We have re-uploaded the processed dataset and the splits.

Chenzhgit commented 6 months ago

Hi, many thanks for your messages and answers.

  1. I created a conda virtual environment "semi" with the following packages:
     - conda 4.6.14
     - Python 3.7.13
     - PyTorch 1.8.0+cu111
     - torchvision 0.9.0+cu111
     - server: Linux 3.10.0-1160.108.1.el7.x86_64 SMP Thu Jan 25 16:17:31 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

For obtaining the results, I also used your original code with early stopping; normally the best results are obtained before epoch 80, and early stopping triggers when the epoch count is around 150. So is the gap due to the different environment and the use of early stopping? Either way, I may need to keep early stopping consistent with your experiments.

  2. Referring to your training logs for Synapse 20%, there are three seeds (0, 1, 666) for the three fold evaluations. Are the seed settings the same for the other percentages of Synapse and for AMOS? If possible, could you provide all the training logs for Synapse and AMOS with the different percentages? By the way, the splits are different. Taking the test split as an example, the following is the test split of Synapse you provided more than half a year ago: [screenshot: old test split]

But now the test split of Synapse you provide is this: [screenshot: new test split]

The same differences appear in the other splits. Does this mean there is no significant difference between experiments with the old splits and the current splits? If so, which splits should I use?

Thank you^^

McGregorWwww commented 6 months ago

Based on the info you provided, I think it is due to the early-stopping strategy. As can be seen in the training log, the best model for fold 1 is at epoch 160, and training stopped at epoch 260. So you may use a larger patience or simply remove the early-stopping strategy; I think both are fine (a generic patience loop is sketched below). Other ways to improve the performance: https://github.com/xmed-lab/DHC/issues/6.
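An illustrative patience-based loop for reference (the variable names and the validation metric in our scripts differ; the dummy validation curve here is only to make the sketch runnable):

```python
def train_with_early_stop(max_epochs=500, patience=150):
    """Stop when validation Dice has not improved for `patience` epochs."""
    best_dice, best_epoch = 0.0, 0
    for epoch in range(1, max_epochs + 1):
        dice = validate_one_epoch(epoch)        # stand-in for real validation
        if dice > best_dice:
            best_dice, best_epoch = dice, epoch # new best: reset the counter
        elif epoch - best_epoch >= patience:
            break                               # stagnated: stop training
    return best_dice, best_epoch

def validate_one_epoch(epoch):
    # dummy curve that improves until epoch 160, then plateaus (cf. fold 1)
    return min(epoch, 160) / 160.0

print(train_with_early_stop())                  # -> (1.0, 160)
```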

The seeds remain the same, but I apologize for the loss of the other logs and code. While attempting to recover them, I introduced several bugs; sorry for the inconvenience.

The final results were tested with the newer split, as can also be seen in the predictions provided: https://drive.google.com/drive/u/1/folders/1J9SQ2zpjlVMZlUf3AHjhJ8aGTTbRFo-Y. This is also aligned with our improved NeurIPS work; if you are interested, please refer to https://github.com/xmed-lab/GenericSSL.

Hope this helps.

Chenzhgit commented 6 months ago

Thank you very much for your response. One more question: I also tried your updated code and the Synapse data you uploaded yesterday, but there is a bug in the code.

[screenshot: traceback from the updated code]

Chenzhgit commented 6 months ago

@McGregorWwww Dear author, I am relatively new to semi-supervised learning, and I apologize for any inconvenience caused by my questions. With your latest code and the latest uploaded data splits, I encountered the above-mentioned issue in init.py (see the screenshot) while replicating the experiment. Could you kindly assist me with resolving this final question? Your help is greatly appreciated; apologies again for any inconvenience caused.

McGregorWwww commented 6 months ago

@Chenzhgit Hi, I apologize for the inconvenience, but I've been very busy these days and currently do not have the time to debug. It appears that the bug is caused by the variable 'cnt' being 0. I'll fix this as soon as I finish my current tasks.
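In the meantime, a guard along these lines should let the run continue (a hypothetical sketch; I have not yet pinpointed the exact line that uses 'cnt'):

```python
import torch

def safe_divide(total, cnt):
    """Skip the per-class update when no voxels of that class were counted."""
    if cnt == 0:
        return torch.zeros_like(total)   # avoids the division-by-zero failure
    return total / cnt
```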

Btw, maybe you can try this repo, as I am confident that it is bug-free. Sorry once again for any inconvenience caused.

Chenzhgit commented 6 months ago

@McGregorWwww Thanks a lot ❤