wwyi1828 / CluSiam

Improving Representation Learning for Histopathologic Images with Cluster Constraints
MIT License

Training data split #3

Closed: akidway closed this issue 4 months ago

akidway commented 4 months ago

Hi, @wwyi1828

Thank you for your work.

In the paper, it is mentioned that "we split the training sets into 75% training and 25% validation partitions." Does the 25% validation subset participate in the self-supervised training process? If not, was the training set split only once, or was k-fold cross-validation employed (where k-fold would imply that the self-supervised training was conducted k times)?

Additionally, I encountered some difficulties downloading the patches you provided on Dropbox due to internet issues. Using CLAM, I obtained approximately 7 million patches for the training set, and with the preprocessing method provided by DSMIL I got approximately 10 million, while the paper states 2.6 million for the training set. Did you employ any sampling methods? Could you please provide further details about the preprocessing of the slides?

Best wishes.

wwyi1828 commented 4 months ago

Hi akidway,

Thank you for bringing this to my attention.

Since SSL pretraining is computationally heavy, I followed the common practice for SSL on natural images: I used the entire training set for SSL pretraining and evaluated the model's performance on the hold-out test set. In natural image tasks, models are usually trained on the training set and then evaluated via KNN accuracy on the hold-out test set.

For the patch-level classification task, I simply predicted each patch in the test set using the label of its nearest neighbor in the training set. This approach is parameter-free and should provide a fair comparison. For the downstream slide-level classification task, I split the training set into training and validation subsets and selected the best checkpoint based on validation performance. Finally, I evaluated the model on the hold-out test set.
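This parameter-free nearest-neighbor evaluation could be sketched as follows, assuming patch embeddings and labels have already been extracted with the frozen encoder (the file names and the use of scikit-learn are illustrative, not the paper's exact code):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical pre-extracted embeddings/labels from the frozen SSL encoder
train_feats = np.load("train_feats.npy")
train_labels = np.load("train_labels.npy")
test_feats = np.load("test_feats.npy")
test_labels = np.load("test_labels.npy")

# Predict each test patch with the label of its nearest training patch
knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")
knn.fit(train_feats, train_labels)
acc = knn.score(test_feats, test_labels)
print(f"1-NN patch-level accuracy: {acc:.4f}")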

Regarding the data preprocessing details, I used the preprocessing pipeline from my previous lab. As I have left that lab and the pipeline is not publicly available, I cannot share it. If you encounter issues downloading the preprocessed data from Dropbox due to connectivity problems, the CLAM preprocessing pipeline should work as well, although it generates more background patches. The CLAM preprocessing preset (https://github.com/mahmoodlab/CLAM/blob/master/presets/bwh_biopsy.csv) I used recently is as follows:

#preset: bwh_biopsy

seg_params = {'seg_level': -1, 'sthresh': 15, 'mthresh': 11, 'close': 2, 'use_otsu': False,
              'keep_ids': 'none', 'exclude_ids': 'none'}

filter_params = {'a_t': 1, 'a_h': 1, 'max_n_holes': 2}

vis_params = {'vis_level': -1, 'line_thickness': 50}

patch_params = {'white_thresh': 5, 'black_thresh': 50, 'use_padding': True, 'contour_fn': 'four_pt'}

This preset generates about 3 million non-overlapping 224×224 patches for the training set.

akidway commented 4 months ago

Thank you for your response.

Regarding the preprocessing, I just used the following command to process the normal slides of C16 for training:

python create_patches_fp.py \
--source /work/lzh/data/WSI/CAMELYON16/training/normal \
--save_dir /work/lzh/data/test/normal \
--patch_size 224 \
--preset bwh_biopsy.csv \
--seg --patch

However, I obtained 4,870,165 patches (just for the normal slides), which is significantly more than the expected 3 million.

To count the patches, I read each .h5 file to retrieve the patch coordinates and summed the total number of patches. The code is as follows:

import glob
import h5py
from tqdm import tqdm

# Sum the number of patch coordinates stored in every CLAM .h5 file
cases = list(glob.glob("/work/lzh/data/test/normal/patches/*.h5"))
total = 0
for x in tqdm(cases):
    with h5py.File(x, "r") as f:
        total += f["coords"].shape[0]
print(total)

I'm wondering if there might be a problem with my command. Could you please advise on how to adjust it to obtain the expected number of patches?

wwyi1828 commented 4 months ago

I can see two potential reasons why the actual number of patches differs from your expectation:

  1. Step size: You need to specify the step size with the --step_size argument. If you only change the patch size without adjusting the step size, some regions may be missed during patch extraction. By default, both the step size and the patch size are set to 256. Since you set the patch size to 224, you should also set the step size to 224 so that all regions are covered, i.e., add --step_size 224 to your command.
  2. Patch level: It's important to specify the appropriate patch level for the desired resolution. The CAMELYON16 dataset has a maximum resolution of 40x at level 0. However, for efficiency, most researchers (including myself) process the images at 20x, which corresponds to level 1 for CAMELYON16. To extract patches at 20x, add --patch_level 1 to your command. In general, the number of patches at 20x is roughly 1/4 of that at 40x, since each linear dimension is halved; because of the background-removal step in preprocessing, the actual ratio may not be exactly 1/4. A command combining both fixes is sketched below.
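
For example, your command with both adjustments applied might look like this (paths are yours; the flags are standard arguments of CLAM's create_patches_fp.py):

python create_patches_fp.py \
--source /work/lzh/data/WSI/CAMELYON16/training/normal \
--save_dir /work/lzh/data/test/normal \
--patch_size 224 \
--step_size 224 \
--patch_level 1 \
--preset bwh_biopsy.csv \
--seg --patch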

Additionally, I suggest saving patches as PNG images instead of H5 files. Although H5 files are more efficient when directly applying MIL algorithms, they can be inconvenient when performing image augmentation in subsequent steps.
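
If you go the PNG route, a minimal sketch of converting CLAM coordinate .h5 files into PNG patches could look like the following. It assumes the coordinates are stored under 'coords' in the level-0 reference frame (as in CLAM's feature-extraction code), that slides are opened with OpenSlide, and that the CAMELYON16 slides are .tif files; paths, level, and size are examples:

import os
import glob
import h5py
import openslide

patch_level = 1      # assumed 20x for CAMELYON16
patch_size = 224
slide_dir = "/work/lzh/data/WSI/CAMELYON16/training/normal"
h5_dir = "/work/lzh/data/test/normal/patches"
out_dir = "/work/lzh/data/test/normal/png"

for h5_path in glob.glob(os.path.join(h5_dir, "*.h5")):
    slide_id = os.path.splitext(os.path.basename(h5_path))[0]
    slide = openslide.OpenSlide(os.path.join(slide_dir, slide_id + ".tif"))
    os.makedirs(os.path.join(out_dir, slide_id), exist_ok=True)
    with h5py.File(h5_path, "r") as f:
        coords = f["coords"][:]
    for x, y in coords:
        # read_region takes level-0 coordinates, the target level, and the patch size
        patch = slide.read_region((int(x), int(y)), patch_level, (patch_size, patch_size)).convert("RGB")
        patch.save(os.path.join(out_dir, slide_id, f"{x}_{y}.png"))
    slide.close()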

akidway commented 4 months ago

Indeed, I overlooked --step_size and --patch_level. Thank you for your generous advice. Best wishes.