Considering Preprocess for RICORD dataset

keeplearning-again commented 4 months ago

Congrats! This is fantastic work! I ran into some problems reproducing the downstream task on the RICORD dataset. On the website you linked, I found folders of DICOM files, and they do not match your nii.gz files before the spacing step. Could you explain in detail how you preprocess the RICORD dataset from the beginning?

yeerwen commented 4 months ago

Hi,

Sorry for missing an important preprocessing detail. Before running the 'Preprocess/RICORD.py' file, one step is needed: convert each series of DICOM files into a single NIfTI file.

I hope this helps you.

keeplearning-again commented 4 months ago

Thanks for your helpful reply! I still have a question about the conversion step. I tried several tools, such as dicom2nifti (Python package) and dcm2niix (command-line tool). But some DICOM series still cannot be converted, due to missing header information, missing DICOM files (the number of files is less than the InstanceNumber), or mismatched spacing. Have you run into these problems before?
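One way to spot the "missing DICOM files" case before attempting a conversion is to compare the InstanceNumber values that are present against the range they should cover. The sketch below is only an illustration: the helper name `missing_instance_numbers` is hypothetical, and it assumes that numbering starts at 1 and is contiguous, which may not hold for every series.

```python
def missing_instance_numbers(instance_numbers, first=1):
    """Return the InstanceNumber values that are absent from a series.

    Assumption: a complete series is numbered first, first + 1, ..., max.
    """
    present = set(instance_numbers)
    if not present:
        return []
    return [n for n in range(first, max(present) + 1) if n not in present]


# Example: a folder holding slices 1, 2, 4 and 6 is missing 3 and 5.
print(missing_instance_numbers([1, 2, 4, 6]))
```

The input list could be gathered with pydicom, for example `int(pydicom.dcmread(path).InstanceNumber)` for each file in the folder.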

yeerwen commented 4 months ago

The dataset utilized was processed in the past. I am now re-downloading the original dataset and will attempt reprocessing. I shall update you with my findings; your patience is appreciated.

yeerwen commented 4 months ago

Hello,

The original RICORD dataset includes some dirty data, leading us to exclude them. Specifically, we omitted two types of data: (1) Slices that differed in shape from the majority, such as '3.000000-NA-11240'. (Only remove the slice) (2) Volumes that either contain irrelevant information ('1.000000-Scout 4CM ABOVE STERNAL NOTCH-60100') or have a small number of slices ('200.000000-Smart Prep Series-27845' and '200.000000-Smart Prep Series-02765').

In total, 19 samples were removed from the dataset. You can verify by following the data split (/Downstream/Dim_3/RICORD/data_split) we have provided.
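The volume-level exclusion described above could be approximated with a small filter on the series folder name and slice count. This is a rough sketch, not the procedure actually used: the name fragments come from the folder names quoted in this reply, while the helper name `should_skip_series` and the `min_slices=10` threshold are assumptions, since the exact cutoff for "a small number of slices" is not stated.

```python
# Name fragments taken from the folder names quoted above; extend as needed.
EXCLUDED_NAME_PARTS = ("scout", "smart prep")


def should_skip_series(folder_name, n_slices, min_slices=10):
    """Return True if a series looks irrelevant or too short to keep.

    min_slices is a guessed threshold, not a value from the thread.
    """
    name = folder_name.lower()
    if any(part in name for part in EXCLUDED_NAME_PARTS):
        return True
    return n_slices < min_slices


print(should_skip_series("1.000000-Scout 4CM ABOVE STERNAL NOTCH-60100", 2))
print(should_skip_series("5.000000-NA-71388", 41))
```

The provided data split remains the reference for which samples were actually kept.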

keeplearning-again commented 4 months ago

I still have some questions about the two exclusion steps you mentioned.

For step 1, I still found files with "3.000000-NA-11240" in the train split file.

For the remaining 330 files, excluding those you mentioned in step 2, I used

import dicom2nifti
dicom2nifti.dicom_series_to_nifti(dicom_folder, output_file, reorient_nifti=True)  # convert DICOM files to a NIfTI file

and got the message below:

RICORD/manifest-1608266677008/MIDRC-RICORD-1A/MIDRC-RICORD-1A-SITE2-000245/02-12-2002-NA-NA-23720/5.000000-NA-71388
Missing slices (slice count mismatch between timepoint 0 and 1)
---------------------------------------------------------
(512, 512, 41)
(512, 512, 2)
---------------------------------------------------------
Traceback (most recent call last):
File "test_RICORD.py", line 245, in <module>
dicom_to_nifti(folder, nii_file_name_without_suffix)
File "test_RICORD.py", line 212, in dicom_to_nifti
dicom2nifti.dicom_series_to_nifti(dicom_folder, output_file, reorient_nifti=True)
File "/mnt/users/anaconda3/envs/vit/lib/python3.8/site-packages/dicom2nifti/convert_dicom.py", line 78, in dicom_series_to_nifti
return dicom_array_to_nifti(dicom_input, output_file, reorient_nifti)
File "/mnt/users/anaconda3/envs/vit/lib/python3.8/site-packages/dicom2nifti/convert_dicom.py", line 118, in dicom_array_to_nifti
results = convert_generic.dicom_to_nifti(dicom_list, output_file)
File "/mnt/users/anaconda3/envs/vit/lib/python3.8/site-packages/dicom2nifti/convert_generic.py", line 249, in dicom_to_nifti
return four_d_to_nifti(grouped_dicoms, output_file)
File "/mnt/users/anaconda3/envs/vit/lib/python3.8/site-packages/dicom2nifti/convert_generic.py", line 82, in four_d_to_nifti
full_block = _get_full_block(grouped_dicoms)
File "/mnt/users/anaconda3/envs/vit/lib/python3.8/site-packages/dicom2nifti/convert_generic.py", line 127, in _get_full_block
raise ConversionError("MISSING_DICOM_FILES")
dicom2nifti.exceptions.ConversionError: MISSING_DICOM_FILES
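When converting a few hundred series, it can help to record which folders fail rather than stop at the first exception like the one above. The following is a minimal sketch under that assumption; `convert_all` and its parameters are hypothetical names, and the converter is passed in as a callable so the loop does not depend on any particular library.

```python
def convert_all(folders, convert, make_output=lambda folder: folder + ".nii.gz"):
    """Run convert(folder, output_path) for each folder and collect failures.

    Returns (converted, failed), where failed maps a folder to its error text.
    """
    converted, failed = [], {}
    for folder in folders:
        try:
            convert(folder, make_output(folder))
            converted.append(folder)
        except Exception as exc:  # e.g. dicom2nifti's ConversionError
            failed[folder] = f"{type(exc).__name__}: {exc}"
    return converted, failed
```

With dicom2nifti, `convert` could be `lambda src, dst: dicom2nifti.dicom_series_to_nifti(src, dst, reorient_nifti=True)`, and the `failed` dictionary then lists every folder that needs a closer look.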

Besides, another common error is that slice increments are inconsistent within the DICOM files.
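If the error meant here is uneven spacing between consecutive slices, it can be checked directly from the slice positions. The sketch below assumes an axial series where each slice position is the z component of ImagePositionPatient; the helper name `increments_consistent` and the tolerance are illustrative choices, not part of any library.

```python
def increments_consistent(positions, tol=1e-3):
    """Return True if consecutive slice positions are evenly spaced within tol."""
    ordered = sorted(positions)
    steps = [b - a for a, b in zip(ordered, ordered[1:])]
    if not steps:
        return True  # zero or one slice: nothing to compare
    return max(steps) - min(steps) <= tol


print(increments_consistent([0.0, 2.5, 5.0, 7.5]))
print(increments_consistent([0.0, 2.5, 7.5]))
```

dicom2nifti also documents a `dicom2nifti.settings.disable_validate_slice_increment()` switch that skips this validation, although the resulting volume then has to be checked by hand.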

yeerwen commented 4 months ago

A1: What I said was "(1) Slices that differed in shape from the majority, such as '3.000000-NA-11240'. (Only remove the slice)". We only remove the erroneous slices, rather than excluding the entire volume. This is the reason "3.000000-NA-11240" is still present in the training file.

A2: I've converted the 'MIDRC-RICORD-1A-SITE2-000245/02-12-2002-NA-NA-23720/5.000000-NA-71388' files into a nii.gz format successfully using your provided code.

A3: Apologies, but I'm not quite sure I understand what you mean. Could you please elaborate?

keeplearning-again commented 4 months ago

For A2, I found some strange folders that cannot be converted correctly, such as "MIDRC-RICORD-1A-SITE2-000028/01-21-2001-NA-NA-47691/2.000000-NA-89760", "MIDRC-RICORD-1A-SITE2-000048/03-11-2004-NA-NA-28171/2.000000-NA-00632", "MIDRC-RICORD-1A-SITE2-000235/03-19-2002-NA-NA-15609/5.000000-NA-38602", "MIDRC-RICORD-1A-SITE2-000245/01-15-2002-NA-NA-80471/2.000000-NA-58989", "MIDRC-RICORD-1A/MIDRC-RICORD-1A-SITE2-000236/07-15-2004-NA-NA-30714/2.000000-NA-29336", etc.

yeerwen commented 4 months ago

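Hi, the code below seems to work. You can try!

import os

import nibabel as nib
import numpy as np
import pydicom


def dicom_series_to_nifti(dicom_dir, output_file):
    # Read every .dcm file in the series folder.
    dicom_files = [pydicom.dcmread(os.path.join(dicom_dir, f))
                   for f in os.listdir(dicom_dir) if f.endswith('.dcm')]
    print(dicom_dir)

    # Order the slices by InstanceNumber.
    dicom_files.sort(key=lambda x: int(x.InstanceNumber))
    dicom_slices = [file.pixel_array for file in dicom_files]

    # Drop outlier slices (remove_outliers is not shown in this reply)
    # and report when the slice count changes.
    ori_len = len(dicom_slices)
    dicom_slices = remove_outliers(dicom_slices)
    now_len = len(dicom_slices)
    if now_len != ori_len:
        print(f"{ori_len} -> {now_len}")

    # Stack the slices along the last axis to form the volume.
    image_data = np.stack(dicom_slices, axis=-1)
    print(image_data.shape)

    # Identity affine: spacing and orientation from the DICOM headers
    # are not carried over.
    affine = np.eye(4)
    nifti_image = nib.Nifti1Image(image_data, affine)
    nib.save(nifti_image, output_file)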

keeplearning-again commented 4 months ago

I just implemented remove_outliers() myself and am wondering whether it follows your procedure.

def remove_outliers(dicom_slices, z_thresh=3):
    pixel_intensities = [slice.ravel() for slice in dicom_slices]
    mean = np.mean(pixel_intensities, axis=1)
    std = np.std(pixel_intensities, axis=1)
    filtered_slices = []
    for slice, m, s in zip(dicom_slices, mean, std):
        z_scores = (slice - m) / s
        if np.all(np.abs(z_scores) < z_thresh):
            filtered_slices.append(slice)
    return filtered_slices
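For comparison, the earlier description in this thread ("Slices that differed in shape from the majority ... Only remove the slice") suggests a shape-based filter rather than an intensity-based one. The sketch below is a guess at that behaviour, not the maintainer's actual implementation: it keeps only the slices whose array shape matches the most common shape in the series.

```python
from collections import Counter

import numpy as np


def remove_outliers(dicom_slices):
    """Keep only the slices whose shape matches the most common shape."""
    if not dicom_slices:
        return []
    shape_counts = Counter(s.shape for s in dicom_slices)
    majority_shape, _ = shape_counts.most_common(1)[0]
    return [s for s in dicom_slices if s.shape == majority_shape]


# Example: one 2x2 slice among 4x4 slices is dropped, so stacking succeeds.
slices = [np.zeros((4, 4)), np.zeros((4, 4)), np.zeros((2, 2)), np.zeros((4, 4))]
volume = np.stack(remove_outliers(slices), axis=-1)
print(volume.shape)
```

A filter of this kind also guarantees that `np.stack` receives equally shaped arrays, which a per-pixel z-score test does not.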

keeplearning-again commented 3 months ago

Hi, I am wondering whether there is a hint for function remove_outliers()?

yeerwen / MedCoSS

Considering Preprocess for RICORD dataset #3