yeerwen / MedCoSS

CVPR 2024 (Highlight)

Considering Preprocess for RICORD dataset #3

Open keeplearning-again opened 4 months ago

keeplearning-again commented 4 months ago

Congrats, this is fantastic work! I ran into some problems reproducing the downstream task on the RICORD dataset. The website you linked provides folders of DICOM files, and they do not match your nii.gz files from before the spacing step. Could you explain in detail how you preprocessed the RICORD dataset from the beginning?

yeerwen commented 4 months ago

Hi,

Sorry for missing an important preprocessing detail. Before running the 'Preprocess/RICORD.py' file, one step is needed: Convert the series of DICOM files into a single NIfTI file.

I hope this helps you.

keeplearning-again commented 4 months ago

Thanks for your helpful reply! I still have a question about the conversion step. I tried several tools, such as dicom2nifti (a Python package) and dcm2niix (a command-line tool). However, some DICOM series still cannot be converted, because of missing header information, missing DICOM files (the number of files is less than the InstanceNumber), or mismatched spacing. Have you run into these problems before?
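
The "fewer files than the InstanceNumber" case can be checked before attempting a conversion. The following is a minimal sketch, assuming the InstanceNumber values of a series are meant to be consecutive integers (common, but not guaranteed). `find_missing_instances` is a hypothetical helper and not part of dicom2nifti or dcm2niix:

```python
def find_missing_instances(instance_numbers):
    """Return the InstanceNumber values absent from an otherwise consecutive run.

    Assumption: the series is numbered with consecutive integers between its
    smallest and largest InstanceNumber.
    """
    present = {int(n) for n in instance_numbers}
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]


# Example with made-up numbers: slices 3 and 6 are missing.
print(find_missing_instances([1, 2, 4, 5, 7]))  # prints [3, 6]
```

With pydicom, the input could be collected as `[ds.InstanceNumber for ds in datasets]`, where each `ds` was read with `pydicom.dcmread`.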

yeerwen commented 4 months ago

The dataset utilized was processed in the past. I am now re-downloading the original dataset and will attempt reprocessing. I shall update you with my findings; your patience is appreciated.

yeerwen commented 4 months ago

Hello,

The original RICORD dataset includes some dirty data, which we excluded. Specifically, we omitted two types of data:

(1) Slices that differed in shape from the majority, such as '3.000000-NA-11240'. (Only remove the slice)

(2) Volumes that either contain irrelevant information ('1.000000-Scout 4CM ABOVE STERNAL NOTCH-60100') or have a small number of slices ('200.000000-Smart Prep Series-27845' and '200.000000-Smart Prep Series-02765').

In total, 19 samples were removed from the dataset. You can verify by following the data split (/Downstream/Dim_3/RICORD/data_split) we have provided.

keeplearning-again commented 4 months ago

I still have some questions about the two omission steps you mentioned.

yeerwen commented 4 months ago

A1: What I said was "(1) Slices that differed in shape from the majority, such as '3.000000-NA-11240'. (Only remove the slice)". We only remove the erroneous slices, rather than excluding the entire volume. This is the reason "3.000000-NA-11240" is still present in the training file.

A2: I've converted the 'MIDRC-RICORD-1A-SITE2-000245/02-12-2002-NA-NA-23720/5.000000-NA-71388' files into a nii.gz format successfully using your provided code.

A3: Apologies, but I'm not quite sure I understand what you mean. Could you please elaborate?

keeplearning-again commented 4 months ago

Regarding A2, I found some strange folders that cannot be converted correctly, such as "MIDRC-RICORD-1A-SITE2-000028/01-21-2001-NA-NA-47691/2.000000-NA-89760", "MIDRC-RICORD-1A-SITE2-000048/03-11-2004-NA-NA-28171/2.000000-NA-00632", "MIDRC-RICORD-1A-SITE2-000235/03-19-2002-NA-NA-15609/5.000000-NA-38602", "MIDRC-RICORD-1A-SITE2-000245/01-15-2002-NA-NA-80471/2.000000-NA-58989", "MIDRC-RICORD-1A/MIDRC-RICORD-1A-SITE2-000236/07-15-2004-NA-NA-30714/2.000000-NA-29336", etc.

yeerwen commented 4 months ago

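Hi, the code below seems to work. You can try!

import os

import nibabel as nib
import numpy as np
import pydicom


def dicom_series_to_nifti(dicom_dir, output_file):
    # Read every .dcm file in the series folder.
    dicom_files = [pydicom.dcmread(os.path.join(dicom_dir, f))
                   for f in os.listdir(dicom_dir) if f.endswith('.dcm')]
    print(dicom_dir)

    # Order the slices by their InstanceNumber.
    dicom_files.sort(key=lambda x: int(x.InstanceNumber))
    dicom_slices = [file.pixel_array for file in dicom_files]

    # Drop erroneous slices and report how many were removed.
    # remove_outliers() is not included in this snippet.
    ori_len = len(dicom_slices)
    dicom_slices = remove_outliers(dicom_slices)
    now_len = len(dicom_slices)
    if now_len != ori_len:
        print(f"{ori_len} -> {now_len}")

    # Stack the remaining slices along the last axis to form the volume.
    image_data = np.stack(dicom_slices, axis=-1)
    print(image_data.shape)

    # Identity affine: spacing and orientation from the DICOM headers are not carried over.
    affine = np.eye(4)
    nifti_image = nib.Nifti1Image(image_data, affine)
    nib.save(nifti_image, output_file)

keeplearning-again commented 4 months ago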

I just implemented remove_outliers() myself and am wondering whether it follows your procedure.

def remove_outliers(dicom_slices, z_thresh=3):
    pixel_intensities = [slice.ravel() for slice in dicom_slices]
    mean = np.mean(pixel_intensities, axis=1)
    std = np.std(pixel_intensities, axis=1)
    filtered_slices = []
    for slice, m, s in zip(dicom_slices, mean, std):
        z_scores = (slice - m) / s
        if np.all(np.abs(z_scores) < z_thresh):
            filtered_slices.append(slice)
    return filtered_slices

keeplearning-again commented 3 months ago

Hi, is there a hint for how the remove_outliers() function is implemented?
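
Going by yeerwen's description earlier in the thread (slices whose shape differs from the majority are removed, and only those slices), one possible reading of remove_outliers() is a shape-majority filter. This is a minimal sketch under that assumption, not the authors' actual implementation:

```python
from collections import Counter


def remove_outliers(dicom_slices):
    """Keep only the slices whose shape matches the most common shape.

    Assumption: "outlier" means "shape differs from the majority", as described
    in the thread. The authors' real criterion may differ.
    """
    if not dicom_slices:
        return []
    # Count how often each 2D shape occurs, e.g. (512, 512).
    shape_counts = Counter(s.shape for s in dicom_slices)
    majority_shape, _ = shape_counts.most_common(1)[0]
    # Preserve the original slice order while dropping mismatched shapes.
    return [s for s in dicom_slices if s.shape == majority_shape]
```

Because it only compares `.shape`, this works directly on the arrays returned by `pixel_array`, and it guarantees that the later `np.stack` call receives slices of equal shape.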