zudi-lin / pytorch_connectomics

PyTorch Connectomics: segmentation toolbox for EM connectomics
http://connectomics.readthedocs.io/
MIT License

How to merge MitoEM output files? #63

Closed Chenliang-Gu closed 3 years ago

Chenliang-Gu commented 3 years ago

I ran your MitoEM-R-A.yaml config for the MitoEM challenge, but many H5 files appear at inference time. How can I merge them? The H5 file list is as follows: [image]

Michaelsqj commented 3 years ago
  1. Restricted by memory, loading the whole volume is not practical. As in training, only part of the volume (a small chunk) is loaded into memory at a time during prediction. 0-111-0-2730-1365-4096 represents the coordinates of that chunk as z_min-z_max-x_min-x_max-y_min-y_max; for more details about the coordinates, take a look at https://github.com/zudi-lin/pytorch_connectomics/blob/87ab8fcddd8841f03fe8f7f4dc2b8c075815a4ff/connectomics/engine/trainer.py#L272 and https://github.com/zudi-lin/pytorch_connectomics/blob/87ab8fcddd8841f03fe8f7f4dc2b8c075815a4ff/connectomics/data/dataset/dataset_tile.py#L110. You can specify how many parts you want in each direction in the configuration here: https://github.com/zudi-lin/pytorch_connectomics/blob/87ab8fcddd8841f03fe8f7f4dc2b8c075815a4ff/configs/MitoEM/MitoEM-R-BC.yaml#L25. There, [8, 2, 2] means the volume is split into 8 parts along z, 2 along x, and 2 along y, so there will be 32 chunks (32 .h5 prediction files) in total.

  2. To merge the volumes, simply load these prediction files and add them up, averaging the regions where chunks overlap (there are overlaps, as you can see from the coordinates).
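The [8, 2, 2] split above can be sketched as follows. This is a simplified illustration rather than the repo's actual code: the real TileDataset pads and overlaps chunks (which is why a chunk spans 0-111 rather than an even 0-62), and chunk_ranges/chunk_coords are hypothetical helper names.

```python
# Simplified sketch: split a (z, x, y) volume into [8, 2, 2] parts,
# ignoring the padding/overlap that TileDataset adds in practice.
def chunk_ranges(length, parts):
    """Split [0, length) into `parts` contiguous ranges."""
    step = length // parts
    return [(i * step, length if i == parts - 1 else (i + 1) * step)
            for i in range(parts)]

def chunk_coords(shape, splits):
    """All (zmin, zmax, xmin, xmax, ymin, ymax) chunks for a volume."""
    coords = []
    for z0, z1 in chunk_ranges(shape[0], splits[0]):
        for x0, x1 in chunk_ranges(shape[1], splits[1]):
            for y0, y1 in chunk_ranges(shape[2], splits[2]):
                coords.append((z0, z1, x0, x1, y0, y1))
    return coords

# [8, 2, 2] over the (500, 4096, 4096) test volume -> 32 chunks
print(len(chunk_coords((500, 4096, 4096), (8, 2, 2))))  # 32
```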

Michaelsqj commented 3 years ago
import gc
import glob
import os
import time

import numpy as np
from connectomics.data.utils import readvol


def agglomerate(preddir):
    """Merge overlapping chunk predictions into one volume by averaging.

    Expects filenames in the format *_zmin-zmax-xmin-xmax-ymin-ymax_*.h5,
    whose first three characters are the absolute z index of the chunk.
    """
    files = [os.path.basename(f) for f in sorted(glob.glob(preddir + '*.h5'))]
    coords = []
    for name in files:
        token = name.split('.')[0].split('_')[1]
        c = [int(x) for x in token.split('-')]
        # The filenames store absolute z indices; subtract 500 (the z offset
        # of the MitoEM test volume in my setup) to get volume coordinates.
        start_z = int(name[:3])
        c[0] += start_z - 500
        c[1] += start_z - 500
        coords.append(c)
    print(list(zip(coords, files)))

    # Two prediction channels over the (500, 4096, 4096) MitoEM test volume;
    # uint16 leaves headroom for summing overlapping uint8 chunks.
    volume = np.zeros((2, 500, 4096, 4096), dtype=np.uint16)
    covered = np.zeros(volume.shape[1:], dtype=np.uint16)
    print('create a volume', volume.shape, volume.dtype)

    for i, name in enumerate(files):
        print('agglomerate', i, coords[i])
        time1 = time.time()
        zmin, zmax, xmin, xmax, ymin, ymax = coords[i]
        covered[zmin:zmax, xmin:xmax, ymin:ymax] += 1
        temp = readvol(os.path.join(preddir, name))
        volume[:, zmin:zmax, xmin:xmax, ymin:ymax] += temp
        del temp  # free the chunk before loading the next one
        gc.collect()
        print('agglomerate time', time.time() - time1)

    # Average the overlapping regions; np.maximum avoids dividing by zero
    # for voxels no chunk covered. Note: this division allocates a float64
    # copy of the whole volume.
    volume = volume / np.maximum(covered, 1).astype(float)
    del covered
    gc.collect()
    return volume.astype(np.uint8)

Maybe you can try this simple script.

Before using the function, make sure your filenames are in the format *_zmin-zmax-xmin-xmax-ymin-ymax_*.h5, where * can be any string. The .h5 files for the whole volume should also all be in the same folder.
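As a quick sanity check before merging, the filename parsing the script relies on can be isolated. parse_coords below is a hypothetical helper, not part of the repo; it mirrors the split('.')[0].split('_')[1] logic used above.

```python
# Hypothetical helper: extract the six chunk coordinates from a filename
# in the *_zmin-zmax-xmin-xmax-ymin-ymax_*.h5 format.
def parse_coords(filename):
    # Mirrors the split('.')[0].split('_')[1] parsing used in the script.
    token = filename.split('.')[0].split('_')[1]
    parts = token.split('-')
    if len(parts) != 6:
        raise ValueError(f'unexpected filename format: {filename}')
    return [int(p) for p in parts]

print(parse_coords('vol_0-111-0-2730-1365-4096.h5'))
```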

zudi-lin commented 3 years ago

It seems that the output name is not correct when using the TileDataset with test-time augmentation.

Chenliang-Gu commented 3 years ago

Hello, I'll try this script. But it does not seem to save the result as images, and I have to use the MitoEM-challenge submission utility to convert the test images into h5 files for submission: https://github.com/donglaiw/MitoEM-challenge/blob/main/aux/convert_images_into_h5.py

zudi-lin commented 3 years ago

Hi @biyuruofeng, the util function provided by @Michaelsqj first combines the output volumes into a single numpy array. You can take that numpy array and save it as an h5 file for submission. Please let me know if you have other questions!

Chenliang-Gu commented 3 years ago

It seems the output names do not match what the script expects, so do I have to rename the output files?

zudi-lin commented 3 years ago

Yeah, there was a bug in generating the output file name. I fixed that today: https://github.com/zudi-lin/pytorch_connectomics/commit/5f3a00c4d479532b1b95d1a003e8700a45fb50c3. For your convenience, it is easier to just rename the files and avoid running inference again; alternatively, you can pull the latest commit and rerun prediction.

Chenliang-Gu commented 3 years ago

Hello, I ran the function but got this error: MemoryError: Unable to allocate 62.5 GiB for an array with shape (500, 4096, 4096) and data type float64

zudi-lin commented 3 years ago

Before running the segmentation algorithm, the volumes contain the probability scores of the affinity prediction, which are rescaled to [0, 255] in uint8 format. In your case you are using float64, which is too large. Please note that the model prediction is not the final segmentation, but the affinity or contour map.
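A minimal sketch of that rescaling, assuming the raw predictions are probabilities in [0, 1]: keeping the merged volume in uint8 instead of float64 shrinks it eightfold. to_uint8 is a hypothetical helper name, not a repo function.

```python
import numpy as np

def to_uint8(prob):
    """Rescale float probabilities in [0, 1] to uint8 in [0, 255]."""
    return np.clip(prob * 255.0, 0, 255).astype(np.uint8)

prob = np.array([0.0, 0.5, 1.0])
print(to_uint8(prob).tolist())  # [0, 127, 255]
```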

Chenliang-Gu commented 3 years ago

Hello, I meet various errors when running this function. Can you give me a complete tutorial or code to run the MitoEM-challenge baseline, including submission? For example:

volume[:, zmin:zmax, ymin:ymax, xmin:xmax] += temp
ValueError: operands could not be broadcast together with shapes (2,111,2730,2730) (3,111,2730,2730) (2,111,2730,2730)

Chenliang-Gu commented 3 years ago

Also, I don't understand why start_z (min_z) needs 500 subtracted from it.

zudi-lin commented 3 years ago

@biyuruofeng I will update the tutorial with more details next week.

zudi-lin commented 3 years ago

@biyuruofeng please check the updated tutorial: https://zudi-lin.github.io/pytorch_connectomics/build/html/tutorials/mito.html#instance-segmentation

Chenliang-Gu commented 3 years ago

Hello, I have tried your code, but the MitoEM-challenge submission code is convert_images_into_h5.py:

"""
Example of how predicted images can be converted into an H5 file. Notice that,
connected components is applied to label different mitochondria to match instance 
segmentation problem.
You should modify the following variables:
    - pred_dir : path to the directory from which the images will be read
    - h5file_name : name of the H5 file to be created (follow the instructions in
                    https://mitoem.grand-challenge.org/Evaluation/ to name the 
                    files accordingly)
The H5 file should be saved in the directory where this script was called
"""

import os                                                                       
import h5py
import numpy as np                                                              
from skimage.io import imread
from tqdm import tqdm
from skimage import measure, feature
from scipy import ndimage
from PIL import ImageEnhance, Image
from os import path

img_shape = (4096, 4096)
pred_dir = 'binarized_50ov'
pred_ids = sorted(next(os.walk(pred_dir))[2])  
h5file_name = '0_human_instance_seg_pred.h5'

# Allocate memory for the predictions
pred_stack = np.zeros((len(pred_ids),) + img_shape, dtype=np.int64)

# Read all the images
for n, id_ in tqdm(enumerate(pred_ids)):
    img = imread(os.path.join(pred_dir, id_))
    pred_stack[n] = img

# Downsample by 2
#pred_stack = pred_stack[:,::2,::2] 

# Apply connected components to make instance segmentation
pred_stack = (pred_stack / 255).astype('int64')
pred_stack, nr_objects = ndimage.label(pred_stack)
print("Number of objects {}".format(nr_objects))

# Create the h5 file (using lzf compression to save space)
h5f = h5py.File(h5file_name, 'w')
h5f.create_dataset('dataset_1', data=pred_stack, compression="lzf")
h5f.close()

https://github.com/donglaiw/MitoEM-challenge/blob/main/aux/convert_images_into_h5.py It seems that your code's output is not suitable for this script, so what should I do to use it for submission?

zudi-lin commented 3 years ago

hi @biyuruofeng, sorry I did not get your question. The model outputs are several .h5 files. After merging and segmentation with step 6 in https://zudi-lin.github.io/pytorch_connectomics/build/html/tutorials/mito.html#instance-segmentation, it will become one single volume. You can directly save it as an .h5 file for submission (https://mitoem.grand-challenge.org/Evaluation/).

The tutorial only covers the rat dataset. The same operations need to be done for the other volume, MitoEM-H.
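Saving that single merged volume for submission can be sketched as below, reusing the dataset name and lzf compression from convert_images_into_h5.py. The zero array here is only a placeholder for the real segmentation volume.

```python
import h5py
import numpy as np

# Placeholder for the merged instance segmentation; the real volume is
# (500, 4096, 4096) for each MitoEM test set.
segm = np.zeros((4, 8, 8), dtype=np.int64)

# Same dataset name and compression as convert_images_into_h5.py; the
# filename follows the grand-challenge naming instructions.
with h5py.File('0_human_instance_seg_pred.h5', 'w') as h5f:
    h5f.create_dataset('dataset_1', data=segm, compression='lzf')
```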

Chenliang-Gu commented 3 years ago

Hello, I've tried your code and ran out of memory, even though I have 128 GB of RAM. It seems your code needs a lot of memory. Have you tried running it yourself, and has this ever happened to you?

import glob
import numpy as np
from connectomics.data.utils import readvol
from connectomics.utils.processing import bc_watershed

output_files = 'outputs/MitoEM_R_BC/test/*.h5' # output folder with chunks
chunks = glob.glob(output_files)

vol_shape = (2, 500, 4096, 4096) # MitoEM test set
pred = np.ones(vol_shape, dtype=np.uint8)
for x in chunks:
    pos = x.strip().split("/")[-1]
    print("process chunk: ", pos)
    pos = pos.split("_")[-1].split(".")[0].split("-")
    pos = list(map(int, pos))
    chunk = readvol(x)
    pred[:, pos[0]:pos[1], pos[2]:pos[3], pos[4]:pos[5]] = chunk

# This function processes the array in numpy.float64 format.
# Please allocate enough memory before running it.
segm = bc_watershed(pred, thres1=0.85, thres2=0.6, thres3=0.8, thres_small=1024)

zudi-lin commented 3 years ago

Hi @biyuruofeng, yes, I used 100 GB of memory for the post-processing and it works for me. Do you have other programs running? The watershed function requires numpy.float64, which is memory-consuming.
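The 62.5 GiB figure in the earlier error message checks out with a back-of-envelope calculation:

```python
# One float64 channel of the (500, 4096, 4096) test volume:
print(500 * 4096 * 4096 * 8 / 2**30)   # 62.5 (GiB), matching the MemoryError
# The two-channel uint8 prediction volume, by contrast:
print(2 * 500 * 4096 * 4096 / 2**30)   # 15.625 (GiB)
```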

Chenliang-Gu commented 3 years ago

Hello, I don't have other programs running. What's more, I get 64 result files, as follows: [image] So your code's pos = pos.split("_")[-1].split(".")[0].split("-") should be pos = pos.split("_")[1].split(".")[0].split("-"). Are the H5 files I got correct? In addition, can I change float64 to float16 to save memory?

Chenliang-Gu commented 3 years ago

What's more, merging the images and then running watershed consumes too much memory. Can I run watershed on the images one by one? What should I do?

zudi-lin commented 3 years ago

Hi @biyuruofeng, yes, applying watershed chunk by chunk and then merging the results is doable. The only problem is that objects touching the chunk borders may be over-segmented, but most of the objects should be fine. And thanks for the bug report about split("_")[1]; will change that soon.
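The chunk-by-chunk approach can be sketched roughly as follows. segment_chunkwise is hypothetical, and segment_fn stands in for e.g. bc_watershed applied to a single chunk; the sketch assumes non-overlapping chunks, and objects crossing chunk borders may still be split, as noted above.

```python
import numpy as np

def segment_chunkwise(chunks, segment_fn):
    """Segment each chunk separately and paste results into one volume.

    chunks: iterable of (zmin, zmax, ymin, ymax, xmin, xmax, pred_chunk).
    segment_fn: maps one prediction chunk to an instance label array.
    """
    z = max(c[1] for c in chunks)
    y = max(c[3] for c in chunks)
    x = max(c[5] for c in chunks)
    out = np.zeros((z, y, x), dtype=np.uint32)
    offset = 0
    for z0, z1, y0, y1, x0, x1, pred in chunks:
        seg = segment_fn(pred).astype(np.uint32)
        seg[seg > 0] += offset              # keep instance IDs globally unique
        offset = max(offset, int(seg.max()))
        out[z0:z1, y0:y1, x0:x1] = seg
    return out
```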

Chenliang-Gu commented 3 years ago

Most people don't have over 100 GB of memory, and even I, with over 100 GB, ran out of memory. So can you update the code to apply the watershed algorithm chunk by chunk?

zudi-lin commented 3 years ago

@biyuruofeng Thank you for the suggestion, but we provide this code only to serve as a baseline for comparison. We encourage all participants to apply novel approaches for accurate and efficient mitochondria segmentation.