Open ShichenLiu opened 5 years ago
I think it is because the RGB image size is (1200, 1600) during evaluation, while it is only (512, 640) during training, which is why the out-of-memory error appears at evaluation time.
Hope this answer helps you!
@ShichenLiu Hi, have you solved the problem?
RuntimeError: CUDA out of memory. Tried to allocate 2.29 GiB (GPU 0; 10.73 GiB total capacity; 7.34 GiB already allocated; 1.63 GiB free; 997.85 MiB cached)
@tiffany61706 Hi, but if I use a (512, 640) image, I get the error "AssertionError: assert np_img.shape[:2] == (1200, 1600)" at /MVSNet_pytorch/datasets/dtu_yao_eval.py, line 63, in read_img.
And after I change that line to assert np_img.shape[:2] == (512, 640), I get the error "RuntimeError: The size of tensor a (31) must match the size of tensor b (32) at non-singleton dimension 3" in /MVSNet_pytorch/models/mvsnet.py.
So I think it is not a good idea to change the size of the input image.
@whubaichuan Hi, if you want to use (1200, 1600) as input without causing OOM, I suggest adding a few lines in mvsnet.py such as below. I think OOM won't happen again.
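Something like the following (a sketch only; it assumes your mvsnet.py builds the cost volume with the same variable names as the original MVSNet_pytorch loop, i.e. ref_volume, volume_sum, volume_sq_sum, and homo_warping — adjust to your copy):

    # Inside MVSNet.forward(): free each intermediate volume as soon as it has
    # been folded into the accumulators, instead of keeping them all alive.
    volume_sum = ref_volume
    volume_sq_sum = ref_volume ** 2
    del ref_volume  # drop the now-redundant name
    for src_fea, src_proj in zip(src_features, src_projs):
        warped_volume = homo_warping(src_fea, src_proj, ref_proj, depth_values)
        volume_sum = volume_sum + warped_volume
        # square in place (safe under torch.no_grad() at eval time);
        # warped_volume is not needed intact after this point
        volume_sq_sum = volume_sq_sum + warped_volume.pow_(2)
        del warped_volume  # release this source view's volume immediately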
@tiffany61706
Thanks for your reply. I just tried your method, but it failed again.
RuntimeError: CUDA out of memory. Tried to allocate 2.29 GiB (GPU 0; 10.73 GiB total capacity; 7.34 GiB already allocated; 1.63 GiB free; 997.85 MiB cached)
@whubaichuan Hi, can you paste the full terminal message?
@tiffany61706 Hi, here is the full terminal message.
Traceback (most recent call last):
File "eval.py", line 302, in
@whubaichuan Sorry, I don't have WeChat. I think all you need is to add a few lines in mvsnet.py. If there is still an OOM problem, please delete arrays that are already finished to release the memory.
Did anyone solve this issue? I have the same problem. I have an 11 GB GPU; is this not enough?
@soulslicer Delete some variables to release the memory.
The original size 1600x1184 caused OOM on my 11GB GPU. I resized the image to 1152x864 and it works (costs 6831MB). Don't forget to change the intrinsics as follows:
    def read_cam_file(self, filename):
        with open(filename) as f:
            lines = f.readlines()
        lines = [line.rstrip() for line in lines]
        # extrinsics: lines [1, 5), 4x4 matrix
        extrinsics = np.fromstring(' '.join(lines[1:5]), dtype=np.float32, sep=' ').reshape((4, 4))
        # intrinsics: lines [7, 10), 3x3 matrix
        intrinsics = np.fromstring(' '.join(lines[7:10]), dtype=np.float32, sep=' ').reshape((3, 3))
        intrinsics[:2, :] /= 4
        # CHANGE K ACCORDING TO SIZE!
        intrinsics[0] *= 1152 / 1600
        intrinsics[1] *= 864 / 1200
        ###############################
        # depth_min & depth_interval: line 11
        depth_min = float(lines[11].split()[0])
        depth_interval = float(lines[11].split()[1]) * self.interval_scale
        return intrinsics, extrinsics, depth_min, depth_interval

    def read_img(self, filename):
        img = Image.open(filename)
        # RESIZE IMAGE
        img = img.resize((1152, 864), Image.BILINEAR)
        # scale 0~255 to 0~1
        np_img = np.array(img, dtype=np.float32) / 255.
        return np_img
Also, you need to change the code in eval.py:

line 56:
    intrinsics[0] *= 1152 / 1600
    intrinsics[1] *= 864 / 1200

line 60:
    def read_img(filename):
        img = Image.open(filename).resize((1152, 864), Image.BILINEAR)

line 268:
    color = ref_img[::4, ::4, :][valid_points]  # hardcoded for DTU dataset
@kwea123 Hi, do you know why the intrinsics need to be divided ("intrinsics[:2, :] /= 4") in read_cam_file in eval mode, but not in read_cam_file in train mode? See your code of the train mode here. Is that because of the dataset?
It depends on which data it uses. For training, the intrinsic numbers are already preprocessed to match the size (160, 128), so there is no need for dividing. In the test data, however, the intrinsics are the original numbers, which are for size (1600, 1200); so to match the output of the network (which is 1/4 of the original scale, since there are two stride-2 downsamplings in the CNN), we need to divide the intrinsics by 4.
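To make this concrete, here is the scaling rule as a small numpy sketch (the intrinsic values below are made up for illustration):

    import numpy as np

    def scale_intrinsics(K, scale_x, scale_y):
        # Row 0 of K holds fx and cx (x axis), row 1 holds fy and cy (y axis),
        # so each row is scaled by the resize factor along its own axis.
        K = K.copy()
        K[0] *= scale_x
        K[1] *= scale_y
        return K

    # Test-time intrinsics are given at the original 1600x1200 resolution.
    K = np.array([[2890.0,    0.0, 800.0],
                  [   0.0, 2880.0, 600.0],
                  [   0.0,    0.0,   1.0]], dtype=np.float32)

    K = scale_intrinsics(K, 1152 / 1600, 864 / 1200)  # after resizing to 1152x864
    K[:2] /= 4  # match the network output, which is 1/4 of the input resolution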
@kwea123 Hi! When I use your method of resizing the input image to solve the OOM in evaluation, the following problem occurs:
Traceback (most recent call last):
File "eval.py", line 314, in <module>
save_depth()
File "eval.py", line 118, in save_depth
for batch_idx, sample in enumerate(TestImgLoader):
File "/home/data1/hzc/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
data = self._next_data()
File "/home/data1/hzc/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
return self._process_data(data)
File "/home/data1/hzc/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
data.reraise()
File "/home/data1/hzc/anaconda3/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
raise self.exc_type(msg)
AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/data1/hzc/anaconda3/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/home/data1/hzc/anaconda3/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/data1/hzc/anaconda3/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/diskC/hzc/MVSNet_pytorch-master/datasets/dtu_yao_eval.py", line 93, in __getitem__
imgs.append(self.read_img(img_filename))
File "/diskC/hzc/MVSNet_pytorch-master/datasets/dtu_yao_eval.py", line 68, in read_img
assert np_img.shape[:2] == (1152, 864)
AssertionError
Do you know how to solve it?
Notice that images will be downsized in feature extraction; plus, with the four-scale encoder-decoder structure in the 3D regularization part, the input image size must be divisible by a factor of 32. Considering this requirement and also the limited GPU memory, we downsize the image resolution from 1600×1200 to 800×600, and then crop the image patch with W = 640 and H = 512 from the center as the training input. The input camera parameters are changed accordingly.
@agenthong Just remove that assertion (assert np_img.shape[:2] == (1152, 864)). The shape is actually (864, 1152), but you don't need that assertion anyway.
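This is also why 1152x864 works: both dimensions are divisible by 32. If you want a different size, snap it down to a multiple of 32 first, e.g. (a hypothetical helper, not from the repo):

    def round_down_to_multiple(x, base=32):
        # largest multiple of `base` not exceeding x
        return (x // base) * base

    w, h = 1600 // 2, 1200 // 2      # 800 x 600, as in the paper's preprocessing
    w = round_down_to_multiple(w)    # 800 (already divisible by 32)
    h = round_down_to_multiple(h)    # 576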
Also, it seems that the author has abandoned this repo and doesn't respond anymore. I have no interest in debugging others' code in detail, so my responses to this thread end here. I have my own implementation; feel free to contact me if there's any bug, thank you.
@kwea123 Got it. Thanks anyway.
@whubaichuan I've already downsized the image to (640, 512) but still got OOM in evaluation: RuntimeError: CUDA out of memory. Tried to allocate 2.71 GiB (GPU 0; 7.77 GiB total capacity; 6.45 GiB already allocated; 424.50 MiB free; 6.56 GiB reserved in total by PyTorch). He said it only costs around 6.8 GB his way.
The following problem occurs:
Do you know how to solve it? @whubaichuan
To free up more memory, one can also delete volume_sq_sum and volume_sum after volume_variance is computed. After this change, it works on my 11G GPU without resizing the image. Hope it may help someone.
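Concretely, assuming the variance is computed with in-place ops as in the original code, the change amounts to the following sketch (not a tested patch):

    # cost-volume variance: Var(X) = E[X^2] - (E[X])^2, computed in place
    volume_variance = volume_sq_sum.div_(num_views).sub_(volume_sum.div_(num_views).pow_(2))
    del volume_sq_sum, volume_sum  # frees roughly two full cost volumes before 3D regularization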
File "eval.py", line 307, in save_depth()
File "eval.py", line 118, in save_depth outputs = model(sample_cuda["imgs"], sample_cuda["proj_matrices"], sample_cuda["depth_values"])
File "/home/amax/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call result = self.forward(*input, **kwargs)
File "/home/amax/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/amax/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/amax/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply raise output
File "/home/amax/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker output = module(*input, **kwargs)
File "/home/amax/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call result = self.forward(*input, **kwargs)
File "/home/amax/shenye/colmapTest1/MVSNet_pytorch-master/models/mvsnet.py", line 132, in forward cost_reg = self.cost_regularization(volume_variance)
File "/home/amax/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call result = self.forward(*input, **kwargs)
File "/home/amax/shenye/colmapTest1/MVSNet_pytorch-master/models/mvsnet.py", line 66, in forward x = conv4 + self.conv7(x)
RuntimeError: The size of tensor a (31) must match the size of tensor b (32) at non-singleton dimension 3`
Hello, I got the same problem as you. How did you deal with this problem?
Following @kwea123's guide above, it finally works on my device (Tesla K80 11G)!!!🎉 Also, I had manually resized the images to 600*800 and changed the camera intrinsics myself before, but that failed. Besides, after the depth map generation finished, it failed as below:
File "eval.py", line 269, in filter_depth
color = ref_img[::4, ::4, :][valid_points]
IndexError: boolean index did not match indexed array along dimension 0; dimension is 300 but corresponding boolean dimension is 216
The final .pfm depth maps are nearly empty, and point cloud files cannot be generated.
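Maybe relevant: 300 = 1200/4 and 216 = 864/4, so it looks like ref_img was still read at the original 1600x1200 resolution while the depth map was produced at 1152x864. A quick check one could drop in before that line (hypothetical, using the names from the eval.py snippet above):

    # the 4x-downsampled reference image must match the mask's resolution
    assert ref_img[::4, ::4].shape[:2] == valid_points.shape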
Hello, I modified the code according to your method, but the following problem still occurred. @Willyzw
Traceback (most recent call last):
File "/home/ly/Work/MVSNet_pytorch/train.py", line 276, in
@whubaichuan I'm sorry to bother you. Have you solved this problem? Can you tell me how to solve this problem?
Where should the code be changed to run on multiple cards?
Traceback (most recent call last):
File "/home/camellia/zyf/MVSNet_pytorch-master/eval.py", line 302, in
I have two 3060 cards. How can I solve this problem? Thank you!
Did you solve this problem? How did you solve it?
@agenthong please refer to my project MVSNet
@ChenLiufeng please refer to my project MVSNet
@Innocence4822 please refer to my project MVSNet
@zhao-you-fei please refer to my project MVSNet
You changed mvsnet.py? Does this approach solve the memory problem?
@zhao-you-fei yes, solved
Hi,
Thanks for the excellent project! I use multiple RTX 2080s to run the code. However, the code causes OOM during evaluation (eval.sh). Since the batch size is 1, only a single GPU is used. Yet I could not figure out why it doesn't cause OOM during training.
Can you give an example of which kind of GPU is enough for testing? Thanks!