yikang-li / FactorizableNet

Factorizable Net (Multi-GPU version): An Efficient Subgraph-based Framework for Scene Graph Generation

RuntimeError: cuda runtime error (8) : invalid device function at /pytorch/torch/lib/THC/generic/THCTensorMath.cu:35 #5

Closed LinXin2018 closed 6 years ago

LinXin2018 commented 6 years ago

Hello author: When I try to run your code, it reports:

THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCTensorMath.cu line=35 error=8 : invalid device function
Traceback (most recent call last):
  File "train_FN.py", line 390, in <module>
    main()
  File "train_FN.py", line 277, in main
    use_gt_boxes=args.use_gt_boxes)
  File "/home/linxin/FactorizableNet/models/HDN_v2/engines_v1.py", line 123, in test
    use_gt_boxes=use_gt_boxes)
  File "/home/linxin/FactorizableNet/models/HDN_v2/factorizable_network_v4.py", line 271, in evaluate
    object_result, predicate_result = self.forward_eval(im_data, im_info,)
  File "/home/linxin/FactorizableNet/models/HDN_v2/factorizable_network_v4.py", line 232, in forward_eval
    pooled_object_features = self.roi_pool_object(features, object_rois).view(len(object_rois), -1)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/linxin/FactorizableNet/lib/roi_align/modules/roi_align.py", line 16, in forward
    self.spatial_scale)(features, rois)
  File "/home/linxin/FactorizableNet/lib/roi_align/functions/roi_align.py", line 22, in forward
    output = features.new(num_rois, num_channels, self.aligned_height, self.aligned_width).zero_()
RuntimeError: cuda runtime error (8) : invalid device function at /pytorch/torch/lib/THC/generic/THCTensorMath.cu:35

I changed the lib/make.sh file because my CUDA_ARCH does not support 6.0. The make.sh script seems to work for me, with only a few warnings (listed below). After reading https://github.com/jwyang/faster-rcnn.pytorch/issues/110, I rebuilt with make.sh several times, but the CUDA error persists.

/home/linxin/FactorizableNet/lib/roi_crop/src/roi_crop.c: In function ‘BilinearSamplerBHWD_updateGradInput’:
/home/linxin/FactorizableNet/lib/roi_crop/src/roi_crop.c:190:14: warning: unused variable ‘inBottomRight’ [-Wunused-variable]
         real inBottomRight=0;
              ^
/home/linxin/FactorizableNet/lib/roi_crop/src/roi_crop.c:189:14: warning: unused variable ‘inBottomLeft’ [-Wunused-variable]
         real inBottomLeft=0;
              ^
/home/linxin/FactorizableNet/lib/roi_crop/src/roi_crop.c:188:14: warning: unused variable ‘inTopRight’ [-Wunused-variable]
         real inTopRight=0;
              ^
/home/linxin/FactorizableNet/lib/roi_crop/src/roi_crop.c:187:14: warning: unused variable ‘inTopLeft’ [-Wunused-variable]
         real inTopLeft=0;
              ^
/home/linxin/FactorizableNet/lib/roi_crop/src/roi_crop.c:186:14: warning: unused variable ‘v’ [-Wunused-variable]
         real v=0;
              ^
/home/linxin/FactorizableNet/lib/roi_crop/src/roi_crop.c: In function ‘BilinearSamplerBCHW_updateGradInput’:
/home/linxin/FactorizableNet/lib/roi_crop/src/roi_crop.c:441:14: warning: unused variable ‘inBottomRight’ [-Wunused-variable]
         real inBottomRight=0;
              ^
/home/linxin/FactorizableNet/lib/roi_crop/src/roi_crop.c:440:14: warning: unused variable ‘inBottomLeft’ [-Wunused-variable]
         real inBottomLeft=0;
              ^
/home/linxin/FactorizableNet/lib/roi_crop/src/roi_crop.c:439:14: warning: unused variable ‘inTopRight’ [-Wunused-variable]
         real inTopRight=0;
              ^
/home/linxin/FactorizableNet/lib/roi_crop/src/roi_crop.c:438:14: warning: unused variable ‘inTopLeft’ [-Wunused-variable]
         real inTopLeft=0;
              ^
/home/linxin/FactorizableNet/lib/roi_crop/src/roi_crop.c:437:14: warning: unused variable ‘v’ [-Wunused-variable]
         real v=0;
              ^
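For reference, the architecture passed to nvcc in a build script like lib/make.sh generally has to match the GPU's compute capability (a GTX 1080 Ti, mentioned later in this thread, is 6.1). The following is a minimal sketch of how the matching `-gencode` flags could be derived. The helper name `gencode_flags` is hypothetical and not part of this repository, and the exact way make.sh consumes CUDA_ARCH is not shown here.

```python
def gencode_flags(capabilities, with_ptx=True):
    """Build nvcc -gencode flags for a list of (major, minor) compute capabilities.

    Hypothetical helper for illustration; not part of FactorizableNet.
    """
    unique = sorted(set(capabilities))
    flags = []
    for major, minor in unique:
        arch = "{}{}".format(major, minor)
        # Native machine code (SASS) for this exact architecture.
        flags.append("-gencode arch=compute_{0},code=sm_{0}".format(arch))
    if with_ptx and unique:
        # Also embed PTX for the newest listed architecture so that
        # newer GPUs can JIT-compile the kernels at load time.
        major, minor = unique[-1]
        newest = "{}{}".format(major, minor)
        flags.append("-gencode arch=compute_{0},code=compute_{0}".format(newest))
    return flags


if __name__ == "__main__":
    # Example: flags for a compute capability 6.1 card.
    print(" ".join(gencode_flags([(6, 1)])))
```

After changing the architecture, the compiled extensions usually need a clean rebuild so that stale object files built for another architecture are not reused.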

My environment is CUDA 8.0, PyTorch 0.3.1, Python 2.7.

Hope to receive your reply!

THX

yikang-li commented 6 years ago

Can you provide more details?

yikang-li commented 6 years ago

Maybe you can try training the model with a pretrained RPN: --rpn /path/to/rpn

LinXin2018 commented 6 years ago

Hello!! author:

This error first occurred when I tried to evaluate the model with the pretrained model. I added the --rpn option, but it did not work! I also tried to train the model, but the same error occurred! Right after training started, it reported THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCTensorCopy.cu line=204 error=8 : invalid device function, and in engines_v1.py it jumps to the exception handler. It seems that the RoIAlign function does not work well. Thank you!

yikang-li commented 6 years ago

Yes, it is because VG-DR-Net has some self-relations, e.g. A-relation-A. Please run git pull to update to the latest version, and check the Updates section in the README for more information.

If it works, please comment here and I will close the issue.

LinXin2018 commented 6 years ago

Thank you for updating the code. However, unfortunately the CUDA runtime error (8) still remains! Maybe it is because PyTorch 0.3.1 + Python 2.7 does not fit the GTX 1080 Ti architecture (CUDA 8.0). I will try to run the code under PyTorch 0.4.x.

yikang-li commented 6 years ago

Sorry, I haven't met this issue before. Maybe you can try evaluation and training on different datasets to check whether it happens only with specific settings or with any settings. Hope to hear more about the issue.

LinXin2018 commented 6 years ago

Hello! Today I tried to run evaluation on the DR dataset and tried to train the RPN network, and exactly the same error occurred. I think it is a CUDA environment issue rather than a bug in the code. LinXin

yikang-li commented 6 years ago

I am so sorry to hear that.

I think you can use pdb to track where the bug happened, so we can check whether it is caused by the settings of the code or by the general configuration of your server.

Looking forward to any updates.

LinXin2018 commented 6 years ago

Hello, I have debugged the code before; it jumps to the exception handler at engines_v1.py, line 86.

yikang-li commented 6 years ago

It ends at line 86 because we have an exception catch there. That is not the position where the error happened. I highly recommend using pdb to run the code step by step to check where it actually happens.
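To illustrate the point about the exception catch, here is a minimal sketch of how the real failure site can be recovered from inside an `except` block instead of stopping at the handler. The functions `failing_op`, `run_step`, and `guarded_run` are hypothetical stand-ins and are not taken from engines_v1.py.

```python
import sys
import traceback


def failing_op():
    # Hypothetical stand-in for the call that actually fails.
    raise RuntimeError("cuda runtime error (8) : invalid device function")


def run_step():
    # Hypothetical stand-in for one training or evaluation step.
    return failing_op()


def innermost_frame(tb):
    """Return (filename, lineno, function name) of the deepest frame in a traceback."""
    frames = traceback.extract_tb(tb)
    filename, lineno, name, _ = frames[-1]
    return filename, lineno, name


def guarded_run():
    try:
        run_step()
    except Exception:
        tb = sys.exc_info()[2]
        # Print the full stack rather than only the handler's location.
        traceback.print_exc()
        # For interactive inspection at the failing frame, one could use:
        #   import pdb; pdb.post_mortem(tb)
        return innermost_frame(tb)
    return None


if __name__ == "__main__":
    print(guarded_run())
```

The last entry of the extracted traceback is the frame that raised, which is usually more informative than the line of the handler that caught the exception.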

Now I will close the issue. Feel free to leave comments at this thread.

nicholasmireles commented 5 years ago

I had the same error. It might be related to a CUDA version mismatch (at least in my case) as the pytorch installed via pip isn't compiled using the latest version. Were you able to solve this?
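A quick way to look for such a mismatch is to print what the installed PyTorch reports about itself and the GPU. This is only a sketch: the function name `describe_environment` is hypothetical, and attributes such as `torch.version.cuda` and `torch.cuda.get_device_capability` are assumed to exist in the installed release, so they are looked up defensively.

```python
def describe_environment():
    """Collect PyTorch / CUDA details that help diagnose 'invalid device function'."""
    info = {}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter.
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    # CUDA toolkit version the PyTorch binary was built against (assumed attribute).
    info["built_with_cuda"] = getattr(getattr(torch, "version", None), "cuda", None)
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
        # Compute capability as (major, minor), if this release exposes it.
        get_cc = getattr(torch.cuda, "get_device_capability", None)
        info["capability"] = get_cc(0) if get_cc is not None else None
    return info


if __name__ == "__main__":
    for key, value in sorted(describe_environment().items()):
        print("{}: {}".format(key, value))
```

Comparing `built_with_cuda` with the toolkit reported by `nvcc --version`, and `capability` with the architecture used to build the extensions under lib/, may narrow down where the mismatch is.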

LinXin2018 commented 5 years ago
> I had the same error. It might be related to a CUDA version mismatch (at least in my case) as the pytorch installed via pip isn't compiled using the latest version. Were you able to solve this?

Not yet.

nicholasmireles commented 5 years ago

So in my case, my graphics card has CUDA compute capability 7.0, and I'm assuming that PyTorch 0.3.x, which this project requires, isn't compatible with that level. If it helps to debug your situation, I moved to a computer with a Titan X (CUDA 8.0, capability 6.1) and it worked fine using the instructions in the README.
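The compatibility reasoning here can be sketched with the general CUDA rule: machine code built for sm_XY runs only on GPUs with the same major version and an equal or higher minor version, while embedded PTX can be JIT-compiled on any GPU of equal or higher capability. The function `can_run` below is a hypothetical illustration of that rule, not something PyTorch or this repository provides. It also ignores whether the installed toolkit knows the architecture at all (support for capability 7.0 arrived with CUDA 9.0, after CUDA 8.0).

```python
def can_run(device_cc, cubin_archs=(), ptx_archs=()):
    """Check whether kernels built for the given architectures can run on a device.

    device_cc   -- (major, minor) compute capability of the GPU
    cubin_archs -- capabilities for which native code (sm_XY) was compiled
    ptx_archs   -- capabilities for which PTX (compute_XY) was embedded
    """
    dev_major, dev_minor = device_cc
    for major, minor in cubin_archs:
        # Native code is compatible only within the same major version.
        if major == dev_major and minor <= dev_minor:
            return True
    for major, minor in ptx_archs:
        # PTX can be JIT-compiled for any device of equal or higher capability.
        if (major, minor) <= (dev_major, dev_minor):
            return True
    return False


if __name__ == "__main__":
    # Code built only for sm_61, loaded on a capability 7.0 card.
    print(can_run((7, 0), cubin_archs=[(6, 1)]))
    # The same build on a capability 6.1 card.
    print(can_run((6, 1), cubin_archs=[(6, 1)]))
```

Under this rule, a build that targets only a different major architecture and embeds no PTX has no usable kernel image on the device, which is the kind of situation in which "invalid device function" is typically reported.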

Linda-L commented 4 years ago

> Thank you for updating the code. However, unfortunately the CUDA runtime error (8) still remains! Maybe it is because PyTorch 0.3.1 + Python 2.7 does not fit the GTX 1080 Ti architecture (CUDA 8.0). I will try to run the code under PyTorch 0.4.x.

I meet the same problem with PyTorch 0.3.1 + Python 3.6.