zzzxxxttt / pytorch_simple_CornerNet

A simple pytorch implementation of CornerNet

Segmentation Fault #6

Closed Fraser-McLean closed 3 years ago

Fraser-McLean commented 3 years ago

Hi, great work. I am attempting to use the model on a different dataset and am running into a problem. Whenever I try to run training, I get a segmentation fault. It appears to happen when the line cnv_tl = self.cnvs_tl[ind](cnv) is run in the hourglass.py file. More specifically, the fault is raised in the forward function of the pooling layer when pool1 = self.pool1(self.p1_conv1(x)) is executed. I think the seg fault occurs in the bn part of the convolution.

This always occurs on the first epoch. I am using pytorch 1.1.0 and python 3.7. I am not using the DistributedDataParallel training stuff.

If you have any suggestions as to why this is happening it would be much appreciated.
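One way to confirm which Python line triggers a crash like this is the standard-library faulthandler module, which dumps the Python traceback when the process receives a fatal signal such as SIGSEGV. The following is only a minimal sketch: the helper name enable_crash_traceback and the log path are hypothetical and are not part of this repository.

```python
import faulthandler


def enable_crash_traceback(log_path):
    """Write the Python traceback to log_path if the process crashes.

    Hypothetical helper: call it once at the top of the training script,
    before the model is built.
    """
    log_file = open(log_path, "w")
    # all_threads=True dumps every thread, not only the one that crashed.
    faulthandler.enable(file=log_file, all_threads=True)
    # The caller must keep this file object alive for the handler to work.
    return log_file
```

After a crash, the log file should list the Python frames that were active, which helps distinguish a fault inside a compiled extension from one elsewhere.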

zzzxxxttt commented 3 years ago

Does the segmentation fault occur at the first step, or does it occur randomly after a few steps? And what is your input resolution? For hourglass it should be divisible by 128.
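The divisibility check mentioned above can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and the factor of 128 is taken from the comment above rather than read from the model code.

```python
def is_valid_hourglass_size(height, width, multiple=128):
    """Return True if both spatial dimensions are divisible by `multiple`."""
    return height % multiple == 0 and width % multiple == 0


def round_up_to_multiple(size, multiple=128):
    """Return the smallest multiple of `multiple` that is >= size."""
    return ((size + multiple - 1) // multiple) * multiple
```

Under this rule a 511 x 511 input would not pass the check, while 512 x 512 would.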

Fraser-McLean commented 3 years ago

It looks like it appears on the first step. The input to the hourglass model is [2,3,511,511] (I have batch size 2). I have tried changing the defaults so that it is [2,3,512,512], but this made no difference.

I had a look at the original CornerNet GitHub repository, and some people reported segmentation faults when their cpool did not compile correctly. I am using a cluster, and the only gcc available to me is 4.8.5, not the minimum 4.9.4 as stated here. Is this likely to be my problem? And is there a potential workaround if it is?
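A quick way to compare the installed compiler against that minimum is to parse the output of gcc -dumpversion. The sketch below assumes gcc is on the PATH; the helper names are hypothetical, and the 4.9.4 threshold is simply the figure quoted above.

```python
import subprocess

MIN_GCC = (4, 9, 4)  # minimum version quoted in the comment above


def parse_version(text):
    """Turn a dotted version string such as '4.8.5' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def gcc_meets_minimum(version_text, minimum=MIN_GCC):
    """Compare a version string against the minimum, component by component."""
    return parse_version(version_text) >= minimum


def installed_gcc_version():
    """Ask the gcc found on the PATH for its version string."""
    result = subprocess.run(
        ["gcc", "-dumpversion"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

For example, gcc_meets_minimum(installed_gcc_version()) reports whether the compiler that would build cpool satisfies the stated minimum.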

zzzxxxttt commented 3 years ago

One possible workaround is to remove corner pooling, although the performance will drop a bit. Alternatively, you can try CenterNet, which does not use corner pooling and has comparable performance.

Fraser-McLean commented 3 years ago

I managed to change my gcc version through conda and that seemed to fix this problem. Thanks for your help!

If anyone else has this problem, I ran:

conda install -c psi4 gcc-5
conda install -c anaconda libstdcxx-ng

You then have to make a change to the src files of cpool (e.g. add a new line) to ensure that they are recompiled with the new compiler.
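The "add a new line" step can be scripted if there are several source files. This is a rough sketch rather than part of the repository: the function name, the directory argument, and the list of file suffixes are all assumptions, so adjust them to wherever the cpool sources live in your checkout.

```python
from pathlib import Path


def append_newline_to_sources(src_dir, suffixes=(".cpp", ".cu", ".h")):
    """Append a newline to each matching source file in src_dir.

    Changing the file contents is what prompts the rebuild described above.
    Returns the names of the files that were modified.
    """
    changed = []
    for path in sorted(Path(src_dir).iterdir()):
        if path.is_file() and path.suffix in suffixes:
            with open(path, "a") as handle:
                handle.write("\n")
            changed.append(path.name)
    return changed
```

A trailing newline is harmless to C++ sources, so the edit does not need to be reverted afterwards.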