segfault when training or inferring

Jarrome commented 4 years ago

When I run inference.sh, I got

    gpu: 0
    output_dir: ./example_data/results
    num_samples: 64
    checkpoint: ./ckpt/checkpoint.ckpt
    base_scale: 2.0
    data_dir: ./example_data
    num_points: -1
    model: 3DFeatNet
    feature_dim: 32
    randomize_points: True
    use_keypoints_from: None
    data_dim: 6
    max_keypoints: 1024
    min_response_ratio: 0.01
    nms_radius: 0.5
    2019-12-04 14:52:33,039 [DEBUG] __main__ - In compute_descriptors()
    2019-12-04 14:52:33,039 [INFO] __main__ - Computed descriptors will be saved to ./example_data/results
    2019-12-04 14:52:33,039 [INFO] __main__ - Found 4 bin files in directory: ./example_data, each assumed to be of dim 6
    2019-12-04 14:52:33,039 [INFO] Feat3dNet - Model parameters: {'num_samples': 64, 'NoRegress': False, 'Attention': True, 'BaseScale': 2.0, 'feature_dim': 32, 'num_clusters': -1}
    Segmentation fault

With

Python 3.5.3
tensorflow-gpu 1.14.0
cuda 10.0

What might be the problem?

yewzijian commented 4 years ago

Hi Jarrome,

I’m not sure what might be causing the problem. I suspect it’s the custom TensorFlow ops, though. Were the ops compiled using the same CUDA version?

Zi Jian
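One way to check whether the crash really happens inside the custom ops is to turn on Python's standard `faulthandler` module before the model is built, so that a segfault leaves a Python traceback behind. The following is only a minimal sketch: the helper name `enable_crash_traceback` and the log file name are hypothetical and not part of the 3DFeatNet code.

```python
import faulthandler


def enable_crash_traceback(log_path="segfault_traceback.log"):
    """Ask Python to dump a traceback to log_path if the process crashes.

    The helper name and default path are illustrative assumptions.
    The returned file object must stay open for as long as the handler
    is enabled, so the caller should keep a reference to it.
    """
    log_file = open(log_path, "w")
    # all_threads=True also dumps the stacks of background threads.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Hypothetical usage at the very top of the inference script:
#   crash_log = enable_crash_traceback()
#   ... build the model and run inference as usual ...
```

If the traceback written to the log ends in the call that loads or runs one of the compiled ops, that would support the suspicion above; if it ends elsewhere, the problem is likely in a different place.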


Jarrome commented 4 years ago

Yes, with cuda-10.0. I will try inference.sh again on my laptop (probably cuda-8); perhaps it will not segfault there.

yewzijian commented 4 years ago

Ok, then that’s weird. I’m currently away and won’t have access to my computer for the next few days. I suggest trying an older version of Tensorflow. Perhaps something broke in the new version.

Will test the code out on TF1.14 when I get back next week.

Zi Jian


Jarrome commented 4 years ago

Hi Zi Jian,

My PC has Python 3.6.8, tf-gpu 1.11 and CUDA 9.0, and it works well there, except for the resource issue ;)

As for the previous setup with TF 1.14, I don't know, but the other versions are not compatible with CUDA 10.0.

yewzijian commented 4 years ago

Hi Jarrome,

Yes, you're right, the code segfaults when running the custom ops on Tensorflow 1.14 (but weirdly seems to run fine on TF1.15).

I have no idea how to fix this, sorry. My recommendation is to stick with an older version of Tensorflow.
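For anyone who wants to guard against this at startup, the reports in this thread can be written down as a small lookup. This is only a sketch: the table records nothing beyond what people reported here, and the names `REPORTED_STATUS` and `reported_status` are made up for illustration.

```python
# Outcomes reported in this thread, keyed by (major, minor) TensorFlow version.
# These are user reports on specific machines, not a tested support matrix.
REPORTED_STATUS = {
    (1, 11): "works",
    (1, 13): "works",
    (1, 14): "segfault",
    (1, 15): "seems to work",
}


def reported_status(version):
    """Return the status reported in this thread for a version string."""
    parts = version.split(".")
    key = (int(parts[0]), int(parts[1]))
    return REPORTED_STATUS.get(key, "unknown")


if __name__ == "__main__":
    try:
        import tensorflow as tf
    except ImportError:
        print("tensorflow is not installed in this environment")
    else:
        print(tf.__version__, "->", reported_status(tf.__version__))
```

A version that maps to "unknown" simply has not been reported on in this thread.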

Jarrome commented 4 years ago

Thank you, Zi Jian. I appreciate your help ;)

I changed to another system and it finally runs smoothly.

Here is the setting:

GeForce RTX 2080 Ti
Driver Version: 418.74
Cuda 10.0 (cuda 10.1 does not seem to be compatible with tf-gpu)
tensorflow-gpu 1.13.1

Then, for tensorflow.python.framework.errors_impl.NotFoundError, follow the issue in the original tf_op repo: uncomment -D_GLIBCXX_USE_CXX11_ABI=0
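
To see whether that flag is needed on a given machine, one option is to ask the installed TensorFlow which compile flags it expects, via `tf.sysconfig.get_compile_flags()`, and read the ABI setting from them. A rough sketch follows; the helper name `cxx11_abi_from_flags` is hypothetical, and the sample flag list in the comment is only an example of the expected shape.

```python
ABI_PREFIX = "-D_GLIBCXX_USE_CXX11_ABI="


def cxx11_abi_from_flags(compile_flags):
    """Return the ABI value (0 or 1) found in a list of compiler flags.

    Returns None when the flag is absent. Example input shape:
    ["-I/some/include/path", "-D_GLIBCXX_USE_CXX11_ABI=0"]
    """
    for flag in compile_flags:
        if flag.startswith(ABI_PREFIX):
            return int(flag[len(ABI_PREFIX):])
    return None


if __name__ == "__main__":
    try:
        import tensorflow as tf
    except ImportError:
        print("tensorflow is not installed in this environment")
    else:
        abi = cxx11_abi_from_flags(tf.sysconfig.get_compile_flags())
        print("TensorFlow was built with _GLIBCXX_USE_CXX11_ABI =", abi)
```

If this prints 0, the custom ops should be compiled with -D_GLIBCXX_USE_CXX11_ABI=0 as well; a mismatch between the two is a common reason for the ops library failing to load.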

yewzijian / 3DFeatNet

segfault when training or inferring #11