AliBuildsAI commented 7 years ago


When I run test_net.py, I encounter CUDA memory-related errors (e.g. segmentation fault, "CUDA error: an illegal memory access was encountered", etc.). The error message changes from run to run. Has anyone else run into similar problems?

kevinkit commented 7 years ago

What kind of GPU do you use? DA-RNN needs at least 6 GB of GPU memory, I think. However, it may also be related to third-party libraries that need to be installed correctly, see #2 / #10. Also, which versions of CUDA, cuDNN, TensorFlow and Ubuntu are you using?

AliBuildsAI commented 7 years ago

I ran the training code with no problem, so the dependencies are probably fine. I have a TITAN X and a GeForce GTX GPU. CUDA: 8.0.61, cuDNN: 5.1, Ubuntu: 16.04, TensorFlow: 1.2.1.

kevinkit commented 7 years ago

Do you give the device ID as an input parameter to your script?

Check the ID of your TITAN GPU with nvidia-smi and pass it to the script. I do not know which GeForce GTX GPU you have, but a TITAN should run just fine. (However, the test script has not worked yet, because of #9.)

By the way, the training scripts have never been an issue, while the test scripts seem to be the troublemaker.

AliBuildsAI commented 7 years ago

Yes, the device ID is 0. This is the command I ran: ./experiments/scripts/rgbd_scene_multi_rgbd_test.sh

And here are the contents of rgbd_scene_multi_rgbd_test.sh:

    #!/bin/bash

    set -x
    set -e

    export PYTHONUNBUFFERED="True"
    export CUDA_VISIBLE_DEVICES=$1

    export LD_PRELOAD=/usr/lib/libtcmalloc.so.4

    LOG="experiments/logs/rgbd_scene_multi_rgbd.txt.`date +'%Y-%m-%d_%H-%M-%S'`"
    exec &> >(tee -a "$LOG")
    echo Logging output to "$LOG"

    # train FCN for multiple frames
    time ./tools/train_net.py --gpu 0 \
      --network vgg16 \
      --weights data/imagenet_models/vgg16_convs.npy \
      --imdb rgbd_scene_train \
      --cfg experiments/cfgs/rgbd_scene_multi_rgbd.yml \
      --iters 40000

    if [ -f $PWD/output/rgbd_scene/rgbd_scene_val/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000/segmentations.pkl ]
    then
      rm $PWD/output/rgbd_scene/rgbd_scene_val/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000/segmentations.pkl
    fi

    # test FCN for multiple frames
    time ./tools/test_net.py --gpu 0 \
      --network vgg16 \
      --model output/rgbd_scene/rgbd_scene_train/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000.ckpt \
      --imdb rgbd_scene_val \
      --cfg experiments/cfgs/rgbd_scene_multi_rgbd.yml \
      --rig data/RGBDScene/camera.json --kfusion 1

kevinkit commented 7 years ago

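Have you tried passing the GPU ID as an argument, i.e. running ./experiments/scripts/rgbd_scene_multi_rgbd_test.sh 0 instead? The script sets CUDA_VISIBLE_DEVICES from its first argument ($1).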
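To illustrate why the argument can matter: when the script is called without one, `export CUDA_VISIBLE_DEVICES=$1` leaves the variable set but empty, which CUDA treats as "no GPUs visible" rather than "all GPUs visible". The following is only a minimal sketch of that distinction. The helper name `parse_visible_devices` is hypothetical (it is not part of DA-RNN), and it handles integer indices only, not GPU UUIDs.

```python
import os


def parse_visible_devices(value):
    """Interpret a CUDA_VISIBLE_DEVICES value (integer indices only).

    Returns None when the variable is unset (all GPUs stay visible),
    an empty list when it is set but empty (no GPUs visible),
    and the list of integer device indices otherwise.
    """
    if value is None:
        return None
    value = value.strip()
    if not value:
        return []
    return [int(part) for part in value.split(",") if part.strip()]


if __name__ == "__main__":
    ids = parse_visible_devices(os.environ.get("CUDA_VISIBLE_DEVICES"))
    if ids is None:
        print("CUDA_VISIBLE_DEVICES is unset: all GPUs are visible")
    elif not ids:
        print("CUDA_VISIBLE_DEVICES is empty: no GPU is visible")
    else:
        print("Visible GPU indices: %s" % ids)
```

Running this from the same shell that launches the test script is a quick way to see what the script would inherit.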

kevinkit commented 7 years ago

Maybe try running it with sudo.

yuxng commented 7 years ago

The testing code calls the C++ KinectFusion library from Python. This step is not stable. I also encountered crashes, due to some malloc issue inside Python. You can debug by running the following command and then backtracing to see the problem:

    gdb --args python ./tools/test_net.py --gpu 0 --network vgg16 --model output/rgbd_scene/rgbd_scene_train/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000.ckpt --imdb rgbd_scene_val --cfg experiments/cfgs/rgbd_scene_multi_rgbd.yml --rig data/RGBDScene/camera.json
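If typing `run` and `bt` by hand gets tedious, gdb can also do both non-interactively through its `-batch` and `-ex` options. The sketch below only assembles such a command line; the helper name `build_gdb_command` is made up for illustration, and the arguments to pass are whatever you would normally give to test_net.py.

```python
import subprocess
import sys


def build_gdb_command(program_args):
    """Build a gdb command that runs a program and prints a backtrace.

    -batch makes gdb exit after processing its commands, and each -ex
    supplies one gdb command: "run" starts the program, "bt" prints the
    backtrace once the program has stopped (for example on SIGSEGV).
    """
    return ["gdb", "-batch", "-ex", "run", "-ex", "bt", "--args"] + list(program_args)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python gdb_bt.py python ./tools/test_net.py --gpu 0 ...
    cmd = build_gdb_command(sys.argv[1:])
    print(" ".join(cmd))
    sys.exit(subprocess.call(cmd))
```

If the program exits normally, gdb has no stack to show, so a backtrace only appears when the crash actually reproduces.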

AliBuildsAI commented 7 years ago

I ran this and there was no problem, but when I added --kfusion 1 at the end, I encountered this error:

    [New Thread 0x7ffe65ffb700 (LWP 8553)]
    [New Thread 0x7ffe667fc700 (LWP 8554)]
    [New Thread 0x7ffe67fff700 (LWP 8555)]
    [New Thread 0x7ffe677fe700 (LWP 8556)]
    [New Thread 0x7ffe66ffd700 (LWP 8557)]
    [New Thread 0x7ffe5e22a700 (LWP 8558)]
    [New Thread 0x7ffe5da29700 (LWP 8559)]
    [New Thread 0x7ffe5d228700 (LWP 8560)]
    [New Thread 0x7ffe5ca27700 (LWP 8561)]
    [New Thread 0x7ffe4ffff700 (LWP 8562)]
    [New Thread 0x7ffe4f7fe700 (LWP 8563)]
    [New Thread 0x7ffe4effd700 (LWP 8564)]
    [New Thread 0x7ffe4e7fc700 (LWP 8565)]
    [New Thread 0x7ffe4dffb700 (LWP 8566)]
    [New Thread 0x7ffe4d7fa700 (LWP 8567)]

    Thread 1 "python" received signal SIGSEGV, Segmentation fault.
    __memmove_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:245
    245 ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: No such file or directory.
    (gdb) quit
    A debugging session is active.

    Inferior 1 [process 7937] will be killed.

AliBuildsAI commented 7 years ago

@kevinkit the same happens when I add 0 at the end of the command.

When I ran it with sudo, I got this error:

    + set -e
    + export PYTHONUNBUFFERED=True
    + PYTHONUNBUFFERED=True
    + export CUDA_VISIBLE_DEVICES=0
    + CUDA_VISIBLE_DEVICES=0
    ++ date +%Y-%m-%d_%H-%M-%S
    + LOG=experiments/logs/rgbd_scene_multi_rgbd_test.txt.2017-08-15_17-08-27
    + exec
    ++ tee -a experiments/logs/rgbd_scene_multi_rgbd_test.txt.2017-08-15_17-08-27
    + echo Logging output to experiments/logs/rgbd_scene_multi_rgbd_test.txt.2017-08-15_17-08-27
    Logging output to experiments/logs/rgbd_scene_multi_rgbd_test.txt.2017-08-15_17-08-27
    + '[' -f /home/aliman/DA-RNN-master/output/rgbd_scene/rgbd_scene_val/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000/segmentations.pkl ']'
    + ./tools/test_net.py --gpu 0 --network vgg16 --model data/fcn_models/rgbd_scene/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000.ckpt --imdb rgbd_scene_val --cfg experiments/cfgs/rgbd_scene_multi_rgbd.yml --rig data/RGBDScene/camera.json --kfusion 1
    Traceback (most recent call last):
      File "./tools/test_net.py", line 13, in <module>
        from fcn.test import test_net
      File "/home/aliman/DA-RNN-master/tools/../lib/fcn/test.py", line 25, in <module>
        from kinect_fusion import kfusion
    ImportError: libkfusion.so: cannot open shared object file: No such file or directory

(But I have libkfusion.so in DA-RNN/lib/kinect_fusion/build directory)

doomxhc commented 7 years ago

Have you solved the problem? I ran into the same situation and I don't know how to work it out.

kevinkit commented 7 years ago

As @yuxng mentioned above, you can try to backtrace the problem with gdb, using the command from that comment:

"You can debug by running "gdb --args python ./tools/test_net.py --gpu 0 --network vgg16 --model output/rgbd_scene/rgbd_scene_train/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000.ckpt --imdb rgbd_scene_val --cfg experiments/cfgs/rgbd_scene_multi_rgbd.yml --rig data/RGBDScene/camera.json", and backtrace to see the problem."

lizhihuit commented 6 years ago

@AliManUtd1993, did you compile DA-RNN successfully? I always get an error in kinect_fusion.

AliBuildsAI commented 6 years ago

I compiled every part except kinect_fusion.

baolinv0 commented 6 years ago

@AliManUtd1993, have you managed to compile DA-RNN successfully by now? When I run test_kinect_fusion.sh, it always shows

ImportError: libkfusion.so: cannot open shared object file: No such file or directory

But libkfusion.so is in lib/kinect_fusion/build, and the other parts run successfully.

AliBuildsAI commented 6 years ago

No, I did not try anymore.

baolinv0 commented 6 years ago

Thank you for your quick reply.

baolinv0 commented 6 years ago

@yuxng @kevinkit I hit the same problem, and I found that the error happens at kinect_fusion.cpp => create_tensors() => initMarchingCubesTables();

I then ran the suggested command, "gdb --args python ./tools/test_net.py --gpu 0 --network vgg16 --model output/rgbd_scene/rgbd_scene_train/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000.ckpt --imdb rgbd_scene_val --cfg experiments/cfgs/rgbd_scene_multi_rgbd.yml --rig data/RGBDScene/camera.json", and did a backtrace.

It shows:

    #6  0x00007ffff7814f45 in __libc_start_main (main=0x466e50 , argc=14, argv=0x7fffffffdc98, init=, fini=, rtld_fini=, stack_end=0x7fffffffdc88) at libc-start.c:287
    #7  0x0000000000577c2e in _start ()

Ramay7 commented 6 years ago

Hi, @beginnerFighting

Have you solved the problem "ImportError: libkfusion.so: cannot open shared object file: No such file or directory" ?

Thanks for your reply!

Wei2624 commented 6 years ago

Hi, @Ramay7, I got the same error as you. Have you solved the issue, or do you have any suggestions? Thanks for your help!

Ramay7 commented 6 years ago

Hi, @Wei2624. I gave up on this project without finding a solution, sorry.... :(

gaochuan2017 commented 5 years ago

Hi, @beginnerFighting

Have you solved the problem "ImportError: libkfusion.so: cannot open shared object file: No such file or directory" ?

Thanks for your reply!

I think you forgot this step: add the KinectFusion library path

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ROOT/lib/kinect_fusion/build

The export only lasts for the current shell session, so this step must be executed again every time I start the computer; otherwise you will get that error.
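A quick way to check whether the export took effect is to look for libkfusion.so in the directories listed in LD_LIBRARY_PATH before importing kinect_fusion. This is only a rough sketch: `find_shared_lib` is a hypothetical helper, not part of DA-RNN, and it ignores the other places the dynamic loader searches (such as the ldconfig cache). Note that the loader reads LD_LIBRARY_PATH when the process starts, so the variable has to be exported before Python is launched. sudo commonly resets the environment, which may be why the import failed under sudo earlier in this thread.

```python
import os


def find_shared_lib(lib_name, search_path):
    """Return the directories in a colon-separated path that contain lib_name.

    Empty entries are skipped here for simplicity, although the dynamic
    loader itself treats an empty entry as the current directory.
    """
    hits = []
    for directory in search_path.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, lib_name)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = find_shared_lib("libkfusion.so", path)
    if hits:
        print("libkfusion.so found in: " + ", ".join(hits))
    else:
        print("libkfusion.so is not in any LD_LIBRARY_PATH directory")
```

To make the setting persistent, the export line can be appended to a shell startup file such as ~/.bashrc, with $ROOT replaced by the actual DA-RNN directory.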

gaochuan2017 commented 5 years ago

The testing code calls the c++ KinectFusion library in Python. This step is not stable. I also encountered crashes, due to some malloc issue inside python. You can debug by running "gdb --args python ./tools/test_net.py --gpu 0 --network vgg16 --model output/rgbd_scene/rgbd_scene_train/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000.ckpt --imdb rgbd_scene_val --cfg experiments/cfgs/rgbd_scene_multi_rgbd.yml --rig data/RGBDScene/camera.json", and backtrace to see the problem.

@yuxng I would like to know how you addressed the malloc issue you mentioned. It seems that I hit the same error as you. I tested the trained model with this command: sudo gdb --args python ./tools/test_net.py --gpu 0 --network vgg16 --model data/fcn_models/rgbd_scene/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000.ckpt --imdb rgbd_scene_val --cfg experiments/cfgs/rgbd_scene_multi_rgbd.yml --rig data/RGBDScene/camera.json --kfusion 1

and got this error in gdb:

    (gdb) bt
    #0  malloc_consolidate (av=av@entry=0x7ffff7bb4b20 ) at malloc.c:4181
    #1  0x00007ffff7871cde in _int_malloc (av=av@entry=0x7ffff7bb4b20 , bytes=bytes@entry=1024) at malloc.c:3450
    #2  0x00007ffff7874184 in __GI___libc_malloc (bytes=1024) at malloc.c:2913
    #3  0x00007fff973b7685 in __pyx_insert_code_object (code_object=0x7fff7e7c28b0, code_line=1390) at kinect_fusion/kfusion.cpp:6647
    #4  Pyx_AddTraceback (funcname=funcname@entry=0x7fff973c34c0 "kinect_fusion.kfusion.PyKinectFusion.cinit__", c_line=c_line@entry=1390, py_line=py_line@entry=32, filename=filename@entry=0x7fff973c2362 "kinect_fusion/kfusion.pyx") at kinect_fusion/kfusion.cpp:6750
    #5  0x00007fff973b9931 in pyx_pf_13kinect_fusion_7kfusion14PyKinectFusioncinit (pyx_v_self=0x7fff9d997c48, __pyx_v_rig_file="") at kinect_fusion/kfusion.cpp:1406
    #6  pyx_pw_13kinect_fusion_7kfusion_14PyKinectFusion_1cinit (pyx_kwds=, __pyx_args=, __pyx_v_self=0x7fff9d997c48) at kinect_fusion/kfusion.cpp:1363
    #7  __pyx_tp_new_13kinect_fusion_7kfusion_PyKinectFusion (t=, a=, k=) at kinect_fusion/kfusion.cpp:5068
    #8  0x00000000004aaa15 in ?? ()
    #9  0x00000000004c166d in PyEval_EvalFrameEx ()
    #10 0x00000000004c141f in PyEval_EvalFrameEx ()
    #11 0x00000000004b9b66 in PyEval_EvalCodeEx ()
    #12 0x00000000004eb69f in ?? ()
    #13 0x00000000004e58f2 in PyRun_FileExFlags ()
    #14 0x00000000004e41a6 in PyRun_SimpleFileExFlags ()
    #15 0x00000000004938ce in Py_Main ()
    #16 0x00007ffff7810830 in __libc_start_main (main=0x493370 , argc=16, argv=0x7fffffffe418, init=, fini=, rtld_fini=, stack_end=0x7fffffffe408) at ../csu/libc-start.c:291
    #17 0x0000000000493299 in _start ()

yuxng / DA-RNN

Error on running test code #11
