yjxiong / temporal-segment-networks

Code & Models for Temporal Segment Networks (TSN) in ECCV 2016
BSD 2-Clause "Simplified" License

test error with multi gpu #249

Closed liuxiao214 closed 5 years ago

liuxiao214 commented 5 years ago

Hi Mr. Xiong, I have a problem testing with multiple GPUs. I am using my own Caffe version. To test on UCF101, I changed the code in tools/eval_net.py like this:

caffe.init_gpu_scope([device_id])
caffe.set_device(device_id)
caffe.set_mode_gpu()

Testing on one GPU works without errors, but when I test with multiple GPUs I get these errors:

F1229 13:51:41.069020 9588 gpu_memory.cpp:143] Check failed: error == cudaSuccess (3 vs. 0) initialization error
*** Check failure stack trace: ***
[172-16-30-14:09588] *** Process received signal ***
[172-16-30-14:09588] Signal: Aborted (6)
[172-16-30-14:09588] Signal code: (-6)
[172-16-30-14:09588] [ 0] /usr/lib64/libpthread.so.0(+0xf5e0)[0x7f5ded3965e0]
[172-16-30-14:09588] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x7f5dec8f01f7]
[172-16-30-14:09588] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x7f5dec8f18e8]
[172-16-30-14:09588] [ 3] /usr/lib64/libglog.so.0(+0xa7d9)[0x7f5d832c47d9]
[172-16-30-14:09588] [ 4] /usr/lib64/libglog.so.0(+0xbe6d)[0x7f5d832c5e6d]
[172-16-30-14:09588] [ 5] /usr/lib64/libglog.so.0(_ZN6google10LogMessage9SendToLogEv+0x24d)[0x7f5d832c7ced]
[172-16-30-14:09588] [ 6] /usr/lib64/libglog.so.0(_ZN6google10LogMessage5FlushEv+0x9c)[0x7f5d832c5a5c]
[172-16-30-14:09588] [ 7] /usr/lib64/libglog.so.0(_ZN6google15LogMessageFatalD2Ev+0xe)[0x7f5d832c863e]
[172-16-30-14:09588] [ 8] /home/caffe/python/caffe/../../build/lib/libcaffe.so.1.0.0(_ZN5caffe9GPUMemory7Manager15update_dev_infoEi+0x19a)[0x7f5cf47514da]

I printed my_id (my_id = multiprocessing.current_process()._identity[0]) and it goes 0, 1, 2, ... up to 200+, but I only have 8 GPUs. I have read issues 198 and 160 but still have no idea what is wrong. Could you tell me? Thanks very much.
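For illustration only, here is a minimal sketch of one way to keep the device index within range when the worker id keeps growing. It is not the fix from this thread, which is not recorded here, and it may not address the cudaSuccess initialization error itself. The helper names current_worker_id and pick_gpu are hypothetical, and the sketch assumes worker ids are 1-based integers that can exceed the number of GPUs.

```python
import multiprocessing


def current_worker_id():
    """Return the first element of the worker's identity, or 1 in the main process.

    _identity is a private multiprocessing attribute (the same one used in the
    post above); it is an empty tuple in the main process.
    """
    identity = multiprocessing.current_process()._identity
    return identity[0] if identity else 1


def pick_gpu(worker_id, gpu_list):
    """Map a (possibly very large) 1-based worker id onto an available GPU id.

    Hypothetical helper: wraps the index with modulo so that worker 9 on an
    8-GPU machine reuses the first GPU instead of indexing out of range.
    """
    return gpu_list[(worker_id - 1) % len(gpu_list)]


if __name__ == "__main__":
    gpus = list(range(8))  # assumption: 8 GPUs with ids 0..7
    print(pick_gpu(current_worker_id(), gpus))
```

With this mapping, a worker id of 201 on 8 GPUs resolves to the first GPU in the list rather than to a device that does not exist.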

liuxiao214 commented 5 years ago

I have solved it.