weiliu89 / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

runtest gradient calculation tests fail, and training has adverse problems on P100 gpus #710

Open ghostcow opened 7 years ago

ghostcow commented 7 years ago

Issue summary

Many gradient calculation tests fail with severe errors when running 'make -j runtest'. Training is also affected: every 10 iterations take close to 1 hour, and the loss explodes to NaN after 10 iterations.

This happens on a machine with P100 GPUs, but not on a machine with Titan X GPUs.
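
For context, gradient tests of this kind compare a layer's analytic gradient against a finite-difference estimate and fail when the two disagree beyond a tolerance. The following is a minimal sketch of that idea only, not Caffe's actual GradientChecker; the function names, step size, and threshold are illustrative assumptions.

```python
def numerical_gradient(f, x, step=1e-4):
    """Estimate the gradient of scalar function f at point x by central differences."""
    grad = []
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += step
        minus[i] -= step
        grad.append((f(plus) - f(minus)) / (2.0 * step))
    return grad


def check_gradient(f, analytic_grad, x, step=1e-4, threshold=1e-3):
    """Return True if the analytic gradient matches the numerical estimate.

    The tolerance is scaled by the larger magnitude of the two values
    (at least 1.0), so small and large gradients are treated alike.
    The step and threshold defaults are illustrative, not Caffe's values.
    """
    numeric = numerical_gradient(f, x, step)
    analytic = analytic_grad(x)
    for a, n in zip(analytic, numeric):
        scale = max(abs(a), abs(n), 1.0)
        if abs(a - n) > threshold * scale:
            return False
    return True


if __name__ == "__main__":
    def loss(x):
        return sum(v * v for v in x)          # f(x) = sum of squares

    def correct(x):
        return [2.0 * v for v in x]           # true gradient of f

    def broken(x):
        return [3.0 * v for v in x]           # deliberately wrong gradient

    print(check_gradient(loss, correct, [1.0, -2.0, 0.5]))
    print(check_gradient(loss, broken, [1.0, -2.0, 0.5]))
```

A failure of such a check on one GPU but not another points at the computed values rather than at the test itself, which is consistent with the loss going to NaN during training.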

Steps to reproduce

Follow the tutorial (except use cmake to compile Caffe). After compilation, run 'make test && make -j runtest' from the $CAFFE_ROOT/build directory.

Your system configuration

Operating system: Ubuntu 16.04
Compiler:
CUDA version (if applicable): 8.0
CUDNN version (if applicable): 6.0 (also 5.1)
BLAS: OpenBLAS
Python or MATLAB version (for pycaffe and matcaffe respectively): 2.7

ghostcow commented 7 years ago

I just noticed that the Pascal architecture isn't really supported in that Caffe release. I'm trying to merge the ssd branch with master, but the tests are currently failing.
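
As a rough illustration of what building for Pascal involves: the P100 has compute capability 6.0, so the CUDA code needs to be compiled for compute_60/sm_60. The sketch below only builds the flag strings. The helper names are hypothetical, and the CMake variables (CUDA_ARCH_NAME, CUDA_ARCH_BIN, CUDA_ARCH_PTX) are assumed to be supported by the cmake/Cuda.cmake in the checkout, as they are in upstream Caffe; this has not been verified against the ssd branch.

```python
def _arch_number(capability):
    """Turn a compute capability such as "6.0" into the digits nvcc uses ("60")."""
    return capability.replace(".", "")


def gencode_flags(capabilities):
    """Build nvcc -gencode flags, one per compute capability."""
    flags = []
    for cap in capabilities:
        arch = _arch_number(cap)
        flags.append("-gencode arch=compute_{0},code=sm_{0}".format(arch))
    return flags


def cmake_arch_args(capabilities):
    """Build CMake cache arguments selecting the GPU architectures manually.

    Assumption: the build honours CUDA_ARCH_NAME=Manual together with
    CUDA_ARCH_BIN and CUDA_ARCH_PTX, as upstream Caffe's Cuda.cmake does.
    PTX is emitted for the newest listed architecture only.
    """
    if not capabilities:
        raise ValueError("at least one compute capability is required")
    archs = [_arch_number(cap) for cap in capabilities]
    return [
        "-DCUDA_ARCH_NAME=Manual",
        "-DCUDA_ARCH_BIN={}".format(" ".join(archs)),
        "-DCUDA_ARCH_PTX={}".format(archs[-1]),
    ]


if __name__ == "__main__":
    # P100 is compute capability 6.0.
    print(" ".join(gencode_flags(["6.0"])))
    print(" ".join(cmake_arch_args(["6.0"])))
```

The printed CMake arguments would be passed to cmake when configuring the build directory; a value containing spaces needs quoting in the shell.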

EDIT: will update once done