runtest gradient calculation tests fail, and training is extremely slow and diverges on P100 GPUs

weiliu89 / caffe

Caffe: a fast open framework for deep learning.


Issue summary

Many gradient calculation tests fail with severe errors when running 'make -j runtest'. During training, every 10 iterations take close to 1 hour, and the loss explodes to NaN after 10 iterations.

This happens on a machine with P100 GPUs, but not on a machine with Titan X GPUs.
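For readers unfamiliar with what a gradient calculation test verifies, here is a minimal sketch of the general technique: compare an analytic gradient against a central finite-difference estimate and flag coordinates that disagree. This is not Caffe's actual test code (Caffe's tests are written in C++). The function names, the toy loss, and the tolerance values are all assumptions made for illustration.

```python
import math


def numeric_gradient(f, x, eps=1e-4):
    """Central-difference estimate of df/dx_i for each coordinate of x."""
    grad = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2.0 * eps))
    return grad


def check_gradient(f, analytic_grad, x, eps=1e-4, threshold=1e-3):
    """Return (index, analytic, numeric) for every coordinate that disagrees.

    The threshold is relative to the larger magnitude (floored at 1.0).
    NaN values always count as failures because the comparison is False.
    """
    numeric = numeric_gradient(f, x, eps)
    analytic = analytic_grad(x)
    failures = []
    for i, (a, n) in enumerate(zip(analytic, numeric)):
        scale = max(abs(a), abs(n), 1.0)
        if not (abs(a - n) <= threshold * scale):
            failures.append((i, a, n))
    return failures


# Hypothetical toy loss: f(x) = sum(tanh(x_i)^2).
def loss(x):
    return sum(math.tanh(v) ** 2 for v in x)


# Its exact gradient: 2 * tanh(x_i) * (1 - tanh(x_i)^2).
def loss_grad(x):
    return [2.0 * math.tanh(v) * (1.0 - math.tanh(v) ** 2) for v in x]


# A deliberately broken gradient that returns NaN in one coordinate,
# mimicking the kind of result a faulty backward pass might produce.
def broken_grad(x):
    g = loss_grad(x)
    g[1] = float("nan")
    return g


if __name__ == "__main__":
    point = [0.5, -1.2, 2.0]
    print("correct gradient failures:", check_gradient(loss, loss_grad, point))
    print("broken gradient failures:", check_gradient(loss, broken_grad, point))
```

A check of this kind passing on one GPU model and failing on another suggests that the backward computation itself produces different numbers on the two devices, rather than a problem in the training configuration.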

Steps to reproduce

Follow the tutorial (except use CMake to compile Caffe). After compilation, run 'make test && make -j runtest' from the $CAFFE_ROOT/build directory.

Your system configuration

Operating system: Ubuntu 16.04
Compiler:
CUDA version (if applicable): 8.0
CUDNN version (if applicable): 6.0 (also 5.1)
BLAS: OpenBLAS
Python or MATLAB version (for pycaffe and matcaffe respectively): 2.7
