
Light-Head R-CNN

Stuck when running "python3 train.py" #57

Open TonyTangYu opened 6 years ago

TonyTangYu commented 6 years ago

I finished installing Light-Head R-CNN and preparing the data. When I run 'python3 train.py -d 0', the terminal output is:

```
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-0
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-1
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-2
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-3
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-4
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-5
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-6
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-7
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-8
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-9
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-10
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-11
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-12
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-13
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-14
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-15
2018-11-18 19:57:23.729985: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-11-18 19:57:23.994393: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-18 19:57:23.994853: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
totalMemory: 10.91GiB freeMemory: 9.63GiB
2018-11-18 19:57:23.994897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
WARNING:tensorflow:From /home/tangyu/softwares/anaconda3/envs/tensorflow-gpu=1.5/lib/python3.6/site-packages/tensorflow/python/util/tf_should_use.py:118: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use tf.global_variables_initializer instead.
/home/tangyu/softwares/anaconda3/envs/tensorflow-gpu=1.5/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:97: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
INFO:tensorflow:Restoring parameters from /home/tangyu/Desktop/light_head_rcnn/data/imagenet_weights/res101.ckpt
```

It gets stuck here. I don't know what is wrong. How can I fix it?

karansomaiah commented 6 years ago

How long does it stay stuck? Do you see a progress bar start to load?

TonyTangYu commented 6 years ago

@karansomaiah In fact, it stays stuck for a long time and I never see the progress bar. I had tried several times before. Today I checked which part of the code is the problem, and I found that the code in /lib/utils/dpflow/prefetching_iter.py might be the cause, because lines 64 and 65 call a function named wait. I commented out those lines and the training process is running now, but I wonder whether that will hurt the results.
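A less drastic variant of the same workaround would be to give the wait a timeout, so the hang turns into a visible error instead of silence. This is only a standalone sketch of that pattern, not the repo's actual prefetching_iter.py code, and the names in it are made up; it just assumes the objects being waited on behave like threading.Event:

```python
# Standalone sketch of "wait with a timeout instead of forever".
# Hypothetical names; only the threading.Event.wait(timeout) pattern matters.
import threading

data_ready = threading.Event()

def producer():
    # In the real code this would be the dpflow provider filling the next
    # batch and then calling data_ready.set(); here it never signals,
    # to simulate the hang described above.
    pass

threading.Thread(target=producer, daemon=True).start()

# Event.wait(timeout) returns False if the event was never set in time.
if not data_ready.wait(timeout=10):
    raise RuntimeError('Data provider produced nothing within 10 s - '
                       'check that the dpflow pipes are really feeding data.')
```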

karansomaiah commented 6 years ago

@TonyTangYu That makes total sense. The interesting thing is that it works fine on my local computer, but on the cluster I face the same problem as you. Will keep you updated.

Update: it isn't giving me any errors. I had some issues with reading in the training data on the cluster, but I resolved them and didn't have to comment out the respective lines.

TonyTangYu commented 6 years ago

@karansomaiah Thanks for your response and the update. Could you please tell me what your issues were and how you fixed them? Thank you! Perhaps I will hit the same error; maybe my issue is also with loading the data.

karansomaiah commented 6 years ago

@TonyTangYu Definitely! I was trying to read the data directly from an S3 bucket by changing train_root_folder to point at the bucket, but that made the run get stuck. After getting the data onto local disk, it started to run. Hope this helps.
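In case it helps anyone later, the copy from S3 can be done with a small boto3 script along these lines (just a sketch; the bucket name, prefix, and local path are placeholders, and nothing in the repo itself is S3-specific):

```python
# Sketch: pull the training images from S3 to local disk, then point
# train_root_folder in config.py at local_root.
import os
import boto3

bucket = 'my-training-bucket'   # placeholder
prefix = 'MSCOCO/'              # placeholder
local_root = '/data/MSCOCO'     # what train_root_folder should point to

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get('Contents', []):
        key = obj['Key']
        if key.endswith('/'):   # skip folder placeholder objects
            continue
        dest = os.path.join(local_root, os.path.relpath(key, prefix))
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        s3.download_file(bucket, key, dest)
```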

TonyTangYu commented 6 years ago

@karansomaiah Thank you very much! You helped me a lot!

lji72 commented 5 years ago

@karansomaiah My run also stops at the end of the startup log:

```
Use tf.global_variables_initializer instead.
/home/liuji/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:97: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
INFO:tensorflow:Restoring parameters from /home/liuji/light_head_rcnn/data/imagenet_weights/res101.ckpt
```

```
^CTraceback (most recent call last):
  File "train.py", line 264, in <module>
    train(args)
  File "train.py", line 186, in train
    blobs_list = prefetch_data_layer.forward()
  File "/home/liuji/light_head_rcnn/lib/utils/dpflow/prefetching_iter.py", line 78, in forward
    if self.iter_next():
  File "/home/liuji/light_head_rcnn/lib/utils/dpflow/prefetching_iter.py", line 65, in iter_next
    e.wait()
  File "/home/liuji/anaconda3/envs/tensorflow/lib/python3.6/threading.py", line 551, in wait
    signaled = self._cond.wait(timeout)
  File "/home/liuji/anaconda3/envs/tensorflow/lib/python3.6/threading.py", line 295, in wait
    waiter.acquire()
```

Hello, I also hit the same problem. Could you give a more detailed solution? Thanks!

chanajianyu commented 5 years ago

> @karansomaiah Thank you very much! You helped me a lot!

So, can you give details of the solution?

chanajianyu commented 5 years ago

> I finished installing light head r-cnn and preparing the data. When I run 'python3 train.py -d 0', I get the same startup log as in the original post above and the run stops after restoring res101.ckpt. It gets stuck here and I don't know what is wrong. How can I fix it?

I met the same problem. Can you show a more detailed solution?

TonyTangYu commented 5 years ago

@chanajianyu I don't know whether I can help you. At that time I found that the code in /lib/utils/dpflow/prefetching_iter.py might cause this problem, because lines 64 and 65 call a function named wait, and I commented out those lines. Maybe you can print something and track where it gets stuck.
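One easy way to track where it gets stuck without touching the library code is faulthandler, which dumps every thread's stack after a delay (my own suggestion, not something from the repo). For example, near the top of train.py:

```python
import faulthandler

# If training makes no progress, print all thread tracebacks to stderr every
# 120 seconds; the dump shows exactly which wait() the process is stuck in.
faulthandler.dump_traceback_later(120, repeat=True)
```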

chanajianyu commented 5 years ago

> @chanajianyu I don't know whether I can help you. At that time I found that the code in /lib/utils/dpflow/prefetching_iter.py might cause this problem, because lines 64 and 65 call a function named wait, and I commented out those lines. Maybe you can print something and track where it gets stuck.

However, when I follow your advice a new problem occurs: the data layer raises StopIteration from this function:

```python
def forward(self):
    """Get blobs and copy them into this layer's top blob vector."""
    if self.iter_next():
        return self.current_batch
    else:
        raise StopIteration
```

mbruchalski1 commented 5 years ago

Having the same problem: it gets stuck forever at

```
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
INFO:tensorflow:Restoring parameters from ~/light_head_rcnn/data/imagenet_weights/res101.ckpt
```

The problem is with the threading function, as noticed by many others: light_head_rcnn/lib/utils/dpflow/prefetching_iter.py, lines 83 and 69.

Did anyone resolve this? Commenting out the e.wait() does not work at all, so that is not the right solution.

Using TF 1.5.0 with a single GPU (GTX 1070) in an nvidia-docker container.

YellowKyu commented 5 years ago

Hey guys, I ran into the same problem using custom data and I think I know what the problem is.

First, check the 'fpath' field in your .odgt files: if your config.py sets the repo as root_dir, then 'fpath' must not contain the whole (absolute) path to the picture. When I generated the .odgt files I didn't pay attention to config.py or to 'fpath', so the script was trying to load images from the wrong path.
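A quick way to check this is to join each 'fpath' with train_root_folder the same way dataset.py does and verify the file exists. Just my own sketch; it assumes each line of the .odgt file is a JSON record with an 'fpath' key, and the two paths at the top are placeholders:

```python
import json
import os

train_root_folder = '/path/to/light_head_rcnn/data/MSCOCO'  # value used in your config.py
odgt_file = '/path/to/your/train.odgt'                      # your annotation file

missing = 0
with open(odgt_file) as f:
    for line in f:
        record = json.loads(line)  # one JSON record per line
        img_path = os.path.join(train_root_folder, record['fpath'])
        if not os.path.exists(img_path):
            missing += 1
            print('missing:', img_path)
print('missing images:', missing)
```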

Hope it helps!

FeiWard commented 5 years ago

> Hey guys, I ran into the same problem using custom data and I think I know what the problem is.
>
> First, check the 'fpath' field in your .odgt files: if your config.py sets the repo as root_dir, then 'fpath' must not contain the whole (absolute) path to the picture. When I generated the .odgt files I didn't pay attention to config.py or to 'fpath', so the script was trying to load images from the wrong path.
>
> Hope it helps!

Perfect answer!!! I solved this problem by checking config.py: in my case 'train_root_folder' did not have a "/" at the end, so the program could not find the images. Thank you!!!!

masotrix commented 5 years ago

Indeed, the image path given in "fpath" is joined onto the root in light_head_rcnn/experiments/lizming/lighthead[...]/dataset.py this way:

os.path.join(train_root_folder, record['fpath'])

where train_root_folder is specified in light_head_rcnn/experiments/lizming/lighthead[...]/config.py:

train_root_folder = os.path.join(root_dir, 'data/MSCOCO')

Finally, root_dir is also defined in config.py as:

root_dir = osp.abspath(osp.join(osp.dirname(__file__), '..', '..', '..'))

which by default gives the root of the repository (with 'light_head_rcnn' as the last folder in the path).
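One detail worth noting here (my own illustration, not code from the repo): os.path.join silently discards the first argument when the second one is already absolute, which is exactly why an absolute 'fpath' in the .odgt file breaks the lookup described above.

```python
import os

root = '/repo/light_head_rcnn/data/MSCOCO'

# Relative fpath: joined under the root as expected.
print(os.path.join(root, 'train2014/img_001.jpg'))
# /repo/light_head_rcnn/data/MSCOCO/train2014/img_001.jpg

# Absolute fpath: the root is dropped entirely.
print(os.path.join(root, '/home/me/images/img_001.jpg'))
# /home/me/images/img_001.jpg
```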