TonyTangYu opened this issue 6 years ago (status: Open)
How long is it stuck there for? Do you see a percentage bar load?
@karansomaiah In fact, it has been stuck for a long time. I didn't see the percentage bar. I have tried several times before. Today I checked which part of the code was wrong. I found that the code in /lib/utils/dpflow/prefetching_iter.py might cause this problem, because lines 64 and 65 call a function named wait. I commented out that code and the training process is running now. But I wonder whether that will lead to a bad result.
@TonyTangYu That makes total sense. The interesting thing is, it works fine on my local computer but on the cluster, I face the same problem as you. Will keep you updated.
Update: It isn't giving me any errors. I had some issues with reading in the training data on the cluster, but I resolved them. I didn't have to comment out the respective lines.
@karansomaiah Thanks for your response and the update. Could you please tell me what your issues were and how you fixed them? Thank you! Perhaps I will run into the same error as you; maybe there are issues with loading the data.
@TonyTangYu Definitely! I was trying to read my data directly from an S3 bucket by changing train_root_folder to point at the bucket, but that led to it getting stuck. After getting the data locally, it started to run. Hope this helps.
@karansomaiah Thank you very much! You've helped me a lot!
@karansomaiah
Use tf.global_variables_initializer instead.
/home/liuji/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:97: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
INFO:tensorflow:Restoring parameters from /home/liuji/light_head_rcnn/data/imagenet_weights/res101.ckpt
^CTraceback (most recent call last):
File "train.py", line 264, in
Hello, I also met the same problem. Could you give a more detailed solution? Thanks!
So, can you give more details about the solution?
I finished installing light head r-cnn and preparing the data. When I run 'python3 train.py -d 0', the information on the terminal is:

Start data provider ipc://@dataflow-map-pipe-1c5a64ee-0
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-1
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-2
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-3
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-4
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-5
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-6
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-7
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-8
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-9
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-10
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-11
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-12
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-13
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-14
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-15
2018-11-18 19:57:23.729985: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-11-18 19:57:23.994393: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-18 19:57:23.994853: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
totalMemory: 10.91GiB freeMemory: 9.63GiB
2018-11-18 19:57:23.994897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
WARNING:tensorflow:From /home/tangyu/softwares/anaconda3/envs/tensorflow-gpu=1.5/lib/python3.6/site-packages/tensorflow/python/util/tf_should_use.py:118: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use tf.global_variables_initializer instead.
/home/tangyu/softwares/anaconda3/envs/tensorflow-gpu=1.5/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:97: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
INFO:tensorflow:Restoring parameters from /home/tangyu/Desktop/light_head_rcnn/data/imagenet_weights/res101.ckpt

It is stuck here. I don't know what is wrong. How can I fix it?
I met the same problem. Can you show a more detailed solution?
@chanajianyu I don't know whether I can help you. At that time, I found that the code in /lib/utils/dpflow/prefetching_iter.py might cause this problem, because lines 64 and 65 call a function named wait. I commented that code out. Maybe you can add some print statements and track where it gets stuck.
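For context: those two lines are most likely the consumer side of a producer/consumer handshake, the pattern MXNet-style prefetching iterators use. The following is only a rough sketch of that pattern under that assumption, not the repo's actual code, but it shows what the wait() is there for:

import threading

class PrefetchingIterSketch:
    """Rough sketch of an Event-based prefetching iterator (NOT the repo's code).

    A background thread fetches the next batch while the trainer consumes the
    current one; two Events hand batches from producer to consumer.
    """

    def __init__(self, data_iter):
        self.data_iter = data_iter
        self.current_batch = None
        self.next_batch = None
        self.data_ready = threading.Event()   # set when a new batch has been prefetched
        self.data_taken = threading.Event()   # set when the consumer has taken the batch
        self.data_taken.set()
        self.thread = threading.Thread(target=self._prefetch, daemon=True)
        self.thread.start()

    def _prefetch(self):
        while True:
            self.data_taken.wait()            # wait until the previous batch was consumed
            try:
                self.next_batch = next(self.data_iter)
            except StopIteration:
                self.next_batch = None
            self.data_taken.clear()
            self.data_ready.set()             # announce that a new batch is available

    def iter_next(self):
        self.data_ready.wait()                # <- the kind of wait() being discussed:
                                              #    block until the prefetch thread is done
        if self.next_batch is None:
            return False
        self.current_batch = self.next_batch
        self.data_ready.clear()
        self.data_taken.set()
        return True

    def forward(self):
        if self.iter_next():                  # no batch left -> StopIteration
            return self.current_batch
        raise StopIteration

In this pattern the wait() is what keeps the trainer from reading a batch that has not been produced yet. If the underlying data provider never delivers a batch (for example because of wrong image paths), data_ready is never set and iter_next() blocks forever right after the checkpoint is restored, which matches the hang described here. Removing the wait() doesn't fix that; it just turns the silent hang into a StopIteration or garbage batches.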
However, I followed your advice and a new problem occurred: it raises StopIteration in

def forward(self):
    """Get blobs and copy them into this layer's top blob vector."""
    if self.iter_next():
        return self.current_batch
    else:
        raise StopIteration
Having the problem of getting stuck forever at:

"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
INFO:tensorflow:Restoring parameters from ~/light_head_rcnn/data/imagenet_weights/res101.ckpt

The problem is with the threading function, as noticed by many others: light_head_rcnn/lib/utils/dpflow/prefetching_iter.py, lines 83 and 69.
Did anyone resolve this? Commenting out the e.wait() does not work at all; it is not the right solution.
Using TF 1.5.0 with a single GPU (GTX 1070) in an nvidia-docker container.
Hey guys, I ran into the same problem using custom data, and I think I know what the problem is.
First, check the 'fpath' field in your .odgt files. If in your config.py you set the repo as root_dir, then 'fpath' must not include the whole (absolute) path to the picture. When I generated the .odgt files I didn't pay attention to config.py or to 'fpath', so the script was trying to load images from the wrong path.
Hope it helps!
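To make that check concrete, here is a small standalone sketch (not part of the repo; the two paths at the top are placeholders you need to adapt). It assumes the usual .odgt layout of one JSON record per line with an 'fpath' key, and verifies that every 'fpath' resolves to an existing image under train_root_folder:

import json
import os

# Placeholder paths -- point these at your own annotation file and image root.
odgt_path = 'data/MSCOCO/odformat/coco_train2014.odgt'
train_root_folder = 'data/MSCOCO'

missing = []
with open(odgt_path) as f:
    for line in f:
        record = json.loads(line)                        # one JSON record per line
        img_path = os.path.join(train_root_folder, record['fpath'])
        if not os.path.exists(img_path):
            missing.append(img_path)

print('%d images could not be found' % len(missing))
for path in missing[:10]:                                # show a few offenders
    print(path)

If this prints any missing paths, the data providers most likely never deliver a batch and training sits at the 'Restoring parameters' line forever, as described above.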
Perfect answer!!! I solved this problem by checking config.py: in that file my 'train_root_folder' did not have a "/" at the end, so the program couldn't find the images. Thank you!!!!
Indeed, the image paths given in "fpath" are joined from behind in light_head_rcnn/experiments/lizming/lighthead[...]/dataset.py this way:
os.path.join(train_root_folder, record['fpath'])
where train_root_folder is specified in light_head_rcnn/experiments/lizming/lighthead[...]/config.py:
train_root_folder = os.path.join(root_dir, 'data/MSCOCO')
Finally, root_dir is also defined in config.py as:
root_dir = osp.abspath(osp.join(osp.dirname(__file__), '..', '..', '..'))
which by default gives the root of the repository (with 'light_head_rcnn' as the last folder in the path).
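For anyone tripping over the same thing, the behavior follows directly from how os.path.join works; a quick illustration with made-up paths:

import os.path

train_root_folder = '/home/user/light_head_rcnn/data/MSCOCO'   # hypothetical

# Relative 'fpath' (the expected layout): joined under the root folder.
# os.path.join inserts the separator itself, so a missing trailing '/'
# only matters if the path is built by plain string concatenation instead.
print(os.path.join(train_root_folder, 'train2014/COCO_train2014_000000000009.jpg'))
# /home/user/light_head_rcnn/data/MSCOCO/train2014/COCO_train2014_000000000009.jpg

# Absolute 'fpath': os.path.join discards everything before it,
# so train_root_folder is silently ignored.
print(os.path.join(train_root_folder, '/some/other/place/COCO_train2014_000000000009.jpg'))
# /some/other/place/COCO_train2014_000000000009.jpg

So if 'fpath' is already an absolute path, whatever you put in train_root_folder has no effect, which matches the failures described above.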