Closed WoodsGao closed 5 years ago
torch.utils.cpp_extension.load is the function that compiles the C++/CUDA code. With the provided information, I cannot see what the problem is. Do you have any CPU load while the program is stuck? Maybe you can test other PyTorch code that uses dynamic compilation. Or you could comment out https://github.com/zhou13/neurvps/blob/e2f8d19114dd6785ebfa95dbfb11d34ede6c908e/neurvps/models/deformable.py#L21 to see if you get more warnings.
Feel free to reopen this issue if you have more clues and updates.
Hello, I was trying to run inference with the StyleGAN2 PyTorch model but I'm getting the same issue. Please help me out if you found a solution.
Did you fix the issue? I have the same problem. Thanks.
@Agrechka and @yashnsn I found a solution if you guys still need it: go to your .cache directory, delete the lock file for your cpp extension (it is likely under a directory like ~/.cache/torch_extensions/something), and you should be able to run it again.
If you can't find your cache directory, you can run python -m pdb your_program.py, break at your .../lib/python3.X/site-packages/torch/utils/cpp_extension.py line 1179 (specifically the line containing "baton = FileBaton(os.path.join(build_directory, 'lock'))"), and then print "build_directory". That should be the cache directory for your program.
Hope this helps!
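The cleanup above can be sketched in a few lines of Python. This is only an illustrative snippet, not part of PyTorch: it assumes the default build root is ~/.cache/torch_extensions (or whatever TORCH_EXTENSIONS_DIR points to) and that each extension keeps its lock file directly under its own subdirectory, so adjust the paths for your setup.

```python
import glob
import os

# Default build root for JIT-compiled extensions; TORCH_EXTENSIONS_DIR
# overrides it when set (assumption: the standard cache layout is used).
build_root = os.environ.get(
    "TORCH_EXTENSIONS_DIR",
    os.path.expanduser("~/.cache/torch_extensions"),
)

# Remove any stale 'lock' files left behind by an interrupted build,
# e.g. ~/.cache/torch_extensions/<extension_name>/lock.
for lock in glob.glob(os.path.join(build_root, "*", "lock")):
    print("removing", lock)
    os.remove(lock)
```

After the stale lock files are gone, rerunning the program lets torch.utils.cpp_extension.load acquire a fresh lock and proceed with compilation.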
I removed the .cache directory, but the same issue still occurs.
Exactly the same issue as yashnsn and Agrechka. Thank you so much @KellyYutongHe!
@KellyYutongHe you're a hero
@KellyYutongHe Thank you so much !! You saved my lot of time.
Thanks for the great answer. Also, for those who have difficulty finding what the "something" is in "~/.cache/torch_extensions/something": I found it useful to evaluate the expression "os.path.join(build_directory, 'lock')" in a remote debug session (I use PyCharm remote debugging), which gives you the exact path. For me, the "something" happened to be "spmm_0", so after "rm -rf ~/.cache/torch_extensions/spmm_0" the bug was fixed.
It works!
CUDA version: 9.0, Python version: 3.6.8, PyTorch version: 1.2.0
I downloaded the tmm17 dataset and pre-trained model from Google Drive and used the command
to evaluate the tmm17 dataset, but after outputting
, the program produces no further output. When I interrupt it, I can see that it is stuck in the "torch.utils.cpp_extension.load" function. Is there a problem with this operation?
This is the complete output: