yuweijiang / HGL-pytorch

Code for the model "Heterogeneous Graph Learning for Visual Commonsense Reasoning (NeurIPS 2019)"
MIT License

I have a problem restoring a checkpoint! #2

Open jaeyun95 opened 4 years ago

jaeyun95 commented 4 years ago

Hi! I have a problem restoring a checkpoint. Training stopped, so I tried to restore from the checkpoint but got an error. Help! T^T

restore is True
Found folder! restoring
Traceback (most recent call last):
  File "train.py", line 122, in <module>
    learning_rate_scheduler=scheduler)
  File "/home/ailab/HGL-pytorch/utils/pytorch_misc.py", line 226, in restore_checkpoint
    training_state = torch.load(training_state_path, map_location=device_mapping(-1))
  File "/home/ailab/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/serialization.py", line 368, in load
    return _load(f, map_location, pickle_module)
  File "/home/ailab/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/serialization.py", line 549, in _load
    deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: unexpected EOF, expected 4859355 more bytes. The file might be corrupted.
terminate called after throwing an instance of 'c10::Error'
  what():  owning_ptr == NullType::singleton() || owning_ptr->refcount_.load() > 0 ASSERT FAILED at /opt/conda/conda-bld/pytorch_1549628766161/work/c10/util/intrusive_ptr.h:350, please report a bug to PyTorch. intrusive_ptr: Can only intrusive_ptr::reclaim() owning pointers that were created using intrusive_ptr::release(). (reclaim at /opt/conda/conda-bld/pytorch_1549628766161/work/c10/util/intrusive_ptr.h:350)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f3920592cf5 in /home/ailab/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: THStorage_free + 0xca (0x7f38d72a68ea in /home/ailab/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #2: <unknown function> + 0x12c11d (0x7f39208d011d in /home/ailab/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #17: __libc_start_main + 0xf0 (0x7f39266a8830 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

yuweijiang commented 4 years ago

It seems that you loaded an incomplete file. Could you check the path where your checkpoint is saved to make sure the checkpoint was fully written?
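An "unexpected EOF" on load usually means the file was cut off while it was being written, for example when training stopped in the middle of a save. One way to guard against that, as a minimal sketch and not the code in this repo: write the checkpoint to a temporary file first and rename it over the target only once the write has finished. The names `atomic_save` and `try_load` are hypothetical, and `pickle` is used as a stand-in serializer so the sketch runs without PyTorch.

```python
import os
import pickle
import tempfile


def atomic_save(obj, path, save_fn=None):
    """Write obj to a temp file, then rename it over path.

    If the process dies mid-write, path still holds the previous
    complete checkpoint instead of a truncated one.
    """
    if save_fn is None:
        # Stand-in serializer (assumption); pass save_fn=torch.save
        # to write real PyTorch checkpoints.
        def save_fn(o, p):
            with open(p, "wb") as f:
                pickle.dump(o, f)

    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays
    # on one filesystem, where os.replace is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        save_fn(obj, tmp_path)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file and leave path untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def try_load(path, load_fn=None):
    """Return (obj, None) on success, or (None, error) if unreadable."""
    if load_fn is None:
        # Stand-in deserializer (assumption), matching atomic_save.
        def load_fn(p):
            with open(p, "rb") as f:
                return pickle.load(f)

    try:
        return load_fn(path), None
    except (EOFError, RuntimeError, pickle.UnpicklingError, OSError) as err:
        return None, err
```

With PyTorch installed, the same helpers should work with `save_fn=torch.save` and `load_fn=lambda p: torch.load(p, map_location="cpu")`. Running `try_load` on each file in the checkpoint folder before restoring shows which one is truncated. A truncated file cannot be repaired, so the practical fix is to delete it and restore from an earlier complete checkpoint.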

tuyunbin commented 4 years ago

Hi, how much GPU memory did you need to run this code successfully? @jaeyun95