zhengli97 / OKDHP

[ICCV 2021] Official PyTorch Code for "Online Knowledge Distillation for Efficient Pose Estimation"
https://zhengli97.github.io/OKDHP/

Unable to train? #6

Closed Indigo6 closed 1 year ago

Indigo6 commented 1 year ago
Traceback (most recent call last):
  File "tools/train.py", line 225, in <module>
    main()
  File "tools/train.py", line 182, in main
    train(cfg, train_loader, model, criterion, criterion_kd, consistency_weight, kd_weight, ens_weight,
  File "/home/OKDHP/tools/../lib/core/function_okd.py", line 59, in train
    loss.backward()
  File "/home/miniconda3/envs/mm/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/miniconda3/envs/mm/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16, 256, 64, 64]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
zhengli97 commented 1 year ago

Sorry, I haven't encountered this problem before. Can you provide more training details?

Indigo6 commented 1 year ago
zhengli97 commented 1 year ago

The training command looks fine. Did you make any modifications to the training code?

Indigo6 commented 1 year ago

I added tools/_init_paths.py, copied from FPD/HRNet, because the cloned code initially failed with ModuleNotFoundError: No module named '_init_paths'.
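For readers hitting the same ModuleNotFoundError, here is a minimal sketch of what such a path helper typically does. It assumes the repository keeps tools/ and lib/ side by side (consistent with the tools/../lib path in the traceback above); the function names add_path and lib_path_from are illustrative and not taken from the OKDHP code.

```python
import os.path as osp
import sys


def add_path(path):
    """Prepend `path` to sys.path once so modules under it become importable."""
    if path not in sys.path:
        sys.path.insert(0, path)


def lib_path_from(tools_dir):
    """Build the path to lib/ from the tools/ directory (assumed repo layout)."""
    return osp.join(tools_dir, "..", "lib")


# Inside a hypothetical tools/_init_paths.py, the module body would call:
#     add_path(lib_path_from(osp.dirname(__file__)))
# so that `import _init_paths` at the top of train.py exposes the lib/ packages.
```

Importing the helper before any project imports is what makes the packages under lib/ resolvable from the training script.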

zhengli97 commented 1 year ago

Sorry, this code was written two years ago, and I no longer have the environment to reproduce this error. You can try: 1. Call torch.autograd.set_detect_anomaly(True) to localize the bug, or 2. In hourglass_okd_share_less.py line 137, change nn.ReLU(inplace=True) to nn.ReLU(inplace=False). (I don't know if this will work.)
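As a rough illustration of the first suggestion, the sketch below reproduces the same class of error on a toy tensor and runs it under anomaly detection. The in-place add_ on the ReLU output is a stand-in chosen for the demo, not the actual offending line in the repository. With anomaly detection enabled, PyTorch also emits a warning containing the forward-pass traceback of the operation whose backward failed.

```python
import torch
import torch.nn as nn


def backward_under_anomaly_detection():
    """Return the autograd error message, or None if backward succeeds."""
    x = torch.randn(2, 3, requires_grad=True)
    # Context-manager form; torch.autograd.set_detect_anomaly(True) enables it globally.
    with torch.autograd.detect_anomaly():
        y = nn.ReLU()(x)   # ReLU saves its output for the backward pass
        y.add_(1.0)        # stand-in in-place op that invalidates the saved output
        try:
            y.sum().backward()
        except RuntimeError as err:
            return str(err)
    return None


if __name__ == "__main__":
    print(backward_under_anomaly_detection())
```

Anomaly detection slows training noticeably, so it is best switched on only while hunting for the failing operation.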

renjie-liang commented 1 year ago

I solved the bug: "unsqueeze_" needs to be replaced by "unsqueeze" in hourglass_okd_share_less.py:223 and hourglass_okd.py:239.

zhengli97 commented 1 year ago

Thanks for your reply! @Indigo6 You can try this.
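A minimal sketch of why this fix works, using a toy tensor rather than the repository's model: ReLU keeps its output for the backward pass, so reshaping that output in place with unsqueeze_ bumps the tensor's version counter and autograd rejects it, while the out-of-place unsqueeze returns a new view and leaves the saved tensor untouched. The function name run_backward and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn


def run_backward(use_inplace):
    """Return (succeeded, error_message) for a backward pass through ReLU."""
    x = torch.randn(2, 3, 4, 4, requires_grad=True)
    out = nn.ReLU()(x)            # output is saved for ReluBackward0
    if use_inplace:
        out.unsqueeze_(1)         # in-place: modifies the saved output
    else:
        out = out.unsqueeze(1)    # out-of-place: new view, saved output intact
    try:
        out.sum().backward()
    except RuntimeError as err:
        return False, str(err)
    return True, None


if __name__ == "__main__":
    print("unsqueeze :", run_backward(use_inplace=False)[0])
    print("unsqueeze_:", run_backward(use_inplace=True)[0])
```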