Unable to train?

Indigo6 commented 1 year ago

Traceback (most recent call last):
  File "tools/train.py", line 225, in <module>
    main()
  File "tools/train.py", line 182, in main
    train(cfg, train_loader, model, criterion, criterion_kd, consistency_weight, kd_weight, ens_weight,
  File "/home/OKDHP/tools/../lib/core/function_okd.py", line 59, in train
    loss.backward()
  File "/home/miniconda3/envs/mm/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/miniconda3/envs/mm/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16, 256, 64, 64]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

zhengli97 commented 1 year ago

Sorry, I haven't run into this problem before. Could you provide more training details?

Indigo6 commented 1 year ago

Environment: CUDA 11.1, PyTorch 1.10.0, NVIDIA GeForce RTX 3090 GPU
Command: CUDA_VISIBLE_DEVICES=3 python tools/train.py --cfg experiments/mpii/hourglass/hg_template.yaml GPUS '(0,)' DATASET.COLOR_RGB False DATASET.DATASET 'mpii' DATASET.ROOT 'data/mpii' DATASET.NUM_JOINTS_HALF_BODY 8 DATASET.PROB_HALF_BODY -1.0 MODEL.NAME 'hourglass_okd_share_less' MODEL.EXP 'stack4_ens_weight_1_kd_2' MODEL.NUM_JOINTS 16 MODEL.INIT_WEIGHTS False MODEL.IMAGE_SIZE 256,256 MODEL.HEATMAP_SIZE 64,64 MODEL.SIGMA 2 MODEL.EXTRA.NUM_FEATURES 256 MODEL.EXTRA.NUM_STACKS 4 MODEL.EXTRA.NUM_BLOCKS 1 TRAIN.BATCH_SIZE_PER_GPU 16 TRAIN.KD_WEIGHT 2.0 TRAIN.ENS_WEIGHT 1.0 TRAIN.BEGIN_EPOCH 0 TRAIN.END_EPOCH 150 TEST.BATCH_SIZE_PER_GPU 16 DEBUG.DEBUG False

Full error log:

=> creating output/mpii/hourglass_okd_share_less/hg_template_stack4_ens_weight_1_kd_2
=> creating output/mpii/hourglass_okd_share_less/hg_template_stack4_ens_weight_1_kd_2/tensorboard_log
Namespace(cfg='experiments/mpii/hourglass/hg_template.yaml', dataDir='', logDir='', modelDir='', opts=['GPUS', '(0,)', 'DATASET.COLOR_RGB', 'False', 'DATASET.DATASET', 'mpii', 'DATASET.ROOT', 'your_data_directory', 'DATASET.NUM_JOINTS_HALF_BODY', '8', 'DATASET.PROB_HALF_BODY', '-1.0', 'MODEL.NAME', 'hourglass_okd_share_less', 'MODEL.EXP', 'stack4_ens_weight_1_kd_2', 'MODEL.NUM_JOINTS', '16', 'MODEL.INIT_WEIGHTS', 'False', 'MODEL.IMAGE_SIZE', '256,256', 'MODEL.HEATMAP_SIZE', '64,64', 'MODEL.SIGMA', '2', 'MODEL.EXTRA.NUM_FEATURES', '256', 'MODEL.EXTRA.NUM_STACKS', '4', 'MODEL.EXTRA.NUM_BLOCKS', '1', 'TRAIN.BATCH_SIZE_PER_GPU', '16', 'TRAIN.KD_WEIGHT', '2.0', 'TRAIN.ENS_WEIGHT', '1.0', 'TRAIN.BEGIN_EPOCH', '0', 'TRAIN.END_EPOCH', '150', 'TEST.BATCH_SIZE_PER_GPU', '16', 'DEBUG.DEBUG', 'False'], prevModelDir='')
AUTO_RESUME: True
CUDNN:
  BENCHMARK: True
  DETERMINISTIC: False
  ENABLED: True
DATASET:
  CACHE_ROOT: data/cache
  COLOR_RGB: False
  DATASET: mpii
  DATA_FORMAT: jpg
  FLIP: True
  HYBRID_JOINTS_TYPE: 
  NUM_JOINTS_HALF_BODY: 8
  PROB_HALF_BODY: -1.0
  ROOT: data/mpii/
  ROT_FACTOR: 30
  SCALE_FACTOR: 0.25
  SELECT_DATA: False
  TEST_SET: valid
  TRAIN_SET: train
DATA_DIR: 
DEBUG:
  DEBUG: False
  SAVE_BATCH_IMAGES_GT: True
  SAVE_BATCH_IMAGES_PRED: True
  SAVE_HEATMAPS_GT: True
  SAVE_HEATMAPS_PRED: True
GPUS: (0,)
KD:
  ALPHA: 0.5
  TEACHER: 
  TRAIN_TYPE: NORMAL
LOG_DIR: log
LOSS:
  TOPK: 8
  USE_DIFFERENT_JOINTS_WEIGHT: False
  USE_OHKM: False
  USE_TARGET_WEIGHT: True
MODEL:
  EXP: stack4_ens_weight_1_kd_2
  EXTRA:
    NUM_BLOCKS: 1
    NUM_BRANCH: 3
    NUM_FEATURES: 256
    NUM_STACKS: 4
  HEATMAP_SIZE: [64, 64]
  IMAGE_SIZE: [256, 256]
  INIT_WEIGHTS: False
  NAME: hourglass_okd_share_less
  NUM_JOINTS: 16
  PRETRAINED: models/pytorch/imagenet/resnet50-19c8e357.pth
  SIGMA: 2
  TAG_PER_JOINT: True
  TARGET_TYPE: gaussian
OUTPUT_DIR: output
PIN_MEMORY: True
PRINT_FREQ: 100
RANK: 0
TEST:
  BATCH_SIZE_PER_GPU: 16
  BBOX_THRE: 1.0
  COCO_BBOX_FILE: 
  FLIP_TEST: True
  IMAGE_THRE: 0.1
  IN_VIS_THRE: 0.0
  MODEL_FILE: 
  NMS_THRE: 0.6
  OKS_THRE: 0.5
  POST_PROCESS: True
  SHIFT_HEATMAP: True
  SOFT_NMS: False
  USE_GT_BBOX: False
TRAIN:
  BATCH_SIZE_PER_GPU: 16
  BEGIN_EPOCH: 0
  CHECKPOINT: 
  END_EPOCH: 150
  ENS_WEIGHT: 1.0
  GAMMA1: 0.99
  GAMMA2: 0.0
  KD_WEIGHT: 2.0
  LENGTH: 90
  LR: 0.00025
  LR_FACTOR: 0.1
  LR_STEP: [90, 120]
  MOMENTUM: 0.9
  NESTEROV: False
  OPTIMIZER: adam
  RESUME: False
  SHUFFLE: True
  WD: 0.0001
WORKERS: 8
test

Total Parameters: 30,915,696
----------------------------------------------------------------------------------------------------------------------------------
Total Multiply Adds (For Convolution and Linear Layers only): 46.38380718231201 GFLOPs
----------------------------------------------------------------------------------------------------------------------------------
Number of Layers
Conv2d : 377 layers   BatchNorm2d : 357 layers   ReLU : 429 layers   Bottleneck : 115 layers   MaxPool2d : 1 layers   Linear : 4 layers   Softmax : 1 layers   Upsample : 32 layers   Hourglass : 8 layers   
=> load 22246 samples
=> load 2958 samples
Current Epoch: 0
  0%|          | 0/1391 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "tools/train.py", line 225, in <module>
    main()
  File "tools/train.py", line 182, in main
    train(cfg, train_loader, model, criterion, criterion_kd, consistency_weight, kd_weight, ens_weight,
  File "/home/OKDHP/tools/../lib/core/function_okd.py", line 59, in train
    loss.backward()
  File "/home/miniconda3/envs/mm/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/miniconda3/envs/mm/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16, 256, 64, 64]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

zhengli97 commented 1 year ago

The training command looks fine. Did you make any modifications to the training code?

Indigo6 commented 1 year ago

I only added tools/_init_paths.py, copied from FPD/HRNet, because the freshly cloned code initially failed with ModuleNotFoundError: No module named '_init_paths'.
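For anyone hitting the same ModuleNotFoundError: a helper of this kind usually does nothing more than put the repository's lib directory on sys.path, which is consistent with the tools/../lib path in the traceback above. The following is a minimal sketch under that assumption. The function names and the "OKDHP/tools" placeholder are illustrative and are not copied from the OKDHP or HRNet sources.

```python
import os.path as osp
import sys


def add_path(path):
    """Prepend `path` to sys.path once, so modules under it become importable."""
    if path not in sys.path:
        sys.path.insert(0, path)


def lib_dir_for(tools_dir):
    """Return the sibling lib directory of a tools directory (tools/../lib)."""
    return osp.join(tools_dir, "..", "lib")


# Placeholder location; inside a real tools/_init_paths.py this would
# normally be osp.dirname(__file__) instead of a hard-coded string.
tools_dir = osp.join("OKDHP", "tools")
add_path(lib_dir_for(tools_dir))
```

With such a file in tools/, an `import _init_paths` at the top of tools/train.py runs before any import that depends on lib/.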

zhengli97 commented 1 year ago

Sorry, this code was written two years ago, and I no longer have an environment in which to reproduce this error. You can try:
1. Call torch.autograd.set_detect_anomaly(True) to localize the bug.
2. In hourglass_okd_share_less.py line 137, change nn.ReLU(inplace=True) to nn.ReLU(inplace=False). (I don't know whether this will work.)
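To illustrate the first suggestion, here is a small self-contained sketch of anomaly detection on a toy graph. It is not the OKDHP training loop: `failing_step` is a hypothetical stand-in, and `add_` is used only as an example of an in-place edit made to a ReLU output.

```python
import torch


def failing_step():
    """Toy stand-in for one training step that breaks autograd."""
    x = torch.randn(4, requires_grad=True)
    y = torch.relu(x)  # ReLU keeps its output for the backward pass
    y.add_(1.0)        # in-place edit: the saved output is now stale
    y.sum().backward()


# Anomaly mode records a traceback for every forward op. It is slow,
# so it is meant for debugging runs only.
torch.autograd.set_detect_anomaly(True)
try:
    failing_step()
    message = None
except RuntimeError as err:
    message = str(err)
finally:
    torch.autograd.set_detect_anomaly(False)

print(message)
```

In anomaly mode PyTorch also emits a warning containing the traceback of the forward call whose backward failed (here the `torch.relu` line). The in-place operation itself is not named, so the place to look is whatever touches that tensor after the reported line.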

renjie-liang commented 1 year ago

I solved the bug: "unsqueeze_" needs to be replaced with "unsqueeze" in hourglass_okd_share_less.py:223 and hourglass_okd.py:239.
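The reason this works can be shown on a toy example. `unsqueeze_` modifies the tensor in place and bumps its version counter, while ReLU has saved that same tensor for its backward pass. `unsqueeze` returns a new view and leaves the saved tensor untouched. The sketch below is not the code from the two files above. `backward_through` and the tensor shapes are made up for illustration.

```python
import torch
import torch.nn as nn


def backward_through(use_inplace):
    """Run forward and backward with either unsqueeze_ or unsqueeze."""
    relu = nn.ReLU()
    x = torch.randn(2, 3, requires_grad=True)
    feat = relu(x)  # ReLU saves `feat` for its backward pass
    if use_inplace:
        out = feat.unsqueeze_(0)  # in-place: bumps feat's version counter
    else:
        out = feat.unsqueeze(0)   # out-of-place: feat stays at version 0
    out.sum().backward()
    return x.grad


try:
    backward_through(use_inplace=True)
    inplace_error = None
except RuntimeError as err:
    inplace_error = err  # the "modified by an inplace operation" error

grad = backward_through(use_inplace=False)  # backward succeeds
```

The same reasoning applies to other trailing-underscore tensor methods: if the tensor they modify was saved by an earlier op for backward, the out-of-place variant is the safer choice.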

zhengli97 commented 1 year ago

Thanks for your reply! @Indigo6 You can try this.

zhengli97 / OKDHP

Unable to train? #6