Sorry, I haven't run into this problem before. Could you provide more training details?
CUDA_VISIBLE_DEVICES=3 python tools/train.py --cfg experiments/mpii/hourglass/hg_template.yaml GPUS '(0,)' DATASET.COLOR_RGB False DATASET.DATASET 'mpii' DATASET.ROOT 'data/mpii' DATASET.NUM_JOINTS_HALF_BODY 8 DATASET.PROB_HALF_BODY -1.0 MODEL.NAME 'hourglass_okd_share_less' MODEL.EXP 'stack4_ens_weight_1_kd_2' MODEL.NUM_JOINTS 16 MODEL.INIT_WEIGHTS False MODEL.IMAGE_SIZE 256,256 MODEL.HEATMAP_SIZE 64,64 MODEL.SIGMA 2 MODEL.EXTRA.NUM_FEATURES 256 MODEL.EXTRA.NUM_STACKS 4 MODEL.EXTRA.NUM_BLOCKS 1 TRAIN.BATCH_SIZE_PER_GPU 16 TRAIN.KD_WEIGHT 2.0 TRAIN.ENS_WEIGHT 1.0 TRAIN.BEGIN_EPOCH 0 TRAIN.END_EPOCH 150 TEST.BATCH_SIZE_PER_GPU 16 DEBUG.DEBUG False
Full error log:
=> creating output/mpii/hourglass_okd_share_less/hg_template_stack4_ens_weight_1_kd_2
=> creating output/mpii/hourglass_okd_share_less/hg_template_stack4_ens_weight_1_kd_2/tensorboard_log
Namespace(cfg='experiments/mpii/hourglass/hg_template.yaml', dataDir='', logDir='', modelDir='', opts=['GPUS', '(0,)', 'DATASET.COLOR_RGB', 'False', 'DATASET.DATASET', 'mpii', 'DATASET.ROOT', 'your_data_directory', 'DATASET.NUM_JOINTS_HALF_BODY', '8', 'DATASET.PROB_HALF_BODY', '-1.0', 'MODEL.NAME', 'hourglass_okd_share_less', 'MODEL.EXP', 'stack4_ens_weight_1_kd_2', 'MODEL.NUM_JOINTS', '16', 'MODEL.INIT_WEIGHTS', 'False', 'MODEL.IMAGE_SIZE', '256,256', 'MODEL.HEATMAP_SIZE', '64,64', 'MODEL.SIGMA', '2', 'MODEL.EXTRA.NUM_FEATURES', '256', 'MODEL.EXTRA.NUM_STACKS', '4', 'MODEL.EXTRA.NUM_BLOCKS', '1', 'TRAIN.BATCH_SIZE_PER_GPU', '16', 'TRAIN.KD_WEIGHT', '2.0', 'TRAIN.ENS_WEIGHT', '1.0', 'TRAIN.BEGIN_EPOCH', '0', 'TRAIN.END_EPOCH', '150', 'TEST.BATCH_SIZE_PER_GPU', '16', 'DEBUG.DEBUG', 'False'], prevModelDir='')
AUTO_RESUME: True
CUDNN:
  BENCHMARK: True
  DETERMINISTIC: False
  ENABLED: True
DATASET:
  CACHE_ROOT: data/cache
  COLOR_RGB: False
  DATASET: mpii
  DATA_FORMAT: jpg
  FLIP: True
  HYBRID_JOINTS_TYPE:
  NUM_JOINTS_HALF_BODY: 8
  PROB_HALF_BODY: -1.0
  ROOT: data/mpii/
  ROT_FACTOR: 30
  SCALE_FACTOR: 0.25
  SELECT_DATA: False
  TEST_SET: valid
  TRAIN_SET: train
DATA_DIR:
DEBUG:
  DEBUG: False
  SAVE_BATCH_IMAGES_GT: True
  SAVE_BATCH_IMAGES_PRED: True
  SAVE_HEATMAPS_GT: True
  SAVE_HEATMAPS_PRED: True
GPUS: (0,)
KD:
  ALPHA: 0.5
  TEACHER:
  TRAIN_TYPE: NORMAL
LOG_DIR: log
LOSS:
  TOPK: 8
  USE_DIFFERENT_JOINTS_WEIGHT: False
  USE_OHKM: False
  USE_TARGET_WEIGHT: True
MODEL:
  EXP: stack4_ens_weight_1_kd_2
  EXTRA:
    NUM_BLOCKS: 1
    NUM_BRANCH: 3
    NUM_FEATURES: 256
    NUM_STACKS: 4
  HEATMAP_SIZE: [64, 64]
  IMAGE_SIZE: [256, 256]
  INIT_WEIGHTS: False
  NAME: hourglass_okd_share_less
  NUM_JOINTS: 16
  PRETRAINED: models/pytorch/imagenet/resnet50-19c8e357.pth
  SIGMA: 2
  TAG_PER_JOINT: True
  TARGET_TYPE: gaussian
OUTPUT_DIR: output
PIN_MEMORY: True
PRINT_FREQ: 100
RANK: 0
TEST:
  BATCH_SIZE_PER_GPU: 16
  BBOX_THRE: 1.0
  COCO_BBOX_FILE:
  FLIP_TEST: True
  IMAGE_THRE: 0.1
  IN_VIS_THRE: 0.0
  MODEL_FILE:
  NMS_THRE: 0.6
  OKS_THRE: 0.5
  POST_PROCESS: True
  SHIFT_HEATMAP: True
  SOFT_NMS: False
  USE_GT_BBOX: False
TRAIN:
  BATCH_SIZE_PER_GPU: 16
  BEGIN_EPOCH: 0
  CHECKPOINT:
  END_EPOCH: 150
  ENS_WEIGHT: 1.0
  GAMMA1: 0.99
  GAMMA2: 0.0
  KD_WEIGHT: 2.0
  LENGTH: 90
  LR: 0.00025
  LR_FACTOR: 0.1
  LR_STEP: [90, 120]
  MOMENTUM: 0.9
  NESTEROV: False
  OPTIMIZER: adam
  RESUME: False
  SHUFFLE: True
  WD: 0.0001
WORKERS: 8
test
Total Parameters: 30,915,696
----------------------------------------------------------------------------------------------------------------------------------
Total Multiply Adds (For Convolution and Linear Layers only): 46.38380718231201 GFLOPs
----------------------------------------------------------------------------------------------------------------------------------
Number of Layers
Conv2d : 377 layers
BatchNorm2d : 357 layers
ReLU : 429 layers
Bottleneck : 115 layers
MaxPool2d : 1 layers
Linear : 4 layers
Softmax : 1 layers
Upsample : 32 layers
Hourglass : 8 layers
=> load 22246 samples
=> load 2958 samples
Current Epoch: 0
0%| | 0/1391 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "tools/train.py", line 225, in <module>
    main()
  File "tools/train.py", line 182, in main
    train(cfg, train_loader, model, criterion, criterion_kd, consistency_weight, kd_weight, ens_weight,
  File "/home/OKDHP/tools/../lib/core/function_okd.py", line 59, in train
    loss.backward()
  File "/home/miniconda3/envs/mm/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/miniconda3/envs/mm/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16, 256, 64, 64]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
The training command looks fine. Did you make any modifications to the training code?
I added tools/_init_paths.py (copied from FPD/HRNet), because the cloned code initially failed with ModuleNotFoundError: No module named '_init_paths'.
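For anyone else hitting that ModuleNotFoundError: the HRNet-style tools/_init_paths.py is just a small sys.path shim. A sketch based on the FPD/HRNet version (the tools/../lib layout is assumed from the traceback above):

# tools/_init_paths.py -- put the repo's lib/ directory on sys.path so that
# modules such as core.function_okd can be imported from tools/train.py.
import os.path as osp
import sys


def add_path(path):
    # Prepend so the repository code shadows any installed packages.
    if path not in sys.path:
        sys.path.insert(0, path)


this_dir = osp.dirname(__file__)
lib_path = osp.join(this_dir, '..', 'lib')
add_path(lib_path)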
Sorry, this code was written two years ago, and I no longer have the environment to reproduce this error. You can try:
1. setting torch.autograd.set_detect_anomaly(True) to localize the bug, or
2. changing nn.ReLU(inplace=False) to nn.ReLU(inplace=True) at line 137 of hourglass_okd_share_less.py (I don't know whether this will work).
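For reference, option 1 is a one-line change. A minimal sketch, assuming it is added in tools/train.py before the training loop (anywhere before the first loss.backward() call works; it slows training, so remove it once the offending op is located):

import torch

# Record forward-pass stack traces so that the RuntimeError raised during
# backward() also reports where the in-place modification happened.
torch.autograd.set_detect_anomaly(True)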
I solved the bug: unsqueeze_ needs to be replaced with unsqueeze in hourglass_okd_share_less.py:223 and hourglass_okd.py:239.
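For anyone curious why that works: the in-place unsqueeze_ bumps the version counter of a tensor that autograd saved for the backward pass, which is exactly what the "output 0 of ReluBackward0 ... is at version 1" message above complains about. A standalone sketch (not the repository code) that reproduces the same error:

import torch


def run(inplace):
    x = torch.randn(4, requires_grad=True)
    y = torch.relu(x)       # autograd saves y for ReluBackward0
    if inplace:
        y.unsqueeze_(0)     # in place: bumps y's version counter to 1
    else:
        y = y.unsqueeze(0)  # out of place: the saved tensor stays at version 0
    try:
        y.sum().backward()
        print(f"inplace={inplace}: backward ok")
    except RuntimeError as err:
        print(f"inplace={inplace}: {err}")


run(inplace=True)   # prints the same "modified by an inplace operation" error
run(inplace=False)  # gradients flow normally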
Thanks for your reply! @Indigo6, you can try this fix.