ADer (https://arxiv.org/abs/2406.03262) is an open-source, PyTorch-based visual anomaly detection toolbox that supports multiple popular AD datasets and approaches.
I formatted my dataset to exactly match the MVTec AD layout, and then modified some of the code so I could train on my new data.
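For context, my tree under `data/cj` mirrors MVTec AD (`<class>/train/good`, `<class>/test/<defect>`, `<class>/ground_truth/<defect>`), and I generated `meta.json` roughly like the sketch below. The field names are from my reading of the repo's dataset-preparation scripts, so they may be slightly off from what `DefaultAD` actually expects:

```python
# Sketch of my meta.json generation for the single class "pizza" under
# data/cj (field names assumed from the MVTec-style meta format).
import json
from pathlib import Path

root = Path("data/cj")
cls_name = "pizza"
meta = {"train": {cls_name: []}, "test": {cls_name: []}}

for split in ("train", "test"):
    for img in sorted((root / cls_name / split).rglob("*.png")):
        specie = img.parent.name  # "good" or a defect type
        anomaly = int(specie != "good")
        # MVTec keeps masks in ground_truth/<defect>/<stem>_mask.png
        mask = root / cls_name / "ground_truth" / specie / f"{img.stem}_mask.png"
        meta[split][cls_name].append({
            "img_path": str(img.relative_to(root)),
            "mask_path": str(mask.relative_to(root)) if anomaly else "",
            "cls_name": cls_name,
            "specie_name": specie,
            "anomaly": anomaly,
        })

(root / "meta.json").write_text(json.dumps(meta, indent=2))
```

The `data.train_length : 477` and `data.test_length : 93` in the log below match my image counts, so the metadata at least loads correctly.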
But after a short period of training, it gets stuck and never progresses past the point shown below ("2/3"). The GPU shows no utilization anymore, yet its memory is still allocated. Could I get some help with this?
The full output after running the training command is shown below:
```
CUDA_VISIBLE_DEVICES=0 python run.py -c configs/vitad/vitad_cj.py -m train
08/30 10:46:56 AM - ==> Logging on master GPU: 0
08/30 10:46:56 AM - ==> Running Trainer: ViTADTrainer
08/30 10:46:56 AM - ==> Using GPU: [0] for Training
08/30 10:46:56 AM - ==> Building model
08/30 10:46:56 AM - Loading pretrained weights from Hugging Face hub (timm/vit_small_patch16_224.dino)
08/30 10:46:57 AM - [timm/vit_small_patch16_224.dino] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors.
08/30 10:46:57 AM - Resized position embedding: (14, 14) to (16, 16).
08/30 10:46:57 AM -
------------------------------------ ViTAD ------------------------------------
08/30 10:46:57 AM - ==> Creating optimizer
08/30 10:46:57 AM - ==> Loading dataset: DefaultAD
08/30 10:46:57 AM - ==> ** cfg **
fvcore_is : True
fvcore_b : 1
fvcore_c : 3
epoch_full : 100
metrics : ['mAUROC_sp_max', 'mAP_sp_max', 'mF1_max_sp_max', 'mAUPRO_px', 'mAUROC_px', 'mAP_px', 'mF1_max_px', 'mF1_px_0.2_0.8_0.1', 'mAcc_px_0.2_0.8_0.1', 'mIoU_px_0.2_0.8_0.1', 'mIoU_max_px']
use_adeval : True
evaluator.kwargs : {'metrics': ['mAUROC_sp_max', 'mAP_sp_max', 'mF1_max_sp_max', 'mAUPRO_px', 'mAUROC_px', 'mAP_px', 'mF1_max_px', 'mF1_px_0.2_0.8_0.1', 'mAcc_px_0.2_0.8_0.1', 'mIoU_px_0.2_0.8_0.1', 'mIoU_max_px'], 'pooling_ks': [16, 16], 'max_step_aupro': 100}
vis : False
vis_dir : None
optim.lr : 0.0004
optim.kwargs : {'name': 'adamw', 'betas': (0.9, 0.999), 'eps': 1e-08, 'weight_decay': 0.0001, 'amsgrad': False}
trainer.name : ViTADTrainer
trainer.checkpoint : runs
trainer.logdir_sub :
trainer.resume_dir :
trainer.cuda_deterministic : False
trainer.epoch_full : 100
trainer.scheduler_kwargs : {'name': 'step', 'lr_noise': None, 'noise_pct': 0.67, 'noise_std': 1.0, 'noise_seed': 42, 'lr_min': 4e-06, 'warmup_lr': 4.0000000000000003e-07, 'warmup_iters': -1, 'cooldown_iters': 0, 'warmup_epochs': 0, 'cooldown_epochs': 0, 'use_iters': True, 'patience_iters': 0, 'patience_epochs': 0, 'decay_iters': 0, 'decay_epochs': 80, 'cycle_decay': 0.1, 'decay_rate': 0.1}
trainer.mixup_kwargs : {'mixup_alpha': 0.8, 'cutmix_alpha': 1.0, 'cutmix_minmax': None, 'prob': 0.0, 'switch_prob': 0.5, 'mode': 'batch', 'correct_lam': True, 'label_smoothing': 0.1}
trainer.test_start_epoch : 100
trainer.test_per_epoch : 10
trainer.find_unused_parameters : False
trainer.sync_BN : apex
trainer.dist_BN :
trainer.scaler : none
trainer.data.batch_size : 32
trainer.data.batch_size_per_gpu : 32
trainer.data.batch_size_test : 32
trainer.data.batch_size_per_gpu_test : 32
trainer.data.num_workers_per_gpu : 4
trainer.data.drop_last : True
trainer.data.pin_memory : True
trainer.data.persistent_workers : False
trainer.data.num_workers : 4
trainer.iter : 0
trainer.epoch : 0
trainer.iter_full : 1400
trainer.metric_recorder : {'mAUROC_sp_max_pizza': [], 'mAP_sp_max_pizza': [], 'mF1_max_sp_max_pizza': [], 'mAUPRO_px_pizza': [], 'mAUROC_px_pizza': [], 'mAP_px_pizza': [], 'mF1_max_px_pizza': [], 'mF1_px_0.2_0.8_0.1_pizza': [], 'mAcc_px_0.2_0.8_0.1_pizza': [], 'mIoU_px_0.2_0.8_0.1_pizza': [], 'mIoU_max_px_pizza': []}
loss.loss_terms : [{'type': 'CosLoss', 'name': 'cos', 'avg': False, 'lam': 1.0}]
loss.clip_grad : 5.0
loss.create_graph : False
loss.retain_graph : False
adv : False
logging.log_terms_train : [{'name': 'batch_t', 'fmt': ':>5.3f', 'add_name': 'avg'}, {'name': 'data_t', 'fmt': ':>5.3f'}, {'name': 'optim_t', 'fmt': ':>5.3f'}, {'name': 'lr', 'fmt': ':>7.6f'}, {'name': 'cos', 'suffixes': [''], 'fmt': ':>5.3f', 'add_name': 'avg'}]
logging.log_terms_test : [{'name': 'batch_t', 'fmt': ':>5.3f', 'add_name': 'avg'}, {'name': 'cos', 'suffixes': [''], 'fmt': ':>5.3f', 'add_name': 'avg'}]
logging.train_reset_log_per : 50
logging.train_log_per : 50
logging.test_log_per : 50
data.sampler : naive
data.loader_type : pil
data.loader_type_target : pil_L
data.type : DefaultAD
data.root : data/cj
data.meta : meta.json
data.cls_names : []
data.train_transforms : [{'type': 'Resize', 'size': (256, 256), 'interpolation': <InterpolationMode.BILINEAR: 'bilinear'>}, {'type': 'CenterCrop', 'size': (256, 256)}, {'type': 'ToTensor'}, {'type': 'Normalize', 'mean': (0.485, 0.456, 0.406), 'std': (0.229, 0.224, 0.225), 'inplace': True}]
data.test_transforms : [{'type': 'Resize', 'size': (256, 256), 'interpolation': <InterpolationMode.BILINEAR: 'bilinear'>}, {'type': 'CenterCrop', 'size': (256, 256)}, {'type': 'ToTensor'}, {'type': 'Normalize', 'mean': (0.485, 0.456, 0.406), 'std': (0.229, 0.224, 0.225), 'inplace': True}]
data.target_transforms : [{'type': 'Resize', 'size': (256, 256), 'interpolation': <InterpolationMode.BILINEAR: 'bilinear'>}, {'type': 'CenterCrop', 'size': (256, 256)}, {'type': 'ToTensor'}]
data.train_size : 14
data.test_size : 3
data.train_length : 477
data.test_length : 93
model_t.name : vit_small_patch16_224_dino
model_t.kwargs : {'pretrained': True, 'checkpoint_path': '', 'pretrained_strict': False, 'strict': True, 'img_size': 256, 'teachers': [3, 6, 9], 'neck': [12]}
model_f.name : fusion
model_f.kwargs : {'pretrained': False, 'checkpoint_path': '', 'strict': False, 'dim': 384, 'mul': 1}
model_s.name : de_vit_small_patch16_224_dino
model_s.kwargs : {'pretrained': False, 'checkpoint_path': '', 'strict': False, 'img_size': 256, 'students': [3, 6, 9], 'depth': 9}
model.name : vitad
model.kwargs : {'pretrained': False, 'checkpoint_path': '', 'strict': True, 'model_t': Namespace(name='vit_small_patch16_224_dino', kwargs={'pretrained': True, 'checkpoint_path': '', 'pretrained_strict': False, 'strict': True, 'img_size': 256, 'teachers': [3, 6, 9], 'neck': [12]}), 'model_f': Namespace(name='fusion', kwargs={'pretrained': False, 'checkpoint_path': '', 'strict': False, 'dim': 384, 'mul': 1}), 'model_s': Namespace(name='de_vit_small_patch16_224_dino', kwargs={'pretrained': False, 'checkpoint_path': '', 'strict': False, 'img_size': 256, 'students': [3, 6, 9], 'depth': 9})}
seed : 42
size : 256
warmup_epochs : 0
test_start_epoch : 100
test_per_epoch : 10
batch_train : 32
batch_test_per : 32
lr : 0.0004
weight_decay : 0.0001
cfg_path : configs.vitad.vitad_cj
mode : train
sleep : 0
memory : -1
dist_url : env://
logger_rank : 0
opts : []
command : python3 -m torch.distributed.launch --nproc_per_node=$nproc_per_node --nnodes=$nnodes --node_rank=$node_rank --master_addr=$master_addr --master_port=$master_port --use_env run.py -c configs.vitad.vitad_cj -m train --sleep 0 --memory -1 --dist_url env:// --logger_rank 0
task_start_time : 5769699.826605973
dist : False
world_size : 1
rank : 0
local_rank : 0
ngpus_per_node : 1
nnodes : 1
master : True
logdir : runs/ViTADTrainer_configs_vitad_vitad_cj_20240830-104656
logger.filters : []
logger.name : root
logger.level : 20
logger.parent : None
logger.propagate : True
logger.disabled : False
logdir_train : runs/ViTADTrainer_configs_vitad_vitad_cj_20240830-104656/show_train
logdir_test : runs/ViTADTrainer_configs_vitad_vitad_cj_20240830-104656/show_test
08/30 10:46:57 AM - ==> Starting training with 1 nodes x 1 GPUs
08/30 10:47:01 AM - ==> Total time: 0:00:04 Eta: 0:07:55 Logged in 'runs/ViTADTrainer_configs_vitad_vitad_cj_20240830-104656'
08/30 10:47:05 AM - ==> Total time: 0:00:09 Eta: 0:07:26 Logged in 'runs/ViTADTrainer_configs_vitad_vitad_cj_20240830-104656'
08/30 10:47:10 AM - ==> Total time: 0:00:13 Eta: 0:07:10 Logged in 'runs/ViTADTrainer_configs_vitad_vitad_cj_20240830-104656'
08/30 10:47:12 AM - Train: 3.57% [50/1400] [3.6/100.0] [batch_t 0.053 (0.265)] [data_t 0.002] [optim_t 0.051] [lr 0.000400] [cos 0.592 (0.631)]
08/30 10:47:14 AM - ==> Total time: 0:00:17 Eta: 0:07:01 Logged in 'runs/ViTADTrainer_configs_vitad_vitad_cj_20240830-104656'
08/30 10:47:18 AM - ==> Total time: 0:00:21 Eta: 0:06:57 Logged in 'runs/ViTADTrainer_configs_vitad_vitad_cj_20240830-104656'
08/30 10:47:22 AM - ==> Total time: 0:00:26 Eta: 0:06:48 Logged in 'runs/ViTADTrainer_configs_vitad_vitad_cj_20240830-104656'
08/30 10:47:27 AM - ==> Total time: 0:00:30 Eta: 0:06:44 Logged in 'runs/ViTADTrainer_configs_vitad_vitad_cj_20240830-104656'
08/30 10:47:28 AM - Train: 7.14% [100/1400] [7.1/100.0] [batch_t 0.053 (0.530)] [data_t 0.002] [optim_t 0.050] [lr 0.000400] [cos 0.348 (0.351)]
08/30 10:47:31 AM - ==> Total time: 0:00:34 Eta: 0:06:41 Logged in 'runs/ViTADTrainer_configs_vitad_vitad_cj_20240830-104656'
08/30 10:47:35 AM - ==> Total time: 0:00:39 Eta: 0:06:36 Logged in 'runs/ViTADTrainer_configs_vitad_vitad_cj_20240830-104656'
2/3
```
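If a Python-level stack trace would help, I can grab one the next time it hangs. This is a minimal sketch of what I plan to add near the top of `run.py` (my own debugging addition, not part of ADer; `faulthandler` is standard library and the signal hook is Unix-only):

```python
# Dump every thread's Python stack to stderr on SIGUSR1, so I can see
# where the trainer is stuck while the GPU sits idle.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
# While the run is hung:  kill -USR1 <pid of the training process>
```

I can also attach the output of `py-spy dump --pid <pid>` if that is more useful, and I will retry with `trainer.data.num_workers_per_gpu` set to 0 in the config to rule out a stuck DataLoader worker.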
Thank you very much.