LPXTT closed this issue 2 years ago
Is there anything I should pay attention to? My result is only about 70 (mIoU).
Could you please share your hardware settings here? It would also be nice if you could share your wandb page, as I did here.
Also, I'm really confused about how you got 0 for loss_unsup...
I used four V100 GPUs with 32 GB memory to train the model. I haven't used wandb before, and I get errors when I try to sync files or log in. I saw that the code created a wandb directory containing some log files. Can I send them to you if they are useful?
[PS-MT][WARNING] Training start, 80 epochs in total
[PS-MT][CRITICAL] DGX: Off with [4 x v100]
[PS-MT][CRITICAL] GPUs: 4
[PS-MT][CRITICAL] Network Architecture: deeplabv3+, with ResNet 50 backbone
[PS-MT][CRITICAL] Current Labeled Example: 1323
[PS-MT][CRITICAL] Learning rate: other 0.01, and head is the SAME [world]
[PS-MT][INFO] Image: 512x512 based on 600x600
[PS-MT][INFO] Current batch: 64 [world]
[PS-MT][INFO] Current unsupervised loss function: semi_ce, with weight 0.0 and length 12
[PS-MT][INFO] Current config+args:
{'name': 'PS-MT(VOC12)', 'experim_name': 'r50', 'n_labeled_examples': 1323, 'ramp_up': 12, 'unsupervised_w': 0.0, 'ignore_index': 255, 'lr_scheduler': 'Poly', 'use_weak_lables': False, 'weakly_loss_w': 0.4, 'model': {'supervised': False, 'semi': True, 'resnet': 50, 'sup_loss': 'CE', 'un_loss': 'semi_ce', 'warm_up_epoch': 5}, 'optimizer': {'type': 'SGD', 'args': {'lr': 0.01, 'weight_decay': 0.0001, 'momentum': 0.9}}, 'train_supervised': {'data_dir': '/mnt/efs/lpx/research/dataset/pascalVOC12/', 'batch_size': 8, 'crop_size': 512, 'shuffle': True, 'base_size': 600, 'scale': True, 'augment': True, 'flip': True, 'rotate': False, 'split': 'train_supervised', 'num_workers': 8}, 'train_unsupervised': {'data_dir': '/mnt/efs/lpx/research/dataset/pascalVOC12/', 'weak_labels_output': 'pseudo_labels/result/pseudo_labels', 'batch_size': 8, 'crop_size': 512, 'shuffle': True, 'base_size': 600, 'scale': True, 'augment': True, 'flip': True, 'rotate': False, 'split': 'train_unsupervised', 'num_workers': 8}, 'val_loader': {'data_dir': '/mnt/efs/lpx/research/dataset/pascalVOC12/', 'batch_size': 1, 'val': True, 'split': 'val', 'shuffle': False, 'num_workers': 4}, 'trainer': {'epochs': 80, 'save_dir': 'saved/', 'save_period': 1, 'log_dir': 'saved/', 'log_per_iter': 20, 'val': True, 'val_per_epochs': 1, 'gamma': 0.5, 'sharp_temp': 0.5}, 'n_gpu': 4, 'nodes': 1, 'batch_size': 8, 'epochs': 80, 'warm_up': 5, 'labeled_examples': 1323, 'learning_rate': 0.0025, 'gpus': 4, 'gcloud': 0, 'local_rank': 0, 'architecture': 'deeplabv3+', 'backbone': 50, 'ddp': True, 'dgx': False, 'semi_p_th': 0.6, 'semi_n_th': 0.0, 'unsup_weight': 0.0, 'world_size': 4}
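A silent misconfiguration like this (semi-supervised training enabled, but the unsupervised weight set to 0) is easy to catch with a small sanity check on the config before training starts. A minimal sketch; the key names follow the config dump above, and `check_semi_config` is a hypothetical helper, not part of the PS-MT codebase:

```python
def check_semi_config(cfg: dict) -> list:
    """Return warnings for settings that silently disable semi-supervised training."""
    warnings = []
    # With unsupervised_w == 0, loss_unsup is scaled by 0 and never contributes.
    if cfg.get("model", {}).get("semi") and cfg.get("unsupervised_w", 0) == 0:
        warnings.append("model.semi is True but unsupervised_w is 0: "
                        "loss_unsup will always be 0")
    if cfg.get("ramp_up", 0) <= 0:
        warnings.append("ramp_up should be a positive number of epochs")
    return warnings

# The values from the config dump above trigger the warning:
cfg = {"unsupervised_w": 0.0, "ramp_up": 12, "model": {"semi": True}}
print(check_semi_config(cfg))
```

Running the check at the top of the training script would have flagged this run before 80 epochs were spent on what is effectively supervised-only training.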
Load model, Time usage:
IO: 0.06483221054077148, initialize parameters: 1.532975673675537
Load model, Time usage:
IO: 0.0729672908782959, initialize parameters: 1.567183494567871
Load model, Time usage:
IO: 0.07271456718444824, initialize parameters: 1.5179588794708252
Load model, Time usage:
IO: 0.06479263305664062, initialize parameters: 1.520132064819336
wandb: W&B offline. Running your script from this directory will only write metadata locally.
wandb: Tracking run with wandb version 0.12.21
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
[PS-MT][CRITICAL] distributed data parallel training: on
ID 1 Warm (0) | Ls 2.01 |: 100%|███████████████████████████████████████████████████████████████████████| 41/41 [00:25<00:00, 1.61it/s]
ID 2 Warm (0) | Ls 1.93 |: 100%|███████████████████████████████████████████████████████████████████████| 41/41 [00:18<00:00, 2.24it/s]
ID 3 Warm (0) | Ls 1.36 |: 100%|███████████████████████████████████████████████████████████████████████| 41/41 [00:20<00:00, 2.00it/s]
ID 1 Warm (1) | Ls 1.53 |: 100%|███████████████████████████████████████████████████████████████████████| 41/41 [00:23<00:00, 1.74it/s]
ID 2 Warm (1) | Ls 1.66 |: 100%|███████████████████████████████████████████████████████████████████████| 41/41 [00:19<00:00, 2.13it/s]
ID 3 Warm (1) | Ls 0.87 |: 100%|███████████████████████████████████████████████████████████████████████| 41/41 [00:23<00:00, 1.74it/s]
ID 1 Warm (2) | Ls 1.56 |: 100%|███████████████████████████████████████████████████████████████████████| 41/41 [00:19<00:00, 2.10it/s]
ID 2 Warm (2) | Ls 1.62 |: 100%|███████████████████████████████████████████████████████████████████████| 41/41 [00:21<00:00, 1.90it/s]
ID 3 Warm (2) | Ls 0.74 |: 100%|███████████████████████████████████████████████████████████████████████| 41/41 [00:21<00:00, 1.91it/s]
ID 1 Warm (3) | Ls 1.56 |: 100%|███████████████████████████████████████████████████████████████████████| 41/41 [00:19<00:00, 2.05it/s]
ID 2 Warm (3) | Ls 1.49 |: 100%|███████████████████████████████████████████████████████████████████████| 41/41 [00:21<00:00, 1.90it/s]
ID 3 Warm (3) | Ls 0.56 |: 100%|███████████████████████████████████████████████████████████████████████| 41/41 [00:19<00:00, 2.08it/s]
ID 1 Warm (4) | Ls 1.46 |: 100%|███████████████████████████████████████████████████████████████████████| 41/41 [00:22<00:00, 1.86it/s]
ID 2 Warm (4) | Ls 1.49 |: 100%|███████████████████████████████████████████████████████████████████████| 41/41 [00:25<00:00, 1.62it/s]
ID 3 Warm (4) | Ls 0.58 |: 100%|███████████████████████████████████████████████████████████████████████| 41/41 [00:18<00:00, 2.24it/s]
ID 1 T (1) | Ls 0.255 Lu 0.000 Lw 0.000 m1 0.724 m2 0.031|: 100%|████████████████████████████████████| 289/289 [14:03<00:00, 2.92s/it]
[PS-MT][INFO] evaluating ...
EVAL ID (Teachers) (1) | Loss: 1.3726, PixelAcc: 0.7327, Mean IoU: 0.0350 |: 100%|████████████| 1449/1449 [03:49<00:00, 6.33it/s]
I may have found the problem after reading the log file: the weight of the unsupervised loss is 0.
No worries at all! Please set it back to the default of 1.5 and re-run the approach.
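For context on why loss_unsup stays at exactly 0: in mean-teacher-style methods the unsupervised loss is typically scaled by a ramp-up factor times a final weight, commonly the exponential sigmoid ramp-up of Laine and Aila (whether PS-MT uses this exact function is an assumption here; the log above only shows "weight 0.0 and length 12"). Since the final weight multiplies the whole ramp, setting it to 0 zeroes the term at every epoch:

```python
import math

def unsup_weight(epoch: int, final_w: float, ramp_up: int) -> float:
    """Sigmoid-shaped ramp-up from 0 to final_w over `ramp_up` epochs."""
    if epoch >= ramp_up:
        return final_w
    t = max(0.0, float(epoch)) / ramp_up
    return final_w * math.exp(-5.0 * (1.0 - t) ** 2)

# With unsupervised_w = 0.0 (as in the log above), every epoch gives weight 0:
print([round(unsup_weight(e, 0.0, 12), 4) for e in (0, 6, 12)])  # -> [0.0, 0.0, 0.0]
# With the default 1.5 the weight actually ramps up to 1.5:
print([round(unsup_weight(e, 1.5, 12), 4) for e in (0, 6, 12)])  # -> [0.0101, 0.4298, 1.5]
```

This also explains why mIoU_labeled can look very high (0.932 in the run summary) while the validation mIoU lags: the unlabeled branch never contributed any gradient.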
May I also ask whether it was my script that caused the bug?
Thank you for your help! Got it! Sorry about that; I forgot that I had changed the loss weight.
You are welcome. Please re-open the issue if your result does not reach the reported performance.
Hi, I got some models and results by running `./scripts/train_voc_aug.sh -l 1323 -g 4 -b 50`. How can I get the testing results on Pascal VOC? Is valid_Mean_IoU (0.7005) the same as the testing result?

Run summary:
global_step 23119
learning_rate_0 1e-05
learning_rate_1 1e-05
loss_sup 0.05151
loss_unsup 0.0
mIoU_labeled 0.932
mIoU_unlabeled 0.619
pixel_acc_labeled 0.98
pixel_acc_unlabeled 0.886
ramp_up 1.0
valid_Mean_IoU 0.7005
valid_Pixel_Accuracy 0.9316
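On the val-vs-test question: valid_Mean_IoU is computed on the VOC12 val split. The labels of the official VOC test split are withheld, so test numbers can only be obtained by submitting predictions to the PASCAL VOC evaluation server; most semi-supervised segmentation papers report val-split mIoU. For reference, mean IoU as it is typically computed from a confusion matrix (a generic sketch, not PS-MT's exact evaluation code):

```python
import numpy as np

def mean_iou(conf: np.ndarray) -> float:
    """Mean IoU over classes from a (num_classes x num_classes) confusion matrix.

    conf[i, j] = number of pixels with ground-truth class i predicted as class j.
    """
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp  # predicted as class c, ground truth differs
    fn = conf.sum(axis=1) - tp  # ground truth class c, predicted otherwise
    denom = tp + fp + fn
    # Classes absent from both prediction and ground truth are excluded.
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return float(np.nanmean(iou))

# Toy 2-class example: perfect predictions give mIoU = 1.0
print(mean_iou(np.array([[5, 0], [0, 3]])))  # -> 1.0
```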