Yan1026 closed this issue 2 years ago
I fixed it. Comparing the code of VOC-main and Cityscape-main, I found that Cityscape-main lacks a line of code for DDP:
args.ddp = True if args.gpus > 1 else False
After adding this line, the model can be trained. Maybe your code is a test version, or maybe I made a mistake.
Glad to hear you solved it. In our experiments, I added the flag "--ddp" manually, and I missed this line when I re-organized the code.
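For anyone hitting the same problem, here is a minimal sketch of where such a line could sit. The parser below is hypothetical (the real argument definitions in main.py may differ); only the names args.gpus, args.ddp and "--ddp" come from this thread.

```python
import argparse


def parse_args(argv=None):
    # Hypothetical parser for illustration only; the real main.py may
    # define these options differently.
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpus", type=int, default=1)
    parser.add_argument("--ddp", action="store_true")
    args = parser.parse_args(argv)

    # The line reported missing in Cityscape-main: enable DDP whenever
    # more than one GPU is requested.
    args.ddp = True if args.gpus > 1 else False
    return args


if __name__ == "__main__":
    print(parse_args(["--gpus", "4"]).ddp)
```

Note that this assignment overrides whatever was passed via "--ddp", so the flag no longer has to be set by hand.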
Thanks a lot for reporting it.
Hi @yyliu01, I trained with
bash ./scripts/train_city.sh -l 372 -g 4 -b 50
but got this error:
Saving a checkpoint: saved/final_test/372_mIoU_0.6137_model_e10.pth ...
EVAL ID (Model 1) (10) | PixelAcc: 0.9311, Mean IoU: 0.6137 |
Traceback (most recent call last):
File "CityCode/main.py", line 203, in <module>
mp.spawn(main, nprocs=config['n_gpu'], args=(config['n_gpu'], config, args))
File "/home/imu_zhengyuan/.conda/envs/ps-mt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/imu_zhengyuan/.conda/envs/ps-mt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/home/imu_zhengyuan/.conda/envs/ps-mt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/imu_zhengyuan/.conda/envs/ps-mt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/ PS-MT/CityCode/main.py", line 120, in main
trainer.train()
File "/home/ PS-MT/CityCode/Base/base_trainer.py", line 171, in train
self._save_checkpoint(epoch)
File "/home/ PS-MT/CityCode/Base/base_trainer.py", line 191, in _save_checkpoint
upload_checkpoint(local_path=self.checkpoint_dir, prefix=pvc_dir, checkpoint_filepath=ckpt_name)
NameError: name 'upload_checkpoint' is not defined
Regarding CityCode/Base/base_trainer.py, lines 187-194: I found that the following code is commented out in VOC, but not in Cityscape.
Do you have any ideas? Maybe it is a test version of the code?
pvc_dir = os.path.join("yy", "exercise_1", self.args.architecture,
"resnet{}_ckpt".format(str(self.args.backbone)), "city_cvpr_final",
str(self.args.labeled_examples))
upload_checkpoint(local_path=self.checkpoint_dir, prefix=pvc_dir, checkpoint_filepath=ckpt_name)
self.logger.info("Uploading current ckpt: mIoU_{}_model.pth to {}".format(str(state['monitor_best']),
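The traceback shows the NameError is raised only after the local checkpoint has been written, so the upload step looks optional for training. Commenting the call out, as the VOC code does, is one option. Another is to guard it, roughly as in the sketch below. Here maybe_upload and its uploader parameter are hypothetical names introduced for illustration; only the keyword arguments local_path, prefix and checkpoint_filepath are taken from the call quoted above.

```python
import logging

logger = logging.getLogger(__name__)


def maybe_upload(checkpoint_dir, ckpt_name, prefix, uploader=None):
    """Upload a checkpoint only if an uploader callable was provided.

    Hypothetical helper: returns True if the upload was attempted and
    False if the checkpoint was left on local disk only.
    """
    if uploader is None:
        # No remote storage configured; the local checkpoint is kept.
        logger.info("No uploader configured; keeping %s locally", ckpt_name)
        return False
    # Same keyword arguments as the upload_checkpoint call in the thread.
    uploader(local_path=checkpoint_dir, prefix=prefix,
             checkpoint_filepath=ckpt_name)
    return True
```

With a guard like this, a run without any remote storage would simply skip the upload instead of crashing after the checkpoint is saved.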
Sorry to bother you. I trained with
bash ./scripts/train_city.sh -l 372 -g 4 -b 50
but got an error. I tried to fix it, but without effect. I want to use GPUs 5, 6, 7, and 8, because GPUs 0, 1, 2, and 3 are occupied. But when I print availble_gpus, it is still [0, 1, 2, 3]. I can train the model on VOC in the same setup. Do you have any ideas?
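One possible explanation for the printed list, offered as a guess since the thread does not show how the GPUs were selected: when processes are restricted with the CUDA_VISIBLE_DEVICES environment variable, PyTorch renumbers the visible devices starting from 0. A list such as [0, 1, 2, 3] is then expected even when four other physical GPUs are in use. The sketch below assumes an 8-GPU machine with zero-based physical indices 4 to 7; select_gpus is a hypothetical helper, and the variable must be set before CUDA is initialized in the process.

```python
import os


def select_gpus(physical_ids):
    """Restrict this process to the given physical GPU indices.

    Hypothetical helper: call it before CUDA is initialized (ideally
    before importing torch). Returns the logical indices that frameworks
    will report afterwards, which always start from 0.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in physical_ids)
    return list(range(len(physical_ids)))


if __name__ == "__main__":
    # Assumption: physical GPUs 4-7 are the free ones on an 8-GPU machine.
    print(select_gpus([4, 5, 6, 7]))
```

The same restriction can be applied from the shell by prefixing the training command with CUDA_VISIBLE_DEVICES=4,5,6,7, provided the script does not overwrite the variable itself.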