MemoryOldTime opened 3 months ago
If you want to run a single card, take pretrain_stage1 as an example:

- Modify pretrain_stage1.sh to "CUDA_VISIBLE_DEVICES=0 python train.py --cfg-path /your_config_setting_path".
- Remove 'world_size: 1' from pretrain_stage1.yaml, change 'distributed' to False, and keep the rest unchanged.

For the reason behind this, please refer to lines 60-72 in /common/dist_utils.py.

```python
def init_distributed_mode(args):
    if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
        args.rank = int(os.environ["RANK"])
        args.world_size = int(os.environ["WORLD_SIZE"])
        args.gpu = int(os.environ["LOCAL_RANK"])
    elif "SLURM_PROCID" in os.environ:
        args.rank = int(os.environ["SLURM_PROCID"])
        args.gpu = args.rank % torch.cuda.device_count()
    else:
        print("Not using distributed mode")
        args.distributed = False
        return

    args.distributed = True
```
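For reference, a minimal standalone sketch (my own illustration, not code from the repo) of why a plain `python train.py` launch ends up in the non-distributed branch: torchrun/`torch.distributed.launch` export `RANK` and `WORLD_SIZE`, and SLURM exports `SLURM_PROCID`, so when none of these are present the function prints "Not using distributed mode" and sets `args.distributed = False`.

```python
import os
from types import SimpleNamespace

def uses_distributed_launcher() -> bool:
    # Hypothetical helper mirroring the check above (not part of BITA):
    # torchrun / torch.distributed.launch set RANK and WORLD_SIZE,
    # and SLURM jobs set SLURM_PROCID. A plain `python train.py` sets none.
    return ("RANK" in os.environ and "WORLD_SIZE" in os.environ) or (
        "SLURM_PROCID" in os.environ
    )

args = SimpleNamespace()
if uses_distributed_launcher():
    args.distributed = True
else:
    # Same outcome as the else-branch of init_distributed_mode()
    print("Not using distributed mode")
    args.distributed = False

print("args.distributed =", args.distributed)
```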
But I'm using a Windows system, and there's no way to use the `CUDA_VISIBLE_DEVICES=0` command prefix there.
The error is described as follows:
```
Traceback (most recent call last):
  File "F:\Project\BITA\BITA\train.py", line 89, in
```
@MemoryOldTime It is recommended to run the code on Linux. If you want to run it on Windows, it is advisable to learn how to use PyTorch's Distributed communication package.
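For anyone hitting the same problem: a minimal workaround sketch (my own suggestion, not from the maintainers) for selecting a single GPU on Windows, where the Linux-style `VAR=value command` prefix is not available. `CUDA_VISIBLE_DEVICES` is just an environment variable, so it can be set in the shell (`set CUDA_VISIBLE_DEVICES=0` in cmd, `$env:CUDA_VISIBLE_DEVICES="0"` in PowerShell) or from Python before CUDA is initialised:

```python
# Workaround sketch (assumption: setting the variable early enough, i.e. before
# torch/CUDA is initialised, has the same effect as the Linux command prefix).
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # set before `import torch` to be safe

import torch
print(torch.cuda.is_available())   # True if a CUDA build of PyTorch is installed
print(torch.cuda.device_count())   # should report only the one visible device
```

These lines could go at the very top of train.py (or a small wrapper script) so they take effect before any CUDA call.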
Config file:

```yaml
model:
  arch: bita_former
  model_type: pretrain_vitL
  load_pretrained: False  # pretrained from scratch
  freeze_vit: True

datasets:
  rsicd_caption:
    vis_processor:
      train:
        name: "bita_image_train"
        image_size: 224
    text_processor:
      train:
        name: "bita_caption"
  nwpu_caption:
    vis_processor:
      train:
        name: "bita_image_train"
        image_size: 224
    text_processor:
      train:
        name: "bita_caption"
  ucm_caption:
    vis_processor:
      train:
        name: "bita_image_train"
        image_size: 224
    text_processor:
      train:
        name: "bita_caption"

run:
  task: image_text_pretrain

  # optimizer
  lr_sched: "linear_warmup_cosine_lr"
  init_lr: 1e-4
  min_lr: 1e-5
  warmup_lr: 1e-6
  weight_decay: 0.05
  max_epoch: 5
  batch_size_train: 96
  batch_size_eval: 64
  num_workers: 0
  warmup_steps: 5000
  seed: 42
  output_dir: "output/"

  amp: True
  resume_ckpt_path: null
  evaluate: False
  train_splits: ["train"]

  device: "cuda:0"
  world_size: 1
  dist_url: "env://"
  distributed: False
```