yangcong356 / BITA

This is the official code for "Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning"

With distributed set to False, why do I still get "ValueError: Default process group has not been initialized, please make sure to call init_process_group"? #2

Open MemoryOldTime opened 3 months ago

MemoryOldTime commented 3 months ago

Configuration file:

model:
  arch: bita_former
  model_type: pretrain_vitL
  load_pretrained: False  # pretrained from scratch
  freeze_vit: True

datasets:
  rsicd_caption:
    vis_processor:
      train:
        name: "bita_image_train"
        image_size: 224
    text_processor:
      train:
        name: "bita_caption"
  nwpu_caption:
    vis_processor:
      train:
        name: "bita_image_train"
        image_size: 224
    text_processor:
      train:
        name: "bita_caption"
  ucm_caption:
    vis_processor:
      train:
        name: "bita_image_train"
        image_size: 224
    text_processor:
      train:
        name: "bita_caption"

run:
  task: image_text_pretrain

  # optimizer
  lr_sched: "linear_warmup_cosine_lr"
  init_lr: 1e-4
  min_lr: 1e-5
  warmup_lr: 1e-6

  weight_decay: 0.05
  max_epoch: 5
  batch_size_train: 96
  batch_size_eval: 64
  num_workers: 0
  warmup_steps: 5000

  seed: 42
  output_dir: "output/"

  amp: True
  resume_ckpt_path: null

  evaluate: False
  train_splits: ["train"]

  device: "cuda:0"
  world_size: 1
  dist_url: "env://"
  distributed: False

yangcong356 commented 3 months ago

If you want to run a single card, take pretrain_stage1 as an example:

  1. Modify pretrain_stage1.sh to "CUDA_VISIBLE_DEVICES=0 python train.py --cfg-path /your_config_setting_path".
  2. Remove 'world_size: 1' from pretrain_stage1.yaml, change 'distributed' to False, and keep the rest unchanged.

For the reason behind this, please refer to lines 60-72 in /common/dist_utils.py.

def init_distributed_mode(args):
    if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
        args.rank = int(os.environ["RANK"])
        args.world_size = int(os.environ["WORLD_SIZE"])
        args.gpu = int(os.environ["LOCAL_RANK"])
    elif "SLURM_PROCID" in os.environ:
        args.rank = int(os.environ["SLURM_PROCID"])
        args.gpu = args.rank % torch.cuda.device_count()
    else:
        print("Not using distributed mode")
        args.distributed = False
        return

    args.distributed = True
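
For illustration, an equivalent single-GPU launcher written in Python might look like the sketch below (the config path is a placeholder and this wrapper is not part of the repository); it does not rely on the shell's inline VAR=value prefix:

import os
import subprocess
import sys

# Expose only GPU 0 to the training process, mirroring
# "CUDA_VISIBLE_DEVICES=0 python train.py --cfg-path ...".
env = os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = "0"

# Placeholder path; point this at your own pretrain_stage1.yaml.
cfg_path = "path/to/pretrain_stage1.yaml"

subprocess.run(
    [sys.executable, "train.py", "--cfg-path", cfg_path],
    env=env,
    check=True,
)
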
MemoryOldTime commented 3 months ago

But I'm using a Windows system, so there's no way to use the command prefix CUDA_VISIBLE_DEVICES=0.
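
The closest workaround I can think of is setting the variable from Python itself, before torch is imported; a minimal sketch, assuming nothing has initialized CUDA yet:

# Must run before "import torch" and before anything creates a CUDA context.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only GPU 0

import torch

print(torch.cuda.device_count())  # expected: 1 when at least one GPU is present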

MemoryOldTime commented 3 months ago

The error is described as follows:

Traceback (most recent call last):
  File "F:\Project\BITA\BITA\train.py", line 89, in <module>
    main()
  File "F:\Project\BITA\BITA\train.py", line 85, in main
    runner.train()
  File "F:\Project\BITA\BITA\BITA\runners\runner_base.py", line 366, in train
    train_stats = self.train_epoch(cur_epoch)
  File "F:\Project\BITA\BITA\BITA\runners\runner_base.py", line 425, in train_epoch
    return self.task.train_epoch(
  File "F:\Project\BITA\BITA\BITA\tasks\base_task.py", line 103, in train_epoch
    return self._train_inner_loop(
  File "F:\Project\BITA\BITA\BITA\tasks\base_task.py", line 208, in _train_inner_loop
    loss = self.train_step(model=model, samples=samples)
  File "F:\Project\BITA\BITA\BITA\tasks\base_task.py", line 57, in train_step
    loss = model(samples)["loss"]
  File "C:\Users\XD.conda\envs\bita\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\XD.conda\envs\bita\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "F:\Project\BITA\BITA\BITA\models\bita\bita_ift.py", line 145, in forward
    rank = dist.get_rank()
  File "C:\Users\XD.conda\envs\bita\lib\site-packages\torch\distributed\distributed_c10d.py", line 1746, in get_rank
    default_pg = _get_default_group()
  File "C:\Users\XD.conda\envs\bita\lib\site-packages\torch\distributed\distributed_c10d.py", line 1008, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.

yangcong356 commented 3 months ago

@MemoryOldTime It is recommended to run the code on Linux. If you want to run it on Windows, it is advisable to learn how to use PyTorch's distributed communication package (torch.distributed).
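
For reference, one possible single-process workaround (a sketch, not part of BITA) is to initialize a trivial process group at startup, so that calls such as dist.get_rank() in bita_ift.py succeed; the gloo backend also runs on Windows:

import os
import torch.distributed as dist

# Placeholder rendezvous settings for a localhost, single-process group.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

if not dist.is_initialized():
    # Rank 0 of world size 1, using the CPU-friendly gloo backend.
    dist.init_process_group(backend="gloo", rank=0, world_size=1)

print(dist.get_rank())        # -> 0
print(dist.get_world_size())  # -> 1

Whether this plays well with the rest of the training pipeline still needs to be verified; running under Linux with the provided scripts remains the supported path.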