yangcong356 / BITA

This is the official code for "Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning"

With distributed set to False, why do I still get "ValueError: Default process group has not been initialized, please make sure to call init_process_group"? #2

Open MemoryOldTime opened 3 months ago

MemoryOldTime commented 3 months ago

Configuration file:

model:
  arch: bita_former
  model_type: pretrain_vitL
  load_pretrained: False  # pretrained from scratch
  freeze_vit: True

datasets:
  rsicd_caption:
    vis_processor:
      train:
        name: "bita_image_train"
        image_size: 224
    text_processor:
      train:
        name: "bita_caption"
  nwpu_caption:
    vis_processor:
      train:
        name: "bita_image_train"
        image_size: 224
    text_processor:
      train:
        name: "bita_caption"
  ucm_caption:
    vis_processor:
      train:
        name: "bita_image_train"
        image_size: 224
    text_processor:
      train:
        name: "bita_caption"

run:
  task: image_text_pretrain

  # optimizer
  lr_sched: "linear_warmup_cosine_lr"
  init_lr: 1e-4
  min_lr: 1e-5
  warmup_lr: 1e-6

  weight_decay: 0.05
  max_epoch: 5
  batch_size_train: 96
  batch_size_eval: 64
  num_workers: 0
  warmup_steps: 5000

  seed: 42
  output_dir: "output/"

  amp: True
  resume_ckpt_path: null

  evaluate: False
  train_splits: ["train"]

  device: "cuda:0"
  world_size: 1
  dist_url: "env://"
  distributed: False

yangcong356 commented 3 months ago

If you want to run a single card, take pretrain_stage1 as an example:

  1. Modify pretrain_stage1.sh to "CUDA_VISIBLE_DEVICES=0 python train.py --cfg-path /your_config_setting_path".
  2. Remove 'world_size: 1' from pretrain_stage1.yaml, change 'distributed' to False, and keep the rest unchanged.

For the reason behind this, please refer to lines 60-72 in /common/dist_utils.py.

def init_distributed_mode(args):
    if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
        args.rank = int(os.environ["RANK"])
        args.world_size = int(os.environ["WORLD_SIZE"])
        args.gpu = int(os.environ["LOCAL_RANK"])
    elif "SLURM_PROCID" in os.environ:
        args.rank = int(os.environ["SLURM_PROCID"])
        args.gpu = args.rank % torch.cuda.device_count()
    else:
        print("Not using distributed mode")
        args.distributed = False
        return

    args.distributed = True
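
For illustration, an equivalent single-GPU launcher written in Python might look like the sketch below (the config path is a placeholder and this wrapper is not part of the repository); it does not rely on the shell's inline VAR=value prefix:

import os
import subprocess
import sys

# Expose only GPU 0 to the training process, mirroring
# "CUDA_VISIBLE_DEVICES=0 python train.py --cfg-path ...".
env = os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = "0"

# Placeholder path; point this at your own pretrain_stage1.yaml.
cfg_path = "path/to/pretrain_stage1.yaml"

subprocess.run(
    [sys.executable, "train.py", "--cfg-path", cfg_path],
    env=env,
    check=True,
)
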
MemoryOldTime commented 3 months ago

But I'm using a Windows system, so there's no way to use the command prefix CUDA_VISIBLE_DEVICES=0.
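
The closest workaround I can think of is setting the variable from Python itself, before torch is imported; a minimal sketch, assuming nothing has initialized CUDA yet:

# Must run before "import torch" and before anything creates a CUDA context.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only GPU 0

import torch

print(torch.cuda.device_count())  # expected: 1 when at least one GPU is present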

MemoryOldTime commented 3 months ago

The error is described as follows:

Traceback (most recent call last):
  File "F:\Project\BITA\BITA\train.py", line 89, in <module>
    main()
  File "F:\Project\BITA\BITA\train.py", line 85, in main
    runner.train()
  File "F:\Project\BITA\BITA\BITA\runners\runner_base.py", line 366, in train
    train_stats = self.train_epoch(cur_epoch)
  File "F:\Project\BITA\BITA\BITA\runners\runner_base.py", line 425, in train_epoch
    return self.task.train_epoch(
  File "F:\Project\BITA\BITA\BITA\tasks\base_task.py", line 103, in train_epoch
    return self._train_inner_loop(
  File "F:\Project\BITA\BITA\BITA\tasks\base_task.py", line 208, in _train_inner_loop
    loss = self.train_step(model=model, samples=samples)
  File "F:\Project\BITA\BITA\BITA\tasks\base_task.py", line 57, in train_step
    loss = model(samples)["loss"]
  File "C:\Users\XD.conda\envs\bita\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\XD.conda\envs\bita\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "F:\Project\BITA\BITA\BITA\models\bita\bita_ift.py", line 145, in forward
    rank = dist.get_rank()
  File "C:\Users\XD.conda\envs\bita\lib\site-packages\torch\distributed\distributed_c10d.py", line 1746, in get_rank
    default_pg = _get_default_group()
  File "C:\Users\XD.conda\envs\bita\lib\site-packages\torch\distributed\distributed_c10d.py", line 1008, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.

yangcong356 commented 3 months ago

@MemoryOldTime It is recommended to run the code on Linux. If you want to run it on Windows, it is advisable to learn how to use PyTorch's distributed communication package (torch.distributed).
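
For reference, one possible single-process workaround (a sketch, not part of BITA) is to initialize a trivial process group at startup, so that calls such as dist.get_rank() in bita_ift.py succeed; the gloo backend also runs on Windows:

import os
import torch.distributed as dist

# Placeholder rendezvous settings for a localhost, single-process group.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

if not dist.is_initialized():
    # Rank 0 of world size 1, using the CPU-friendly gloo backend.
    dist.init_process_group(backend="gloo", rank=0, world_size=1)

print(dist.get_rank())        # -> 0
print(dist.get_world_size())  # -> 1

Whether this plays well with the rest of the training pipeline still needs to be verified; running under Linux with the provided scripts remains the supported path.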