xichenpan / ARLDM

Official Pytorch Implementation of Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models
https://arxiv.org/abs/2211.10950
MIT License

Training Cannot Start #24

Closed candiceT233 closed 1 year ago

candiceT233 commented 1 year ago

I am running training with the VIST dataset on a supercomputer GPU node. The code cannot proceed past rank 0:

/people/$USER/.conda/envs/arldm/lib/python3.8/site-packages/pl_bolts/models/self_supervised/amdim/amdim_module.py:34: UnderReviewWarning: The feature generate_power_seq is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
  "lr_options": generate_power_seq(LEARNING_RATE_CIFAR, 11),
/people/$USER//.conda/envs/arldm/lib/python3.8/site-packages/pl_bolts/models/self_supervised/amdim/amdim_module.py:92: UnderReviewWarning: The feature FeatureMapContrastiveTask is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
  contrastive_task: Union[FeatureMapContrastiveTask] = FeatureMapContrastiveTask("01, 02, 11"),
/people/$USER//.conda/envs/arldm/lib/python3.8/site-packages/pl_bolts/losses/self_supervised_learning.py:228: UnderReviewWarning: The feature AmdimNCELoss is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
  self.nce_loss = AmdimNCELoss(tclip)
[rank: 0] Global seed set to 0
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[rank: 0] Global seed set to 0
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[W socket.cpp:426] [c10d] The server socket cannot be initialized on [::]:19369 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [dlt04.local]:19369 (errno: 97 - Address family not supported by protocol).

I noticed that the error is potentially related to this DDP issue, and I've tried all the solutions mentioned in that thread, but none of them solve my problem.

So my question is: is there another way to run the code that bypasses this issue? Ideally I'd like to run on GPUs. CPU would also be fine, but is there any adjustment I can make to reduce the computation time?
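For reference, the workarounds from that thread were along these lines (a sketch only; the getent-based lookup is my assumption about how to resolve an IPv4 address on this cluster, and 19369 is the port from the log above):

# Force the c10d rendezvous onto an explicit IPv4 address, since
# errno 97 ("Address family not supported by protocol") points at a
# failing IPv6 wildcard bind on [::].
# Assumption: `getent ahostsv4` resolves this node's IPv4 address.
export MASTER_ADDR=$(getent ahostsv4 "$(hostname)" | awk 'NR==1 {print $1}')
export MASTER_PORT=19369   # port taken from the log above
python main.py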


My config: https://github.com/candiceT233/ARLDM/blob/main/config.yaml

My conda environment packages:

accelerate==0.20.3
diffusers==0.7.2
ftfy==6.1.1
hydra-core==1.3.2
lightning-bolts==0.7.0
pytorch-lightning==1.9.5
timm==0.5.4
torch==2.0.1
torchaudio==2.0.2
torchmetrics==1.1.0
torchvision==0.15.2
transformers==4.24.0

Others:

Python 3.8.17
CentOS 7
3.10.0-1127.18.2.el7.x86_64
GPU: 8 × RTX 2080 Ti, 384 GB memory

Sbatch command:

python main.py &> "$SCRIPT_DIR/$JOB_NAME.log"
candiceT233 commented 1 year ago

Solved by using the correct srun command:

srun --ntasks=$SLURM_NTASKS python main.py
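For anyone hitting the same thing: sbatch on its own allocates resources but starts only one task, while srun actually launches $SLURM_NTASKS copies of main.py, which PyTorch Lightning's SLURM detection expects when it sets up the DDP ranks. A sketch of the full batch script (the node/GPU counts below are illustrative, matching the 2-member world size in my log, not a verified recipe):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=2    # one task per GPU; should match the device count in config.yaml
#SBATCH --gres=gpu:2

# Launching through srun gives Lightning one SLURM task per GPU, so each
# rank is spawned by SLURM instead of every process trying to bind the
# rendezvous socket itself.
srun --ntasks=$SLURM_NTASKS python main.py &> "$SCRIPT_DIR/$JOB_NAME.log"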