I am running training with the visit datasets on a supercomputer GPU node. The code is unable to proceed past rank 0:
/people/$USER/.conda/envs/arldm/lib/python3.8/site-packages/pl_bolts/models/self_supervised/amdim/amdim_module.py:34: UnderReviewWarning: The feature generate_power_seq is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
"lr_options": generate_power_seq(LEARNING_RATE_CIFAR, 11),
/people/$USER//.conda/envs/arldm/lib/python3.8/site-packages/pl_bolts/models/self_supervised/amdim/amdim_module.py:92: UnderReviewWarning: The feature FeatureMapContrastiveTask is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
contrastive_task: Union[FeatureMapContrastiveTask] = FeatureMapContrastiveTask("01, 02, 11"),
/people/$USER//.conda/envs/arldm/lib/python3.8/site-packages/pl_bolts/losses/self_supervised_learning.py:228: UnderReviewWarning: The feature AmdimNCELoss is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
self.nce_loss = AmdimNCELoss(tclip)
[rank: 0] Global seed set to 0
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[rank: 0] Global seed set to 0
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[W socket.cpp:426] [c10d] The server socket cannot be initialized on [::]:19369 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [dlt04.local]:19369 (errno: 97 - Address family not supported by protocol).
I noticed that the error may be related to this DDP issue. I've tried all the solutions mentioned in that thread, but none of them solves my problem.
So my question is: are there any other ways to run the code that bypass this issue?
Ideally I'd like to run this program on GPUs. CPU is also fine, but is there any adjustment I can make to reduce the computation time?
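In case it helps with diagnosis: errno 97 on Linux is EAFNOSUPPORT, and `[::]` in the log is the IPv6 wildcard address, so the two c10d warnings suggest the node cannot open IPv6 sockets. Whether that is what actually stalls the run is not confirmed. Below is a minimal diagnostic sketch, assuming only the Python standard library; the helper names `ipv6_supported` and `ipv4_addresses` are hypothetical and not part of PyTorch or Lightning. It checks whether the node can bind an IPv6 socket and lists the IPv4 addresses its hostname resolves to.

```python
import socket


def ipv6_supported():
    """Return True if this host can create and bind an IPv6 TCP socket."""
    if not socket.has_ipv6:
        return False
    try:
        with socket.socket(socket.AF_INET6, socket.SOCK_STREAM) as s:
            s.bind(("::1", 0))  # port 0 lets the OS pick any free port
        return True
    except OSError:
        # Includes errno 97 (EAFNOSUPPORT) when IPv6 is disabled on the node
        return False


def ipv4_addresses(host):
    """Return the sorted, de-duplicated IPv4 addresses that `host` resolves to."""
    try:
        infos = socket.getaddrinfo(host, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    host = socket.gethostname()
    print("IPv6 usable:", ipv6_supported())
    print("IPv4 addresses for", host, ":", ipv4_addresses(host))
```

Running this on the same compute node as the job (for example inside the sbatch script) shows whether IPv6 is unavailable there and which IPv4 address the rendezvous host could use instead.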
My config: https://github.com/candiceT233/ARLDM/blob/main/config.yaml
My conda environment packages:
Others:
Sbatch command: