Error code 12 (Cannot allocate memory)

joeyclancy commented 7 months ago

Hello Ziye: I tried to run the latest version of COMEBin with CPU/GPU, and the training step failed with the following error:

  100%|██████████| 1665/1665 [10:54<00:00,  2.54it/s]
  100%|██████████| 1665/1665 [10:55<00:00,  2.54it/s]
  100%|██████████| 1665/1665 [10:45<00:00,  2.58it/s]
  100%|██████████| 1665/1665 [11:01<00:00,  2.52it/s]
   66%|██████▌   | 1096/1665 [07:14<03:45,  2.52it/s]
  Traceback (most recent call last):
    File "main.py", line 369, in <module>
      main()
    File "main.py", line 289, in main
      train_CLmodel(logger,args)
    File "/public/home/hymeta/anaconda3/envs/comebin/bin/COMEBin/train_CLmodel.py", line 236, in train_CLmodel
      simclr.train_addpretrain(train_loader, dataset, namelist)
    File "/public/home/hymeta/anaconda3/envs/comebin/bin/COMEBin/simclr.py", line 283, in train_addpretrain
      logits, labels = self.info_nce_loss(features)
    File "/public/home/hymeta/anaconda3/envs/comebin/bin/COMEBin/simclr.py", line 42, in info_nce_loss
      labels = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
  RuntimeError: [enforce fail at CPUAllocator.cpp:68] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 150994944 bytes. Error code 12 (Cannot allocate memory)
  Something went wrong with running training network. Exiting.

By the way, our HPC node has 512 GB of memory (CPU) and 64 GB (GPU), so I am confused. Thanks in advance.

Best wishes, Joey
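
For scale, the size in the error message can be worked out directly. The sketch below assumes that `.float()` produces 4-byte float32 values (the PyTorch default) and that the failing tensor is the square label mask built on line 42 of simclr.py. Under those assumptions the request is only 144 MiB, far below the 512 GB reported for the node.

```python
import math

# Size of the failed allocation, copied from the RuntimeError in the log.
requested_bytes = 150_994_944

# Assumption: .float() yields torch.float32, i.e. 4 bytes per element.
BYTES_PER_FLOAT32 = 4

mib = requested_bytes / 2**20                    # bytes -> MiB
elements = requested_bytes // BYTES_PER_FLOAT32  # number of float32 values
side = math.isqrt(elements)                      # side of a square mask

print(f"requested: {mib:.0f} MiB")
print(f"elements:  {elements}")
print(f"square mask side: {side} (exact: {side * side == elements})")
```

A side of 6144 would match a SimCLR-style loss in which the label vector has one entry per augmented view in the batch; that reading is an assumption and is not confirmed in the thread.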

ziyewang commented 7 months ago

Hi,

Could you please rerun it? Could it be that other processes on the node are consuming memory, affecting the execution of COMEBin?

Best, Ziye
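
One way to check whether other processes are using the node's memory is to read `/proc/meminfo` just before launching COMEBin. This is a minimal, Linux-only sketch that uses the standard library; the helper names `parse_meminfo` and `available_gib` are hypothetical and are not part of COMEBin.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Values are in KiB for the fields that carry a "kB" unit.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def available_gib(info):
    """Return MemAvailable in GiB, or None if the kernel does not report it."""
    kib = info.get("MemAvailable")
    return None if kib is None else kib / 2**20


if __name__ == "__main__":
    path = "/proc/meminfo"  # exists on Linux only
    if os.path.exists(path):
        with open(path) as fh:
            info = parse_meminfo(fh.read())
        print("MemTotal (GiB):    ", info.get("MemTotal", 0) / 2**20)
        print("MemAvailable (GiB):", available_gib(info))
```

If `MemAvailable` is close to `MemTotal` when the job starts, contention from other processes is an unlikely explanation for a failed 144 MiB request.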

joeyclancy commented 7 months ago

Hi Ziye: Thanks for your advice. I had already submitted four jobs to different nodes, and COMEBin was the only program running on those nodes. However, all four jobs failed with 'Error code 12'.
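
Since the failure repeats on otherwise idle nodes, one thing worth ruling out is a per-process memory limit inherited by the job. The sketch below prints the limits visible to the Python process through the standard `resource` module (Unix only). The helper names are hypothetical, and limits enforced by a batch scheduler through cgroups would not appear here, so this is a partial check rather than a diagnosis.

```python
import resource


def describe_limit(value):
    """Format an rlimit value as 'unlimited' or a size in GiB."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value / 2**30:.1f} GiB"


def memory_limits():
    """Return {limit name: (soft, hard)} for memory-related rlimits."""
    limits = {}
    for name in ("RLIMIT_AS", "RLIMIT_DATA", "RLIMIT_RSS"):
        if hasattr(resource, name):  # not every platform defines all three
            soft, hard = resource.getrlimit(getattr(resource, name))
            limits[name] = (describe_limit(soft), describe_limit(hard))
    return limits


if __name__ == "__main__":
    for name, (soft, hard) in memory_limits().items():
        print(f"{name}: soft={soft}, hard={hard}")
```

Running this inside the same job script that launches COMEBin shows the limits the training process would inherit; a small `RLIMIT_AS` value there would explain an allocation failure on a node with plenty of free memory.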
