Closed: pnjha closed this issue 4 years ago.
It seems like something is wrong with your PLT building and the training never actually started; you can see that no process is using your GPUs. I just cloned the code and ran it on a 4-GPU machine like yours, and it works well, as shown below:
I debugged and realized that the code gets stuck in `deepxml/tree.py` at:

```python
if level == 0:
    while not os.path.exists('{}-Level-{}.npy'.format(self.groups_path, level)):
        time.sleep(30)
```

probably because the file `models/FastAttentionXML-Amazon-670K-Tree-0-cluster` is never created. Can you think of any reason for that?
This file is created by deepxml/cluster.py. I checked and ran the code from GitHub more than once, and it works well. Maybe you could check the code and data on your machine, debug your PLT building in deepxml/cluster.py, and run it again.
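In case it helps with debugging, here is a minimal, hypothetical check (the path prefix below is taken from this thread's Amazon-670K run and is an assumption; adjust it to your own configuration) that reports whether the cluster files produced by deepxml/cluster.py have appeared yet:

```python
import glob
import os

# Assumed prefix from this thread's Amazon-670K run; change it to match your config.
groups_path = 'models/FastAttentionXML-Amazon-670K-Tree-0-cluster'

# tree.py polls for '<groups_path>-Level-0.npy', so check whether it exists yet.
level0_file = '{}-Level-{}.npy'.format(groups_path, 0)
print('Level-0 cluster file present:', os.path.exists(level0_file))

# List any cluster files that have been written so far.
for path in sorted(glob.glob('{}-Level-*.npy'.format(groups_path))):
    print('found:', path, '({} bytes)'.format(os.path.getsize(path)))
```

If the Level-0 file never shows up, the clustering step in deepxml/cluster.py is the place to look.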
The entire problem was related to the CPU resources allocated. It runs as expected once I increased the allocation to 16 CPU cores. Thanks for your help.
I ran the code on the Amazon-670K dataset. I have not made any changes to the code or configuration files, but it has been training for more than 10 hours and the training is still not complete.
My GPU details are given below.
Can you confirm whether this amount of training time is expected?