yourh / AttentionXML

Implementation for "AttentionXML: Label Tree-based Attention-Aware Deep Model for High-Performance Extreme Multi-Label Text Classification"

Training Error on Amazon-670k #34

Closed celsofranssa closed 1 year ago

celsofranssa commented 1 year ago

Hello,

After hours of training on Amazon-670k, I am getting the following error:

[I 230302 08:33:14 tree:145] Finish Training Level-1
[I 230302 08:33:14 tree:149] Generating Candidates for Level-2, Number of Labels: 16384, Top: 160
Candidates:   0%|          | 0/459301 [00:00<?, ?it/s]
Candidates:   1%|          | 2517/459301 [00:00<00:18, 25168.82it/s]
Candidates:   1%|          | 5013/459301 [00:00>
Parents: 0it [00:00, ?it/s]
Parents: 3155it [00:00, 31545.29it/s]
Parents: 6310it [00:00, 31546.17it/s]
Parents: 9409it [00:00, 31375.56it/s]
Parents: 12587it [00:00, 31>
  File "main.py", line 98, in <module>
    main()
  File "/home/celso/projects/venvs/AttentionXML/lib/python3.8/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/celso/projects/venvs/AttentionXML/lib/python3.8/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/celso/projects/venvs/AttentionXML/lib/python3.8/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/celso/projects/venvs/AttentionXML/lib/python3.8/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "main.py", line 70, in main
    model.train(train_x, train_y, valid_x, valid_y, mlb)
  File "/home/celso/projects/AttentionXML/deepxml/tree.py", line 200, in train
    self.train_level(self.level - 1, train_x, train_y, valid_x, valid_y)
  File "/home/celso/projects/AttentionXML/deepxml/tree.py", line 86, in train_level
    train_group_y, train_group, valid_group = self.train_level(level - 1, train_x, train_y, valid_x, valid_y)
  File "/home/celso/projects/AttentionXML/deepxml/tree.py", line 132, in train_level
    model = XMLModel(network=FastAttentionRNN, labels_num=labels_num, emb_init=self.emb_init,
  File "/home/celso/projects/AttentionXML/deepxml/models.py", line 145, in __init__
    self.attn_weights = AttentionWeights(labels_num, hidden_size*2, attn_device_ids)
  File "/home/celso/projects/AttentionXML/deepxml/modules.py", line 88, in __init__
    group_size, plus_num = labels_num // len(device_ids), labels_num % len(device_ids)
ZeroDivisionError: integer division or modulo by zero
celsofranssa commented 1 year ago

I already have these files in the models dir:

FastAttentionXML-Amazon-670k-Tree-0-Level-0
FastAttentionXML-Amazon-670k-Tree-0-Level-1
FastAttentionXML-Amazon-670k-Tree-0-cluster-Level-0.npy
FastAttentionXML-Amazon-670k-Tree-0-cluster-Level-1.npy
FastAttentionXML-Amazon-670k-Tree-0-cluster-Level-2.npy
FastAttentionXML-Amazon-670k-Tree-1-Level-0
FastAttentionXML-Amazon-670k-Tree-1-Level-1
FastAttentionXML-Amazon-670k-Tree-1-cluster-Level-0.npy
FastAttentionXML-Amazon-670k-Tree-1-cluster-Level-1.npy
FastAttentionXML-Amazon-670k-Tree-1-cluster-Level-2.npy
FastAttentionXML-Amazon-670k-Tree-2-Level-0
FastAttentionXML-Amazon-670k-Tree-2-Level-1
FastAttentionXML-Amazon-670k-Tree-2-cluster-Level-0.npy
FastAttentionXML-Amazon-670k-Tree-2-cluster-Level-1.npy
FastAttentionXML-Amazon-670k-Tree-2-cluster-Level-2.npy

Could you give me some directions?

celsofranssa commented 1 year ago

It seems the cause is the line device_ids = list(range(1, torch.cuda.device_count())). Since I have only one RTX 3090, torch.cuda.device_count() returns 1, so the range starting from 1 is empty and device_ids ends up as an empty list.

Is that the correct diagnosis? If so, does this experiment only run on machines with two or more GPUs?
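
A minimal sketch reproducing the failure (the labels_num value is just borrowed from the log above for illustration; the rest only mimics the quoted lines and is not the actual module code):

# Simulate torch.cuda.device_count() on a single-GPU machine.
cuda_device_count = 1

# The quoted line builds the attention device list starting at GPU index 1,
# so with a single GPU it comes out empty.
device_ids = list(range(1, cuda_device_count))  # -> []

labels_num = 16384  # label count taken from the log above, for illustration

try:
    # Same split as deepxml/modules.py line 88: divide the labels across
    # the attention devices; len(device_ids) == 0 triggers the error.
    group_size, plus_num = labels_num // len(device_ids), labels_num % len(device_ids)
except ZeroDivisionError as err:
    print("Reproduced:", err)  # integer division or modulo by zero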

yourh commented 1 year ago

Yes, the code is written for two or more GPUs. If you only have one GPU, you can change parallel_attn = labels_num <= most_labels_parallel_attn to parallel_attn = True in models.py.
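
For a single-GPU setup, the suggested edit would look something like this (only the quoted assignment comes from the repository; the comments are annotations):

# In deepxml/models.py:
# parallel_attn = labels_num <= most_labels_parallel_attn  # original condition
parallel_attn = True  # single-GPU workaround suggested above

With parallel_attn forced to True, the AttentionWeights module that raised the ZeroDivisionError above is presumably no longer constructed with an empty device list.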

celsofranssa commented 1 year ago

Thank you.