yvquanli / GLAM

Code for "An adaptive graph learning method for automated molecular interactions and properties predictions".
https://www.nature.com/articles/s42256-022-00501-8
MIT License

Stuck in Training start ... #3

Closed stanlo229 closed 2 years ago

stanlo229 commented 2 years ago

I downloaded the bindingdb and lit_pcba datasets and put them in the correct directories. However, I get stuck at "Training start ...". demo.py works properly, so I'm not sure what's wrong. Please help!

```
Inference process of bindingdb_c start...
Solver for bindingdb_c running start @ Fri Jul 15 16:40:52 2022
1 gpus available
Configuration 0 start: config_id is 311c2
config is {'dataset': 'bindingdb_c', 'hid_dim_alpha': 2, 'e_dim': 2048, 'mol_block': '_TripletMessage', 'pro_block': '_GATConv', 'message_steps': 2, 'mol_readout': 'GlobalPool5', 'pro_readout': 'GlobalPool5', 'pre_do': 'Dropout(0.1)', 'graph_do': 'Dropout(0.1)', 'flat_do': 'Dropout(0.2)', 'end_do': 'Dropout(0.1)', 'pre_norm': '_None', 'graph_norm': '_LayerNorm', 'flat_norm': '_None', 'end_norm': '_LayerNorm', 'pre_act': 'LeakyReLU', 'graph_act': 'RReLU', 'flat_act': 'ReLU', 'end_act': 'CELU', 'graph_res': 1, 'loss': 'ce', 'batch_size': 64, 'optim': 'Ranger', 'k': 6, 'epochs': 20, 'lr': 0.0001, 'early_stop_patience': 50}
Choosing the GPU device has largest free memory...
Sorted by free memory size
Using GPU 0: index:0 gpu_name:NVIDIA GeForce RTX 3070 Ti memory.free:7243 memory.total:8192
0
Choosing the GPU device has largest free memory...
Sorted by free memory size
Using GPU 0: index:0 gpu_name:NVIDIA GeForce RTX 3070 Ti memory.free:7243 memory.total:8192
Choosing the GPU device has largest free memory...
Sorted by free memory size
Using GPU 0: index:0 gpu_name:NVIDIA GeForce RTX 3070 Ti memory.free:7243 memory.total:8192
python3 run.py --dataset bindingdb_c --hid_dim_alpha 2 --e_dim 2048 --mol_block _TripletMessage --pro_block _GATConv --message_steps 2 --mol_readout GlobalPool5 --pro_readout GlobalPool5 --pre_do Dropout(0.1) --graph_do Dropout(0.1) --flat_do Dropout(0.2) --end_do Dropout(0.1) --pre_norm _None --graph_norm _LayerNorm --flat_norm _None --end_norm _LayerNorm --pre_act LeakyReLU --graph_act RReLU --flat_act ReLU --end_act CELU --graph_res 1 --loss ce --batch_size 64 --optim Ranger --k 6 --epochs 20 --lr 0.0001 --early_stop_patience 50 --note 311c2 --gpu 0 --seed 1
Loading dataset...
Training init...
################################################################################
dataset_root:../../Dataset/GLAM-DTI
dataset:bindingdb_c
seed:1
gpu:0
note:311c2
hid_dim_alpha:2
mol_block:_TripletMessage
pro_block:_GATConv
e_dim:2048
out_dim:2
message_steps:2
mol_readout:GlobalPool5
pro_readout:GlobalPool5
pre_norm:_None
graph_norm:_LayerNorm
flat_norm:_None
end_norm:_LayerNorm
pre_do:Dropout(0.1)
graph_do:Dropout(0.1)
flat_do:Dropout(0.2)
end_do:Dropout(0.1)
pre_act:LeakyReLU
graph_act:RReLU
flat_act:ReLU
end_act:CELU
graph_res:1
batch_size:64
epochs:20
loss:ce
optim:Ranger
k:6
lr:0.0001
lr_reduce_rate:0.7
lr_reduce_patience:20
early_stop_patience:50
verbose_patience:2000
################################################################################
save id: 2022-07-15_20:40:59.028_seed_1
run device: cuda:0
train set num:48006
valid set num:5475
test set num: 5371
total parameters:165232
################################################################################
Architecture(
  (mol_lin0): LinearBlock(
    (norm): _None()
    (dropout): Dropout(p=0.1, inplace=False)
    (linear): Linear(in_features=15, out_features=30, bias=True)
    (act): LeakyReLU(negative_slope=0.01)
  )
  (pro_lin0): LinearBlock(
    (norm): _None()
    (dropout): Dropout(p=0.1, inplace=False)
    (linear): Linear(in_features=49, out_features=30, bias=True)
    (act): LeakyReLU(negative_slope=0.01)
  )
  (mol_conv): MessageBlock(
    (norm): _LayerNorm(
      (norm): LayerNorm(30)
    )
    (dropout): Dropout(p=0.1, inplace=False)
    (conv): _TripletMessage(
      (conv): TripletMessage()
    )
    (gru): GRU(30, 30)
    (act): RReLU(lower=0.125, upper=0.3333333333333333)
  )
  (pro_conv): MessageBlock(
    (norm): _LayerNorm(
      (norm): LayerNorm(30)
    )
    (dropout): Dropout(p=0.1, inplace=False)
    (conv): _GATConv(
      (conv): GATConv(30, 30, heads=1)
    )
    (gru): None
    (act): RReLU(lower=0.125, upper=0.3333333333333333)
  )
  (mol_readout): GlobalPool5()
  (pro_readout): GlobalPool5()
  (mol_flat): LinearBlock(
    (norm): _None()
    (dropout): Dropout(p=0.2, inplace=False)
    (linear): Linear(in_features=150, out_features=30, bias=True)
    (act): ReLU()
  )
  (pro_flat): LinearBlock(
    (norm): _None()
    (dropout): Dropout(p=0.2, inplace=False)
    (linear): Linear(in_features=150, out_features=30, bias=True)
    (act): ReLU()
  )
  (lin_out0): LinearBlock(
    (norm): _LayerNorm(
      (norm): LayerNorm(64)
    )
    (dropout): Dropout(p=0.1, inplace=False)
    (linear): Linear(in_features=64, out_features=2048, bias=True)
    (act): CELU(alpha=1.0)
  )
  (lin_out1): LinearBlock(
    (norm): _LayerNorm(
      (norm): LayerNorm(2048)
    )
    (dropout): Dropout(p=0.1, inplace=False)
    (linear): Linear(in_features=2048, out_features=2, bias=True)
    (act): _None()
  )
)
################################################################################
Training start...
  0%|          | 0/20 [00:00<?, ?it/s]
batch 0 training loss: 0.79090 time elapsed 0.00 hrs (0.0 mins)
```

stanlo229 commented 2 years ago

I've also tried the perturbed dataset, and it works fine. It seems that the problem only occurs when I use the datasets downloaded from the URLs you provided.

stanlo229 commented 2 years ago

Would you mind sharing your data folder for BindingDB?

stanlo229 commented 2 years ago

Never mind, I fixed it! I just had to wait a bit. If anyone else encounters this, make sure your data makes sense with respect to dataset.py; a quick sanity check is sketched below.
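Something like the following can verify that a downloaded dataset is roughly in the shape dataset.py expects. This is a minimal sketch, not GLAM code: the CSV path (under the `dataset_root` shown in the log above) and the `smiles` column name are assumptions, so adjust them to whatever dataset.py actually reads.

```python
# Hypothetical sanity check for a downloaded dataset.
# The file path and 'smiles' column name are assumptions; match them
# to whatever dataset.py actually reads for bindingdb_c.
import pandas as pd
from rdkit import Chem

df = pd.read_csv('../../Dataset/GLAM-DTI/raw/bindingdb_c.csv')  # assumed path
print(df.shape, df.columns.tolist())
print(df.isna().sum())  # missing values can break featurization silently

# Unparsable SMILES are a common reason preprocessing stalls or drops rows.
bad = [i for i, smi in enumerate(df['smiles']) if Chem.MolFromSmiles(str(smi)) is None]
print(f'{len(bad)} unparsable SMILES rows (first few: {bad[:5]})')
```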

yvquanli commented 2 years ago

Thank you for your issue. From the log, I can see that the dataset processing works well and the dataloader also works well, so I'm not sure what caused the apparent hang. If the process is stuck without raising any error, it is actually running normally, just slowly (the data-loading pipeline is not very efficient, which slows training down). If the process was shut down for an unknown reason, I have no idea what happened. If GPU memory is not enough, you could try a GPU with more memory.
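For anyone hitting the same symptom, a probe like the one below can distinguish a genuinely hung run from a slow one. This is a sketch only: `train_dataset` is a stand-in for whatever Dataset object run.py actually constructs, and the `num_workers`/`pin_memory` settings are generic PyTorch knobs, not options the GLAM code is confirmed to expose.

```python
# A sketch: time the first few batches to confirm the run is alive.
# 'train_dataset' is hypothetical -- substitute the Dataset object that
# run.py builds for bindingdb_c.
import time
from torch_geometric.loader import DataLoader

train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,     # parallel workers often hide slow per-sample featurization
    pin_memory=True,   # faster host-to-GPU copies
)

start = time.time()
for i, batch in enumerate(train_loader):
    print(f'batch {i} ready after {time.time() - start:.1f}s')
    if i == 4:
        break  # if these lines appear, training is progressing, just slowly
```

The log above also shows that run.py accepts a `--gpu` flag, so if memory is the bottleneck on a multi-GPU machine, selecting a device with more free memory is a one-flag change.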