wenwenyu / PICK-pytorch

Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)
https://arxiv.org/abs/2004.07464
MIT License
553 stars 191 forks

Training using Nvidia A100 GPU #114

Open ajaysurya1221 opened 2 years ago

ajaysurya1221 commented 2 years ago

Hi, I'm using one A100 GPU to train PICK and I've set distributed to false.

[2022-06-08 01:41:58,561 - train - INFO] - One GPU or CPU training mode start...
[2022-06-08 01:41:58,565 - train - INFO] - Dataloader instances created. Train datasets: 100 samples Validation datasets: 20 samples.
[2022-06-08 01:41:59,276 - train - INFO] - Model created, trainable parameters: 68571598.
[2022-06-08 01:41:59,277 - train - INFO] - Optimizer and lr_scheduler created.
[2022-06-08 01:41:59,277 - train - INFO] - Max_epochs: 35 Log_per_step: 20 Validation_per_step: 100.
[2022-06-08 01:41:59,277 - train - INFO] - Training start...
[2022-06-08 01:41:59,289 - trainer - WARNING] - Training is using GPU 0!

Training gets stuck at this point, and after 10-15 minutes it throws a cuDNN error. Any solution?

CUDA version = 10.1 and PyTorch = 1.5.1+cu101

bankh commented 1 year ago

@ajaysurya1221 The A100 uses the Ampere architecture, with compute capability in the sm_8x family (sm_80 for the A100). CUDA 10.1 predates Ampere, so some of the CUDA computations do not run correctly under cu101, which is what PICK's implementation requires. You can try a different PyTorch version built against a newer CUDA (e.g., cu111). Be aware that cu111 brings different issues of its own, specifically on the decoder side of the model. There are a few quick patches you can apply to work around those as well.
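
To check whether a given PyTorch build targets your GPU before starting a long training run, something like the following sketch may help. It is not part of PICK. The helper name `kernel_support` and the example arch lists are hypothetical and only for illustration. The live check at the bottom uses `torch.cuda.get_device_capability` and, where the installed PyTorch provides it, `torch.cuda.get_arch_list`; older builds may lack the latter, so the code falls back to an empty list in that case.

```python
def kernel_support(capability, arch_list):
    """Classify how a build compiled for `arch_list` can run on a GPU.

    capability: (major, minor) tuple, e.g. (8, 0) for an A100.
    arch_list:  strings such as "sm_75" (compiled kernels) or
                "compute_75" (PTX that the driver can JIT-compile).
    Returns "native", "ptx", or "none".
    """
    major, minor = capability
    device = major * 10 + minor
    result = "none"
    for arch in arch_list:
        kind, _, num = arch.partition("_")
        if not num.isdigit():
            continue  # skip entries this sketch does not understand
        num = int(num)
        # Compiled kernels run on the same major version with an
        # equal or higher minor version.
        if kind == "sm" and num // 10 == major and num % 10 <= minor:
            return "native"
        # PTX can be JIT-compiled for any equal or newer capability.
        if kind == "compute" and num <= device:
            result = "ptx"
    return result


if __name__ == "__main__":
    # Hypothetical arch lists, for illustration only.
    pre_ampere = ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75"]
    print(kernel_support((8, 0), pre_ampere))
    print(kernel_support((8, 0), pre_ampere + ["sm_80"]))

    # Live check, only if PyTorch with a visible GPU is present.
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = getattr(torch.cuda, "get_arch_list", lambda: [])()
        print(torch.__version__, cap, archs, kernel_support(cap, archs))
```

If the live check reports "none" or "ptx" for your card, that would be consistent with the kind of mismatch described above, though it does not by itself prove that the cuDNN error comes from it.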