wenwenyu / PICK-pytorch

Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)
https://arxiv.org/abs/2004.07464
MIT License
553 stars 191 forks

Training on colab #37

Open vpvsankar opened 3 years ago

vpvsankar commented 3 years ago

Is it possible to train this model on colab? I have a small dataset.

dipesh-commits commented 3 years ago

I believe Colab doesn't provide distributed training. Currently, the code in this repo is written to run on a distributed server. You can modify the parts of the code that handle distributed training and then train on Colab.
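For reference, a traceback later in this thread shows train.py being started through torch.distributed.launch with a single process (`-d 0 --local_world_size 1`). The sketch below only assembles such a command so it can be inspected before running it. The helper name `build_launch_command` and its parameters are hypothetical, and the launcher options `--nproc_per_node`, `--nnodes`, and `--node_rank` are assumed to match the installed PyTorch version.

```python
import sys


def build_launch_command(config_path="config.json", device_ids="0", nproc_per_node=1):
    """Return the argv list for a one-node launch of train.py.

    Hypothetical helper: the train.py flags (-c, -d, --local_world_size)
    are taken from the command shown in this thread's traceback.
    The launcher itself appends --local_rank for each process.
    """
    return [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}",
        "--nnodes=1",
        "--node_rank=0",
        "train.py",
        "-c", config_path,
        "-d", device_ids,
        "--local_world_size", str(nproc_per_node),
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be pasted into a Colab cell.
    print(" ".join(build_launch_command()))
```

With one GPU, as on a typical Colab runtime, the process count and `--local_world_size` both stay at 1.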

kbrajwani commented 3 years ago

I found a blog post, https://medium.com/analytics-vidhya/extracting-structured-data-from-invoice-96cf5e548e40, whose authors uploaded a Colab notebook. It shows how to preprocess the SROIE dataset for this repo and train on Colab.

vpvsankar commented 3 years ago

@kbrajwani thank you so much, brother :). Were you able to get the results mentioned in the paper?

kbrajwani commented 3 years ago

@vpvsankar I have not trained for more than 30 epochs because it takes too much time, so I didn't compare the results.

vpvsankar commented 3 years ago

I am getting this error, but the path I gave is correct:

  File "train.py", line 162, in <module>
    entry_point(config)
  File "train.py", line 126, in entry_point
    main(config, local_master, logger if local_master else None)
  File "train.py", line 34, in main
    train_dataset = config.init_obj('train_dataset', pick_dataset_module)
  File "/content/drive/My Drive/PICK-pytorch/parse_config.py", line 105, in init_obj
    return getattr(module, module_name)(*args, **module_args)
  File "/content/drive/My Drive/PICK-pytorch/data_utils/pick_dataset.py", line 64, in __init__
    raise FileNotFoundError('Entity folder is not exist!')
FileNotFoundError: Entity folder is not exist!
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'train.py', '--local_rank=0', '-c', 'config.json', '-d', '0', '--local_world_size', '1']' returned non-zero exit status 1.

vpvsankar commented 3 years ago

It worked thanks

tengerye commented 3 years ago

@kbrajwani Hi, great job! Would you like to merge your code for processing the SROIE dataset into this repository? We can't do this ourselves at the moment for several reasons.

kbrajwani commented 3 years ago

@tengerye I would love to, but first let me tell you that it is far from perfect. If you are okay with my logic, I will merge the code. Here is my logic:

for key, value in sorted(entities.items()):
    # df[9] is the transcript column, read from the CSV files in the box/ folder.
    # If a transcript contains any of the comma-separated entity values, tag that row with the key.
    idx = df[df[9].str.contains('|'.join(map(str.strip, value.split(','))))].index
    # df[10] is the NER tag for that transcript, as required by the PICK-pytorch boxes_and_transcripts folder.
    df.loc[idx, 10] = key

# Row layout: index, coordinates (x1_1, y1_1, x2_1, y2_1, x3_1, y3_1, x4_1, y4_1), transcript, NER tag

I can't explain it much better in a comment here. It is explained well in the blog.
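One caveat with this matching rule: pandas `str.contains` treats its pattern as a regular expression by default, so entity values containing characters such as `.` or `$` may match more or less than intended. Below is a dependency-free sketch of the same tagging idea that uses plain substring matching instead. The function name `tag_rows` and the fallback tag `"other"` are assumptions for illustration, not part of the code above.

```python
def tag_rows(rows, entities, default_tag="other"):
    """Append an entity tag to every row, based on its transcript.

    rows: lists shaped [index, x1, y1, x2, y2, x3, y3, x4, y4, transcript]
    entities: dict mapping an entity key to a value string whose
              comma-separated parts are searched for in the transcript
    default_tag: hypothetical fallback for rows that match nothing
    """
    tagged = []
    for row in rows:
        transcript = row[9]
        tag = default_tag
        for key, value in sorted(entities.items()):
            # Skip empty parts so that a trailing comma cannot match everything.
            parts = [p.strip() for p in value.split(",") if p.strip()]
            if any(part in transcript for part in parts):
                tag = key  # a later key overwrites an earlier one, as in the loop above
        tagged.append(row + [tag])
    return tagged


if __name__ == "__main__":
    box = [0, 0, 10, 0, 10, 5, 0, 5]  # made-up coordinates
    rows = [
        [0] + box + ["TOTAL 12.50"],
        [1] + box + ["ACME SDN BHD"],
        [2] + box + ["Thank you"],
    ]
    entities = {"company": "ACME SDN BHD", "total": "12.50"}
    for row in tag_rows(rows, entities):
        print(row[9], "->", row[10])
```

Substring matching is stricter than the regex version: `12.50` no longer matches `12x50`, because the dot is taken literally.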

tengerye commented 3 years ago

@kbrajwani Thanks for your kind reply. I will take a look at your blog.

tomaschild commented 3 years ago

I'm getting this error when running the training, in both the distributed and the non-distributed way:

Traceback (most recent call last):
  File "train.py", line 166, in <module>
    entry_point(config)
  File "train.py", line 130, in entry_point
    main(config, local_master, logger if local_master else None)
  File "train.py", line 35, in main
    set_vocab(config['train_dataset']['args']['entities_list'])
KeyError: 'entities_list'

Any clue about how to fix it?

kbrajwani commented 3 years ago

@tomaschild if you are using the blog's notebook, please clone the fork https://github.com/dlmade/Pick.Pytorch.Sroie, which is the repo they used for training. It already contains the preprocessed SROIE dataset.
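The KeyError above means the loaded config has no `entities_list` entry under `train_dataset` -> `args`, which suggests that the config.json and the train.py in use come from different versions of the code. A small pre-flight check can report this before training starts. This is a minimal sketch; the helper name `missing_keys` and the dotted-path notation are made up for illustration.

```python
def missing_keys(config, required_paths):
    """Return the dotted key paths that are absent from a nested config dict.

    Hypothetical helper: "train_dataset.args.entities_list" stands for
    config["train_dataset"]["args"]["entities_list"].
    """
    missing = []
    for dotted in required_paths:
        node = config
        for part in dotted.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(dotted)
                break
            node = node[part]
    return missing


if __name__ == "__main__":
    # Example config that lacks the key train.py asks for in the traceback above.
    config = {"train_dataset": {"args": {}}}
    print(missing_keys(config, ["train_dataset.args.entities_list"]))
```

In practice the dict would come from `json.load` on the config.json passed to train.py with `-c`.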

pranavstha11 commented 3 years ago

@vpvsankar How did you solve it? I am also getting the same error. Please help.

keshav-qubitrics commented 3 years ago

@kbrajwani @pranavstha11 Did you find a solution for the error "Entity folder is not exist!"?

dlmade commented 3 years ago

Hey @keshav-qubitrics, there is an error in the https://github.com/dlmade/Pick.Pytorch.Sroie/blob/master/config.json file. You have to change the data paths to match your working directory. Check lines 61 to 64 and 73 to 76.
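To find which configured paths do not exist on the current machine, something like the following sketch can help. It walks a parsed config.json and reports path-like entries that are missing on disk. The key suffixes in `PATH_KEY_SUFFIXES` are an assumption about how the dataset paths are named in the config, and the `base_dir` parameter is hypothetical, since this thread does not say how the repo resolves relative paths.

```python
from pathlib import Path

# Assumption: path entries in the config use keys ending in one of these suffixes.
PATH_KEY_SUFFIXES = ("_folder", "files_name")


def find_missing_paths(node, base_dir=".", prefix=""):
    """Walk a parsed config and list (key path, resolved path) pairs missing on disk."""
    missing = []
    if isinstance(node, dict):
        for key, value in node.items():
            dotted = f"{prefix}.{key}" if prefix else key
            if isinstance(value, str) and key.endswith(PATH_KEY_SUFFIXES):
                path = Path(value)
                if not path.is_absolute():
                    # Relative entries are resolved against base_dir (hypothetical choice).
                    path = Path(base_dir) / path
                if not path.exists():
                    missing.append((dotted, str(path)))
            else:
                missing.extend(find_missing_paths(value, base_dir, dotted))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            missing.extend(find_missing_paths(item, base_dir, f"{prefix}[{i}]"))
    return missing
```

Calling it on the result of `json.load` for config.json returns a list of entries to fix. An empty list means every matched path exists, though it does not guarantee the folders contain the expected files.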

keshav-qubitrics commented 3 years ago

Thank you for your response @dlmade