wenwenyu / PICK-pytorch

Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)
https://arxiv.org/abs/2004.07464
MIT License
553 stars 191 forks source link

Move entities_list to config file; Add an option to specify image extension #43

Closed dbobrenko closed 3 years ago

dbobrenko commented 3 years ago

This PR consists from 3 commits:

1. Change absolute path to the example dataset in config file to the relative one, so that train.py can work from scratch without any need to modify config file.

2. Move entities_list to config file, so that all configuration steps for custom datasets can be done through config file, without any need to change code.

Detailed commit summary Refactored parts of code that uses old version of `Entities_List` in order to make it able to work with `entities_list` that is modified after config file is parsed, during runtime. Global variables `keys_vocab_cls`, `iob_labels` and `entities_vocab_cls` defined `utils/class_utils.py` were moved to the global dict `vocal_cls` in order to make it possible to reassign `keys_vocab_cls`, `iob_labels` and `entities_vocab_cls` global variables without losing reference to them. File `utils/entities.py` was removed from the project, since `entities_list` is already declared in `config.json`. Part with `entities_list` for custom datasets in `README.md` was updated. Entities list was added to `config.json` in order to make training on example dataset work from scratch.

3. Add an option to specify image extension to config.json file.

Detailed commit summary Added image extension to config file, so that custom datasets with image extension that differs from `.jpg` can be configurated from config file. Supported extensions are:`.jpg` and `.png`. Updated `README.md` specifies the option to change the image extension.

All tests successfully passed. Training on example dataset converges as before.

wenwenyu commented 3 years ago

Hi, thank you for your improvement. I have checked this code and it is useful and necessary. So I will merge this PR.

tengerye commented 3 years ago

@dbobrenko great job.

@wenwenyu @dbobrenko I would like to discuss some of my thoughts for further refinement.

  1. I think both train_dataset and validation_dataset shall share the same entities_list, what do you think?
  2. Could we assume the extension of image names are the same?
wenwenyu commented 3 years ago

@tengerye @dbobrenko

For the first advice, I agree with you. Sharing the same entities_list in coifig.json is a more suitable and convenient edit. But the train and the valid datasets maybe not the same in some situations. So image_ext args I prefer to separate it.

What do you think?

tengerye commented 3 years ago

@wenwenyu @dbobrenko OK, I see. How about fetching all the images (no matter the extensions) from the specific directory? As the extensions are not common parameters in pytorch's datasets from other repository.

dbobrenko commented 3 years ago

@wenwenyu @tengerye Thanks for your response and for accepting this PR!

@wenwenyu That's a good point, was thinking about sharing the same entities_list for both train_dataset and validation_dataset, but it requires more refactoring, and the PR would significantly grow in size. But it's definitely a good suggestion

Regarding image_ext @tengerye mentioned good approach to fetch all images found in directory Another scenario would be to take image_ext as a list of supported image formats (e.g jpg, jpeg, png, gif, bmp, etc), and read all detected images with given formats. I think 1st approach is much more friendly

wenwenyu commented 3 years ago

@tengerye Fetching all the images from the specific directory is good advice.

@dbobrenko It is true that the code requires more refactoring if we want to share entities_list args in config.json

According to our discussion, I suggest that:

  1. entities_list args do not do anything for the current version
  2. remove image_ext args and replace it with fetching all the images from the specific directory.

What do you think?

wenwenyu commented 3 years ago

@dbobrenko @tengerye The entities_list args may have long content in some situations, and it is not suitable to place entities_list in the config file. So I have decided to separate the entities_list args from the config.json file, same as the original version.

dbobrenko commented 3 years ago

@dbobrenko @tengerye The entities_list args may have long content in some situations, and it is not suitable to place entities_list in the config file. So I have decided to separate the entities_list args from the config.json file, same as the original version.

@wenwenyu you are right, in some cases entities lists can be very large. I think in this case the ideal solution would be to create a separate json with entities. So that user don't need to change any code files and pick can be easily packaged