Closed dbobrenko closed 3 years ago
Hi, thank you for your improvement. I have checked this code and it is useful and necessary. So I will merge this PR.
@dbobrenko great job.
@wenwenyu @dbobrenko I would like to discuss some of my thoughts for further refinement.
train_dataset
and validation_dataset
shall share the same entities_list
, what do you think?@tengerye @dbobrenko
For the first advice, I agree with you. Sharing the same entities_list
in coifig.json
is a more suitable and convenient edit.
But the train and the valid datasets maybe not the same in some situations. So image_ext
args I prefer to separate it.
What do you think?
@wenwenyu @dbobrenko OK, I see. How about fetching all the images (no matter the extensions) from the specific directory? As the extensions are not common parameters in pytorch's datasets from other repository.
@wenwenyu @tengerye Thanks for your response and for accepting this PR!
@wenwenyu That's a good point, was thinking about sharing the same entities_list
for both train_dataset
and validation_dataset
, but it requires more refactoring, and the PR would significantly grow in size. But it's definitely a good suggestion
Regarding image_ext
@tengerye mentioned good approach to fetch all images found in directory
Another scenario would be to take image_ext
as a list of supported image formats (e.g jpg
, jpeg
, png
, gif
, bmp
, etc), and read all detected images with given formats. I think 1st approach is much more friendly
@tengerye Fetching all the images from the specific directory is good advice.
@dbobrenko It is true that the code requires more refactoring if we want to share entities_list
args in config.json
According to our discussion, I suggest that:
entities_list
args do not do anything for the current versionimage_ext
args and replace it with fetching all the images from the specific directory.What do you think?
@dbobrenko @tengerye The entities_list
args may have long content in some situations, and it is not suitable to place entities_list
in the config file. So I have decided to separate the entities_list
args from the config.json
file, same as the original version.
@dbobrenko @tengerye The
entities_list
args may have long content in some situations, and it is not suitable to placeentities_list
in the config file. So I have decided to separate theentities_list
args from theconfig.json
file, same as the original version.
@wenwenyu you are right, in some cases entities lists can be very large. I think in this case the ideal solution would be to create a separate json with entities. So that user don't need to change any code files and pick can be easily packaged
This PR consists from 3 commits:
1. Change absolute path to the example dataset in config file to the relative one, so that train.py can work from scratch without any need to modify config file.
2. Move
entities_list
to config file, so that all configuration steps for custom datasets can be done through config file, without any need to change code.Detailed commit summary
Refactored parts of code that uses old version of `Entities_List` in order to make it able to work with `entities_list` that is modified after config file is parsed, during runtime. Global variables `keys_vocab_cls`, `iob_labels` and `entities_vocab_cls` defined `utils/class_utils.py` were moved to the global dict `vocal_cls` in order to make it possible to reassign `keys_vocab_cls`, `iob_labels` and `entities_vocab_cls` global variables without losing reference to them. File `utils/entities.py` was removed from the project, since `entities_list` is already declared in `config.json`. Part with `entities_list` for custom datasets in `README.md` was updated. Entities list was added to `config.json` in order to make training on example dataset work from scratch.3. Add an option to specify image extension to
config.json
file.Detailed commit summary
Added image extension to config file, so that custom datasets with image extension that differs from `.jpg` can be configurated from config file. Supported extensions are:`.jpg` and `.png`. Updated `README.md` specifies the option to change the image extension.All tests successfully passed. Training on example dataset converges as before.