wellcometrust / deep_reference_parser

A deep learning model for extracting references from text
MIT License

Consider spans in output #35

Open lizgzil opened 4 years ago

lizgzil commented 4 years ago

The output of split_parser, split and parser is a list of tokens and their predictions.

It may be worth considering a different type of output giving the spans (start and end positions) of each reference/token rather than the tokens themselves.
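
For illustration, a minimal sketch of what a span-style output could look like, assuming the original text is available alongside the tokens; the function and field names below are hypothetical, not part of the current API:

```python
# Hypothetical sketch of a span-style output: instead of returning
# (token, label) pairs, return character offsets into the original text.
# Nothing below is part of the current API; names are illustrative only.
from typing import Dict, List


def tokens_to_spans(text: str, tokens: List[str], labels: List[str]) -> List[Dict]:
    """Map each predicted token back to its character span in `text`."""
    spans = []
    cursor = 0
    for token, label in zip(tokens, labels):
        start = text.find(token, cursor)
        if start == -1:  # token not found (e.g. normalised whitespace); skip it
            continue
        end = start + len(token)
        spans.append({"start": start, "end": end, "text": token, "label": label})
        cursor = end
    return spans


# Example:
# tokens_to_spans("Smith J. 2001.", ["Smith", "J", ".", "2001", "."],
#                 ["author", "author", "author", "year", "year"])
# -> [{"start": 0, "end": 5, "text": "Smith", "label": "author"}, ...]
```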

nsorros commented 4 years ago

I am not sure how controversial this would be, but it would definitely eliminate the need to merge tokens afterwards, since the algorithm would extract a start and end for each component in a QA fashion.
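
To make the "merge tokens afterwards" step concrete, here is a hedged sketch of the kind of post-processing a span output would remove; this is illustrative only, not the package's actual code:

```python
# Illustration only: the post-hoc merging a span-based output would make
# unnecessary. Contiguous tokens with the same predicted label are grouped
# into one component, e.g. ["Smith", "J", "."] -> "Smith J ." as author.
from itertools import groupby
from typing import List, Tuple


def merge_components(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group contiguous tokens sharing a label into (label, text) components."""
    pairs = zip(labels, tokens)
    return [
        (label, " ".join(token for _, token in group))
        for label, group in groupby(pairs, key=lambda pair: pair[0])
    ]


# merge_components(["Smith", "J", ".", "2001", "."],
#                  ["author", "author", "author", "year", "year"])
# -> [("author", "Smith J ."), ("year", "2001 .")]
```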

ivyleavedtoadflax commented 4 years ago

I thought of these outputs as placeholders. None of those scripts is suitable for production because they instantiate the model every time they make a prediction, so their utility is somewhat limited. That said, I think I implemented an --output flag which will dump the output to a JSON file.

lizgzil commented 4 years ago

@ivyleavedtoadflax ok that makes sense re outputs.

In terms of the instantiation of the model, is it not true that

splitter_parser = SplitParser(config_file=MULTITASK_CFG)

instantiates the model and then you could do

reference_predictions = splitter_parser.split_parse(text)

as many times as you wanted without having to reinstantiate the model?
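
For reference, the reuse pattern being described, as a sketch; the import paths are assumptions (check the package for the exact modules exposing SplitParser and MULTITASK_CFG):

```python
# Sketch of the reuse pattern discussed above: build the model once, then call
# split_parse repeatedly. Import locations below are assumptions.
from deep_reference_parser import SplitParser  # assumed import location
from deep_reference_parser.common import MULTITASK_CFG  # assumed import location

# Instantiation loads and builds the model once (the expensive step).
splitter_parser = SplitParser(config_file=MULTITASK_CFG)

# Subsequent predictions reuse the already-loaded model.
documents = ["First policy document text...", "Second policy document text..."]
for text in documents:
    reference_predictions = splitter_parser.split_parse(text)
```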

nsorros commented 4 years ago

> @ivyleavedtoadflax ok that makes sense re outputs.
>
> In terms of the instantiation of the model, is it not true that
>
> splitter_parser = SplitParser(config_file=MULTITASK_CFG)
>
> instantiates the model and then you could do
>
> reference_predictions = splitter_parser.split_parse(text)
>
> as many times as you wanted without having to reinstantiate the model?

Even though it is unrelated to this issue, I am almost 100% sure you are right. @ivyleavedtoadflax can confirm.

ivyleavedtoadflax commented 4 years ago

Yup exactly right @lizgzil. That's not how I had done it in the split, parse, split_parse commands, which is why they are no good for prod.