wellcometrust / deep_reference_parser

A deep learning model for extracting references from text
MIT License

Mismatching dimensions when predicting with multitask models #26

Closed ivyleavedtoadflax closed 4 years ago

ivyleavedtoadflax commented 4 years ago

This issue occurs in #25. When running predict via the split_parse command, the following error results:

(virtualenv)  $ python -m deep_reference_parser split_parse "Upson MA (2019). This is a reference. In a journal. 16(1) 1-23"
Using TensorFlow backend.
ℹ Using config file:
ℹ Attempting to download model artefacts if they are not found locally
in models/multitask/2020.3.18_multitask/. This may take some time...
✔ Found models/multitask/2020.3.18_multitask/indices.pickle
✔ Found models/multitask/2020.3.18_multitask/weights.h5
✔ Found embeddings/2020.1.1-wellcome-embeddings-300.txt
Traceback (most recent call last):
  File "/home/matthew/Documents/wellcome/deep_reference_parser/build/virtualenv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1607, in _create_c_op
    c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimension 0 in both shapes must be equal, but are 347 and 324. Shapes are [347,100] and [324,100]. for 'Assign' (op: 'Assign') with input shapes: [347,100], [324,100].

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/matthew/.pyenv/versions/3.7.2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/matthew/.pyenv/versions/3.7.2/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/matthew/Documents/wellcome/deep_reference_parser/deep_reference_parser/__main__.py", line 30, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/home/matthew/Documents/wellcome/deep_reference_parser/build/virtualenv/lib/python3.7/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/matthew/Documents/wellcome/deep_reference_parser/build/virtualenv/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/matthew/Documents/wellcome/deep_reference_parser/deep_reference_parser/split_parse.py", line 198, in split_parse
    out = mt.split_parse(text, return_tokens=tokens, verbose=True)
  File "/home/matthew/Documents/wellcome/deep_reference_parser/deep_reference_parser/split_parse.py", line 116, in split_parse
    preds = self.drp.predict(tokens, load_weights=True)
  File "/home/matthew/Documents/wellcome/deep_reference_parser/deep_reference_parser/deep_reference_parser.py", line 1026, in predict
  File "/home/matthew/Documents/wellcome/deep_reference_parser/deep_reference_parser/deep_reference_parser.py", line 997, in load_weights
    self.model, self.weights_path, include_optimizer=False
  File "/home/matthew/Documents/wellcome/deep_reference_parser/build/virtualenv/lib/python3.7/site-packages/keras_contrib/utils/save_load_utils.py", line 97, in load_all_weights
    saving.load_weights_from_hdf5_group(f['model_weights'], model.layers)
  File "/home/matthew/Documents/wellcome/deep_reference_parser/build/virtualenv/lib/python3.7/site-packages/keras/engine/saving.py", line 1199, in load_weights_from_hdf5_group
  File "/home/matthew/Documents/wellcome/deep_reference_parser/build/virtualenv/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 2727, in batch_set_value
    assign_op = x.assign(assign_placeholder)
  File "/home/matthew/Documents/wellcome/deep_reference_parser/build/virtualenv/lib/python3.7/site-packages/tensorflow_core/python/ops/variables.py", line 2067, in assign
    self._variable, value, use_locking=use_locking, name=name)
  File "/home/matthew/Documents/wellcome/deep_reference_parser/build/virtualenv/lib/python3.7/site-packages/tensorflow_core/python/ops/state_ops.py", line 227, in assign
  File "/home/matthew/Documents/wellcome/deep_reference_parser/build/virtualenv/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_state_ops.py", line 66, in assign
    use_locking=use_locking, name=name)
  File "/home/matthew/Documents/wellcome/deep_reference_parser/build/virtualenv/lib/python3.7/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
  File "/home/matthew/Documents/wellcome/deep_reference_parser/build/virtualenv/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/matthew/Documents/wellcome/deep_reference_parser/build/virtualenv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/home/matthew/Documents/wellcome/deep_reference_parser/build/virtualenv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
  File "/home/matthew/Documents/wellcome/deep_reference_parser/build/virtualenv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1770, in __init__
  File "/home/matthew/Documents/wellcome/deep_reference_parser/build/virtualenv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1610, in _create_c_op
    raise ValueError(str(e))
ValueError: Dimension 0 in both shapes must be equal, but are 347 and 324. Shapes are [347,100] and [324,100]. for 'Assign' (op: 'Assign') with input shapes: [347,100], [324,100].
ivyleavedtoadflax commented 4 years ago

Seems to be occurring in https://github.com/wellcometrust/deep_reference_parser/blob/f392f9fef9025ec2bbb9317faeaf1c4f76abadc4/deep_reference_parser/deep_reference_parser.py#L972, but I suspect that something has gone awry elsewhere.
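To narrow down which layer the `Assign` error refers to, one option is to compare the weight shapes stored in the file against the shapes the freshly built model expects. The snippet below is only a minimal sketch of that comparison: `find_shape_mismatches`, the layer names, and the choice of which side holds 324 and which holds 347 are all assumptions made for illustration, not values read from the real `weights.h5`.

```python
def find_shape_mismatches(saved_shapes, model_shapes):
    """Return {name: (saved, expected)} for weights whose shapes differ."""
    mismatches = {}
    for name, expected in model_shapes.items():
        saved = saved_shapes.get(name)
        if saved is not None and tuple(saved) != tuple(expected):
            mismatches[name] = (tuple(saved), tuple(expected))
    return mismatches


# Hypothetical layer names and shapes, echoing the numbers in the error message.
saved_shapes = {"embedding/embeddings": (324, 100), "dense/kernel": (200, 3)}
model_shapes = {"embedding/embeddings": (347, 100), "dense/kernel": (200, 3)}

mismatches = find_shape_mismatches(saved_shapes, model_shapes)
for name, (saved, expected) in mismatches.items():
    print(f"{name}: file has {saved}, model expects {expected}")
```

A first dimension that differs while the second (100) agrees would be consistent with a lookup table, such as an embedding, that was sized from a different index at training time than at prediction time.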

lizgzil commented 4 years ago

Interesting, my error has a slightly different number: ValueError: Dimension 0 in both shapes must be equal, but are 347 and 385. Shapes are [347,100] and [385,100]. for 'Assign' (op: 'Assign') with input shapes: [347,100], [385,100].

lizgzil commented 4 years ago

(This is more for my records than anything else) I ran this for debugging so it didn't take ages loading with the model artefacts:

import os
from keras_contrib.utils import save_load_utils
from deep_reference_parser.common import MULTITASK_CFG
from deep_reference_parser.model_utils import get_config
from deep_reference_parser.reference_utils import break_into_chunks
from deep_reference_parser.deep_reference_parser import DeepReferenceParser
import en_core_web_sm

text = 'Upson MA (2019). This is a reference. In a journal. 16(1) 1-23'

# config_file was never defined; the imported MULTITASK_CFG is presumably what was meant
cfg = get_config(MULTITASK_CFG)
MAX_WORDS = int(cfg["data"]["line_limit"]) 
OUTPUT = cfg["build"]["output"]
PRETRAINED_EMBEDDING = cfg["build"]["pretrained_embedding"]
DROPOUT = float(cfg["build"]["dropout"])
LSTM_HIDDEN = int(cfg["build"]["lstm_hidden"])
WORD_EMBEDDING_SIZE = int(cfg["build"]["word_embedding_size"])
CHAR_EMBEDDING_SIZE = int(cfg["build"]["char_embedding_size"])
WORD_EMBEDDINGS = cfg["build"]["word_embeddings"]
OUTPUT_PATH = cfg["build"]["output_path"]

nlp = en_core_web_sm.load()
doc = nlp(text)
chunks = break_into_chunks(doc, max_words=MAX_WORDS)
tokens = [[token.text for token in chunk] for chunk in chunks]

drp = DeepReferenceParser(output_path=OUTPUT_PATH)


drp.predict(tokens, load_weights=True)

## Fails here, but let's look into predict:

weights_path = os.path.join(OUTPUT_PATH, "weights.h5")
save_load_utils.load_all_weights(
    drp.model, weights_path, include_optimizer=False)

## Same error here
lizgzil commented 4 years ago

Could it be to do with jgcbrouns's answer here? "Yea so this is a problem with the classes file, model file and/or anchor file not matching. Make sure that the same classes.txt file (the file where per new line your classes are defined) matches during training and during inference (test). In my case I used 2 different classes.txt file. One file had 4 categories and the other one had only 1 class."
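The analogous situation here would be a token, character, or label index that differs between training and prediction, since the number of rows in a lookup layer usually follows the size of the index it was built from. The toy sketch below only illustrates that mechanism. `build_index` and the token lists are hypothetical, and whether this is the actual cause of the error is not established in this thread.

```python
def build_index(sequences):
    """Map each distinct item to an integer id, reserving 0 for padding."""
    vocab = sorted({item for seq in sequences for item in seq})
    return {item: i for i, item in enumerate(vocab, start=1)}


# Hypothetical data: the index used at training time versus a different one
# rebuilt (or loaded from another file) at prediction time.
train_tokens = [["Upson", "MA", "(", "2019", ")", "."]]
other_tokens = [["Upson", "MA", "2019"]]

train_index = build_index(train_tokens)
other_index = build_index(other_tokens)

# In this sketch a lookup layer needs len(index) + 1 rows (the extra row is
# the padding id 0), so two different indices give two different shapes.
train_rows = len(train_index) + 1
other_rows = len(other_index) + 1
print(train_rows, other_rows)
```

If the two row counts differ, weights saved against one index cannot be assigned to a layer built from the other, which is the kind of dimension 0 mismatch reported above.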

ivyleavedtoadflax commented 4 years ago

Yes I suppose it is possible, and I was having some issues with an empty class creeping in, if you recall? Not sure where it would have occurred in the current logic though...

ivyleavedtoadflax commented 4 years ago

Note that this only occurs in the multitask scenario, so it's got to be something specific to it...

lizgzil commented 4 years ago

Is this what you expected? (i.e. the last length of 886443).

>>> train_data = load_tsv(POLICY_TRAIN)
>>> [len(l) for l in train_data[0]] # same for [len(l) for l in train_data[1]]

[150, 81, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 88, 150, 150, 121, 150, 150, 150, 150, 150, 150, 150, 58, 150, 150, 150, 150, 150, 108, 1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 5, 150, 150, 150, 150, 81, 1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 32, 150, 150, 150, 150, 150, 54, 150, 150, 150, 150, 150, 150, 150, ... 89, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 127, 150, 150, 110, 886443]

(test and valid data don't have the last element >150)

ivyleavedtoadflax commented 4 years ago

No, that's not expected; these should all be 150 or less. The 886443 is the Rodrigues data. In the datalabs cleanup PR I made some changes in the 2020.3.19 recipe that should fix the Rodrigues data. The very short values are caused, I suspect, by prodigy_to_tsv respecting doc endings. If you remove the -d flag in the tsv_Makefile 2020.3.19 recipe, and run the 2020.3.19 model again, these inputs should all be 150.
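A quick way to spot sequences that break the limit is to list their positions and lengths. This is a minimal sketch on toy data: `overlong` is a hypothetical helper, and the `sequences` list merely stands in for `train_data[0]`.

```python
def overlong(sequences, limit=150):
    """Return (position, length) for every sequence longer than limit."""
    return [(i, len(seq)) for i, seq in enumerate(sequences) if len(seq) > limit]


# Toy stand-in for train_data[0]: three sequences within the limit and one
# very long one, mirroring the lengths reported above.
sequences = [["tok"] * 150, ["tok"] * 81, ["tok"] * 1, ["tok"] * 886443]

print(overlong(sequences))
```

Running the same check on the test and validation splits should return an empty list if, as reported, they contain no element longer than 150.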

ivyleavedtoadflax commented 4 years ago

[150, 81, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 88, 150, 150, 121, 150, 150, 150, 150, 150, 150, 150, 58, 150, 150, 150, 150, 150, 108, 1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 5, 150, 150, 150, 150, 81, 1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 32, 150, 150, 150, 150, 150, 54, 150, 150, 150, 150, 150, 150, 150, ... 89, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 127, 150, 150, 110, 886443]

This is interesting in fact because these values are sequence lengths (i.e. 150 = 150 tokens). That means that the final value will get truncated to 150, because the 2020.3.18 model had a line_length set at 150. Subsequent model runs (like 2020.3.19 and 2020.3.20) made better use of the Rodrigues data by ensuring that it was cut into sequences of, say, 150, yet these models performed less well. This suggests to me that the Rodrigues data is making the model worse, not better...
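The difference between the two treatments can be made concrete with a small sketch. `truncate` and `split_into_chunks` are hypothetical helpers written for illustration, not the functions used in the recipes, and integers stand in for tokens.

```python
def truncate(seq, limit=150):
    """Keep only the first `limit` tokens and drop the rest."""
    return seq[:limit]


def split_into_chunks(seq, limit=150):
    """Cut the sequence into consecutive pieces of at most `limit` tokens."""
    return [seq[i:i + limit] for i in range(0, len(seq), limit)]


# One long sequence of the length reported above.
long_seq = list(range(886443))

kept = truncate(long_seq)
chunks = split_into_chunks(long_seq)

print(len(kept))        # tokens that survive truncation
print(len(chunks))      # number of sequences produced by splitting
print(len(chunks[-1]))  # length of the final, shorter chunk
```

Under truncation almost all of the long sequence is discarded, whereas splitting turns it into thousands of training sequences, so the two approaches expose the model to very different amounts of that data.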

ivyleavedtoadflax commented 4 years ago

I'm going to have a play with #28 over the weekend. If it works out it may also fix this issue.

ivyleavedtoadflax commented 4 years ago

I'm going to have a play with #28 over the weekend. If it works out it may also fix this issue.

So #28 is not going to fix anything anytime soon. But I hope that the CRF layer will be included in tensorflow addons soon, and then we will be able to update the model to use tf 2.0. In the meantime this problem persists.