writer / fitbert

Use BERT to Fill in the Blanks
https://pypi.org/project/fitbert/
Apache License 2.0

Remove redundancy in returned delems when invoking Delemmatize #23

Closed sturmianseq closed 2 years ago

sturmianseq commented 3 years ago

This PR removes duplicate entries from the result returned by calling a Delemmatizer object. For example, if the tests test_delemmatizes_lemmas and test_delemmatizes_non_lemmas are run twice, they can fail on the second run because the returned result contains redundant elements.

    def test_delemmatizes_lemmas():
>       assert dl("look") == [
            "looked",
            "looking",
            "looks",
            "look",
        ], "should delemmatize lemmas"
E       AssertionError: should delemmatize lemmas
E       assert ['looked', 'l...look', 'look'] == ['looked', 'l...ooks', 'look']
E         Left contains one more item: 'look'
E         Use -v to get the full diff

In the above error message, we see that look appears twice in the returned result.

This PR fixes this kind of issue: instead of appending word to delems directly, the fix first checks whether word already exists in delems and adds it only if it does not, so the result contains no redundant elements.
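
A minimal sketch of the check described above, assuming delems is a plain list of strings. The helper name append_unique and the sample data are hypothetical and only for illustration; this is not the actual fitbert code or the exact diff in this PR.

```python
from typing import Iterable, List


def append_unique(delems: List[str], candidates: Iterable[str]) -> List[str]:
    """Append each candidate to delems unless it is already present.

    Hypothetical helper: it mirrors the "check before adding" idea from
    the PR description and keeps the original order of delems.
    """
    for word in candidates:
        if word not in delems:  # skip words that are already in the list
            delems.append(word)
    return delems


# Assumed starting state: the inflected forms of "look" are already collected.
delems = ["looked", "looking", "looks"]

# Adding the lemma itself twice, as could happen across repeated calls,
# leaves only one copy in the list.
append_unique(delems, ["look"])
append_unique(delems, ["look"])
print(delems)
```

The membership test on a list is linear in the list length, which should be acceptable for the handful of forms a single word produces. For much larger lists, tracking seen words in a set alongside the list would avoid the repeated scans.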