writer / replaCy

spaCy match and replace, maintaining conjugation
https://pypi.org/project/replacy/
MIT License

Span overlap ESpan class #96

Closed by sam-writer 3 years ago

sam-writer commented 3 years ago

Tested this in the REPL:

>>> from replacy import ESpan
>>> from replacy import ReplaceMatcher
>>> from replacy.db import load_json
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> match_dict = load_json('./replacy/resources/match_dict.json')
>>> ematcher = ReplaceMatcher(nlp, match_dict=match_dict, SpanClass=ESpan)
>>> s = ematcher("She extracts revenge.")[0]
>>> s
extracts
>>> s.suggestions
['exacts']
>>> s.match_name
'extract-revenge'
>>> doc = nlp("She extracts revenge.")
>>> e = ESpan(doc, 1, 2)
>>> e
extracts
>>> e.suggestions
[]
>>> e.match_name
''
>>> e.comment
''
>>> e.vector
array([ 3.7579389 , 390608 ,  ... , 0.37205344], dtype=float32)
>>> e.kb_id
0
>>> e.vector_norm
25.310467
>>> e._.comment = "yo metaprogramming"
>>> e.comment
'yo metaprogramming'
>>> e == e._
True
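
The forwarding shown above (`e._.comment` readable as `e.comment`) can be illustrated with a minimal sketch. It uses stand-in classes rather than spaCy, and `Underscore` and `ForwardingSpan` are hypothetical names; this is one way such forwarding might be done, not necessarily how ESpan does it.

```python
import copy


class Underscore:
    """Stand-in for spaCy's `span._` extension namespace (hypothetical)."""

    def __init__(self, defaults):
        # Deep-copy so mutable defaults (lists) are not shared between spans.
        self.__dict__.update(copy.deepcopy(defaults))


class ForwardingSpan:
    """Hypothetical span that exposes `span._.name` as `span.name`."""

    DEFAULTS = {"suggestions": [], "match_name": "", "comment": ""}

    def __init__(self, text):
        self.text = text
        self._ = Underscore(self.DEFAULTS)

    def __getattr__(self, name):
        # Python calls this only when normal lookup fails,
        # so real attributes such as `text` always win.
        if name == "_":
            # Guard against recursion if `_` has not been set yet.
            raise AttributeError(name)
        try:
            return getattr(self.__dict__["_"], name)
        except (KeyError, AttributeError):
            raise AttributeError(name) from None


e = ForwardingSpan("extracts")
e._.comment = "yo metaprogramming"
print(e.comment)      # yo metaprogramming
print(e.suggestions)  # []
```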

I added the following to the match_dict to check for overlap behavior:

{
  "make-due": {
    "patterns": [
      {
        "LEMMA": "make",
        "TEMPLATE_ID": 1
      },
      {
        "LOWER": "due"
      }
    ],
    "suggestions": [
      [
        {
          "TEXT": "make",
          "FROM_TEMPLATE_ID": 1
        },
        {
          "TEXT": "do"
        }
      ]
    ]
  },
  "dupe-test": {
    "patterns": [
      {
        "LEMMA": "make",
        "TEMPLATE_ID": 1
      }
    ],
    "suggestions": [
      [
        {
          "TEXT": "build",
          "FROM_TEMPLATE_ID": 1
        }
      ]
    ],
    "comment": "This is a bad match, it is here to demonstrate overlap behavior",
    "test": {
      "positive": ["I will make something"],
      "negative": []
    }
  }
}

This gives:

>>> spans = ematcher("I will make due")
>>> spans
[make, make due]
>>> spans[0].match_name
'dupe-test'
>>> spans[1].match_name
'make-due'
>>> spans[1].suggestions
['make do']
>>> spans[0].suggestions
['build']
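
To make the overlap in this output explicit, here is a small sketch that checks which matches share tokens. It works on plain (name, start, end) records instead of spaCy spans; the `Match` type, the `overlapping_pairs` helper, and the token offsets are assumptions made for illustration.

```python
from typing import List, NamedTuple, Tuple


class Match(NamedTuple):
    name: str   # rule name from the match dict
    start: int  # index of the first token
    end: int    # index one past the last token


def overlapping_pairs(matches: List[Match]) -> List[Tuple[str, str]]:
    """Return the names of every pair of matches that share a token."""
    pairs = []
    for i, a in enumerate(matches):
        for b in matches[i + 1:]:
            # Half-open ranges intersect when each starts before the other ends.
            if a.start < b.end and b.start < a.end:
                pairs.append((a.name, b.name))
    return pairs


# Assumed token offsets for "I will make due": I=0, will=1, make=2, due=3.
matches = [Match("dupe-test", 2, 3), Match("make-due", 2, 4)]
print(overlapping_pairs(matches))  # [('dupe-test', 'make-due')]
```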

sam-writer commented 3 years ago

Added a factory method so you don't have to import ESpan. Not sure which I like more:

from replacy import ReplaceMatcher
from replacy.db import load_json
import spacy
nlp = spacy.load("en_core_web_sm")
match_dict = load_json('./replacy/resources/match_dict.json')
ematcher = ReplaceMatcher.with_espan(nlp, match_dict=match_dict)
s = ematcher("She extracts revenge.")[0]

Thoughts?
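
For comparison, the two construction styles can be sketched with stand-in classes. `Matcher`, `PlainSpan`, and `ExtendedSpan` are hypothetical placeholders for ReplaceMatcher, spaCy's Span, and ESpan; only the constructor wiring is shown, and the real `with_espan` may differ.

```python
class PlainSpan:
    """Stand-in for spaCy's Span (hypothetical)."""


class ExtendedSpan(PlainSpan):
    """Stand-in for ESpan (hypothetical)."""


class Matcher:
    """Stand-in for ReplaceMatcher; only the constructor wiring is shown."""

    def __init__(self, nlp, match_dict=None, SpanClass=PlainSpan):
        self.nlp = nlp
        self.match_dict = match_dict or {}
        self.SpanClass = SpanClass

    @classmethod
    def with_espan(cls, nlp, match_dict=None):
        # Alternative constructor: same arguments, span class preselected.
        return cls(nlp, match_dict=match_dict, SpanClass=ExtendedSpan)


explicit = Matcher(None, match_dict={}, SpanClass=ExtendedSpan)
via_factory = Matcher.with_espan(None, match_dict={})
print(explicit.SpanClass is via_factory.SpanClass)  # True
```

The factory keeps the caller's imports shorter, while the explicit `SpanClass` argument leaves room for other span subclasses.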

manhal-daaboul commented 3 years ago

The match dict in your test doesn't represent the overlap problem we are trying to solve; this one does. It STILL WORKS with this one:

{
    "make-1": {
        "patterns": [
            {
                "LEMMA": "make"
            }
        ],
        "suggestions": [
            [
                {
                    "TEXT": "MAKE"
                }
            ]
        ],
        "subcategory": "MAKE_CAPS"
    },
    "make-2": {
        "patterns": [
            {
                "LEMMA": "make"
            }
        ],
        "suggestions": [
            [
                {
                    "TEXT": "MaKe"
                }
            ]
        ],
        "subcategory": "MAKE_STYLE",
        "comment": "This is a bad match, it is here to demonstrate overlap behavior"
    }
}
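
The case this match dict targets is two rules firing on exactly the same tokens. A minimal sketch of spotting that situation follows; the `group_by_span` helper and the token offsets are assumptions for illustration and are not part of replaCy.

```python
from collections import defaultdict


def group_by_span(matches):
    """Map each (start, end) token range to the rules that matched it,
    keeping only ranges claimed by more than one rule."""
    groups = defaultdict(list)
    for name, start, end in matches:
        groups[(start, end)].append(name)
    return {span: names for span, names in groups.items() if len(names) > 1}


# Both rules use the pattern LEMMA == "make", so they match the same token.
# Offsets assume "make" is token 2, as in "I will make due".
matches = [("make-1", 2, 3), ("make-2", 2, 3)]
print(group_by_span(matches))  # {(2, 3): ['make-1', 'make-2']}
```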

manhal-daaboul commented 3 years ago

@sam-qordoba all good, just one change: I added a has_extension implementation.