writer / replaCy

spaCy match and replace, maintaining conjugation
https://pypi.org/project/replacy/
MIT License

Filter spans by containment #49

Closed sam-writer closed 4 years ago

sam-writer commented 4 years ago

In most of our replaCy-powered apps, we filter spans by containment: e.g., if there are three matches and match 1 is contained in match 2, which is contained in match 3, then we return only match 3. The logic we use is:

from itertools import combinations
from typing import List, Optional, Tuple

from spacy.tokens import Span


def span_contains(s1: Span, s2: Span) -> Tuple[bool, Optional[Span], Optional[Span]]:
    """
    Return (contained?, container span, contained span).
    Both spans are None when neither span contains the other.
    """
    start1, end1 = s1.start, s1.end
    start2, end2 = s2.start, s2.end
    if start2 >= start1 and end2 <= end1:
        # s2 lies entirely inside s1
        return True, s1, s2
    elif start1 >= start2 and end1 <= end2:
        # s1 lies entirely inside s2
        return True, s2, s1
    else:
        return False, None, None


def filter_spans_by_containment(spans: List[Span]) -> List[Span]:
    # Assume every span survives until some other span is found to contain it.
    span_check = {k: True for k in spans}
    for s, t in combinations(spans, 2):
        containment, _container, contained = span_contains(s, t)
        if containment:
            span_check[contained] = False
    return [k for k, v in span_check.items() if v]
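
To make the nested case concrete, here is a minimal sketch of the same containment rule run on plain token offsets instead of spaCy Span objects, so it runs without a model. `Offsets`, `contains`, and `filter_by_containment` are hypothetical names used only for illustration; they are not part of replaCy or spaCy.

```python
from itertools import combinations
from typing import List, NamedTuple


class Offsets(NamedTuple):
    """Hypothetical stand-in for a Span: token start and exclusive end."""
    start: int
    end: int


def contains(outer: Offsets, inner: Offsets) -> bool:
    """True when inner lies entirely within outer."""
    return outer.start <= inner.start and inner.end <= outer.end


def filter_by_containment(spans: List[Offsets]) -> List[Offsets]:
    """Drop every span that is contained in another span."""
    dropped = set()
    for (i, s), (j, t) in combinations(enumerate(spans), 2):
        if contains(s, t):
            dropped.add(j)
        elif contains(t, s):
            dropped.add(i)
    return [s for i, s in enumerate(spans) if i not in dropped]


# Three nested matches: the first sits inside the second, which sits inside the third.
nested = [Offsets(2, 3), Offsets(1, 4), Offsets(0, 6)]
outermost = filter_by_containment(nested)
print(outermost)
```

Only the outermost match, `Offsets(0, 6)`, survives. Tracking dropped positions by index rather than using the spans as dictionary keys is a design choice of this sketch: it keeps one copy when two matches cover exactly the same tokens.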

Is this something we want to do in replaCy? If so, by default or opt-in?

sam-writer commented 4 years ago

spacy.util has a function for this, filter_spans. If a project is using replaCy, it is already using spaCy, so calling this function takes two lines. I don't think we need to offer it to the user; we can't make it any more convenient than it already is.
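
One thing worth keeping in mind when choosing between the two: `spacy.util.filter_spans` is documented as removing duplicates and overlaps, preferring the longest span and, on ties, the earlier one. That is a stricter rule than containment. The sketch below re-implements that documented behavior in plain Python on `(start, end)` token offsets to show the difference; `longest_first` is a hypothetical name and this is not spaCy's own code.

```python
from typing import List, Tuple

# Hypothetical stand-in for a Span: (token start, exclusive token end).
Offsets = Tuple[int, int]


def longest_first(spans: List[Offsets]) -> List[Offsets]:
    """Greedy overlap filter: the longest span wins, ties go to the earlier span."""
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept: List[Offsets] = []
    seen_tokens = set()
    for start, end in ordered:
        # Keep the span only if neither of its boundary tokens is already claimed.
        if start not in seen_tokens and end - 1 not in seen_tokens:
            kept.append((start, end))
            seen_tokens.update(range(start, end))
    return sorted(kept)


# Nested matches: same outcome as filtering by containment.
nested = longest_first([(2, 3), (1, 4), (0, 6)])
# Partially overlapping matches: neither contains the other, yet only one is kept.
partial = longest_first([(0, 3), (2, 5)])
print(nested)
print(partial)
```

For nested matches the two approaches agree. For partially overlapping matches such as `(0, 3)` and `(2, 5)`, a containment-only filter would keep both, while the longest-first rule keeps only `(0, 3)`.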

sam-writer commented 4 years ago

This is very easy now with custom pipeline components:

import en_core_web_sm
from replacy import ReplaceMatcher
from replacy.db import load_json
from spacy.util import filter_spans

nlp = en_core_web_sm.load()
replaCy = ReplaceMatcher(nlp, load_json('path to match dict(s)'))
replaCy.add_pipe(filter_spans, name="filter_spans", before="joiner")

Though... I think this should maybe be the default behavior.

melisa-writer commented 4 years ago

Put the example into our wiki. I think the less we define as default, the less we need to explain at the very beginning, and the more accessible replaCy is.