writer / replaCy

spaCy match and replace, maintaining conjugation
https://pypi.org/project/replacy/
MIT License

Support referencing matched tokens #30

Closed sam-writer closed 4 years ago

sam-writer commented 4 years ago

The idea is to handle generic cases that are needed if we want replaCy to be an option for rule-based GEC.

Here is a snippet of a match_dict with examples of this:

{
    "lt-example": {
        "patterns": [
            {
                "LOWER": {
                    "IN": [
                        "have",
                        "has"
                    ]
                }
            },
            {
                "TAG": {
                    "IN": [
                        "VBD",
                        "VBP",
                        "VB"
                    ]
                }
            },
            {
                "TAG": {
                    "NOT_IN": [
                        "VBG"
                    ]
                }
            }
        ],
        "suggestions": [
            [
                {
                    "PATTERN_REF": 0
                },
                {
                    "PATTERN_REF": 1,
                    "INFLECTION": "VBN"
                },
                {
                    "PATTERN_REF": 2
                }
            ]
        ],
        "message": "Possible agreement error -- use past participle here"
    },
    "extract-revenge": {
        "patterns": [
            {
                "LEMMA": "extract",
                "TEMPLATE_ID": 1
            },
            {
                "LOWER": "revenge"
            }
        ],
        "suggestions": [
            [
                {
                    "TEXT": "exact",
                    "FROM_TEMPLATE_ID": 1
                },
                {
                    "PATTERN_REF": 1
                }
            ]
        ]
    }
}

The first pattern is my attempt to translate the following LanguageTool (LT) pattern into replaCy:

<pattern>
 <token regexp="yes">has|have</token>
 <marker>
   <token postag="VBD|VBP|VB" postag_regexp="yes">
     <exception postag="VBN|NN:U.*|JJ.*|RB" postag_regexp="yes"/>
   </token>
 </marker>
 <token><exception postag="VBG"/></token>
</pattern>
<message>
  Possible agreement error -- use past participle here:
  <suggestion><match no="2" postag="VBN"/></suggestion>.
</message>

The second pattern is an attempt to redo the extract revenge pattern using this proposed syntax. I am not saying we'd want that change (it doesn't produce a minimal diff).
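To make the proposal concrete, here is a minimal sketch of how a suggestion list could be resolved against the matched tokens. None of this is replaCy's actual implementation: `resolve_suggestion` and the `inflect` hook are hypothetical names, tokens are plain strings rather than spaCy tokens, and `FROM_TEMPLATE_ID` is ignored so the literal `TEXT` is used as-is. The example uses `VBN`, following the `postag="VBN"` in the LT rule.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical inflection hook: (word, Penn Treebank tag) -> inflected word.
Inflector = Callable[[str, str], str]


def resolve_suggestion(
    matched: List[str],
    suggestion: List[Dict],
    inflect: Optional[Inflector] = None,
) -> str:
    """Build one suggestion string from the tokens a pattern matched."""
    out = []
    for item in suggestion:
        if "PATTERN_REF" in item:
            # Copy the token matched by the pattern at this index.
            text = matched[item["PATTERN_REF"]]
        else:
            # Literal replacement text (FROM_TEMPLATE_ID is ignored here).
            text = item["TEXT"]
        if "INFLECTION" in item and inflect is not None:
            text = inflect(text, item["INFLECTION"])
        out.append(text)
    return " ".join(out)


# Toy inflector backed by a lookup table, only for this demo.
TOY_FORMS = {("went", "VBN"): "gone"}


def toy_inflect(word: str, tag: str) -> str:
    return TOY_FORMS.get((word, tag), word)


lt_suggestion = [
    {"PATTERN_REF": 0},
    {"PATTERN_REF": 1, "INFLECTION": "VBN"},
    {"PATTERN_REF": 2},
]
print(resolve_suggestion(["have", "went", "home"], lt_suggestion, toy_inflect))
# have gone home
```

In a real implementation the inflection step would presumably go through whatever replaCy already uses for conjugation; the lookup table only stands in for it.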

sam-writer commented 4 years ago

This would also allow us to handle this inclusivity match [image] with something like:

{
    "is-addicted-to": {
        "patterns": [
            {
                "LEMMA": "be",
                "TEMPLATE_ID": 1
            },
            {
                "LOWER": "addicted"
            },
            {
                "LOWER": "to"
            },
            {
                "POS": "NOUN"
            },
        ],
        "suggestions": [
            [
                {
                    "TEXT": "has",
                    "FROM_TEMPLATE_ID": 1
                },
                {
                    "TEXT": "a"
                    // it would be cool if we could add "AUTO": true or something to get a-or-an
                    // or "FROM_SUGGESTION_REF": 2
                    // to explicitly point it at the next token
                },
                {
                    "PATTERN_REF": 3
                },
                {
                    "TEXT": "use disorder"
                }
            ]
        ]
    }
}
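The a-or-an idea from the comments could look roughly like this. It is only a sketch under assumptions: `choose_article` and `fill_articles` are hypothetical helpers, `FROM_SUGGESTION_REF` is read as a 0-based index into the same suggestion list, and the article is chosen by spelling alone, so words like "hour" or "university" would come out wrong.

```python
from typing import Dict, List

VOWELS = {"a", "e", "i", "o", "u"}


def choose_article(next_word: str) -> str:
    """Pick "a" or "an" from the first letter of the following word."""
    return "an" if next_word[:1].lower() in VOWELS else "a"


def fill_articles(items: List[Dict], resolved: List[str]) -> List[str]:
    """Replace article slots using the already-resolved suggestion tokens.

    items    -- the suggestion items from the match_dict
    resolved -- one resolved string per item, in the same order
    """
    out = list(resolved)
    for i, item in enumerate(items):
        ref = item.get("FROM_SUGGESTION_REF")
        if ref is not None:
            # Look at the token this slot points to and pick the article.
            out[i] = choose_article(resolved[ref])
    return out


items = [
    {"TEXT": "has", "FROM_TEMPLATE_ID": 1},
    {"TEXT": "a", "FROM_SUGGESTION_REF": 2},
    {"PATTERN_REF": 3},
    {"TEXT": "use disorder"},
]
resolved = ["has", "a", "alcohol", "use disorder"]
print(" ".join(fill_articles(items, resolved)))
# has an alcohol use disorder
```

An "AUTO": true flag could reuse the same helper by always looking at the next item, at the cost of being less explicit than a reference.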