writer / replaCy

spaCy match and replace, maintaining conjugation
https://pypi.org/project/replacy/
MIT License
34 stars 8 forks source link

NLP-53 support suggestions REF with OP in pattern in spacy 3 #110

Closed manhal-daaboul closed 3 years ago

manhal-daaboul commented 3 years ago

This support works for projects with spacy >= 3.0.6

Adding this support to previous versions were very complex (couldn't get to stable version)

sam-writer commented 3 years ago

I know you are going to hate this, but... I am tired of the indentation war. We use black-style, not Pycharm-style. I want you to be able to use whatever editor you want, but the indentation changes add noise. I cannot believe that with all of their settings, PyCharm won't let you set the continuing indentation value to 4 instead of 8!? The docs make it seem like you can in "code style > Python > tabs and indents > continuation indent" ?

You can read more about the justification for Black's choice if you need to https://github.com/psf/black/issues/1178 - I find the Black team's arguments good, and personally find their choice to be the best one.

The company pays for the PyCharm license right? If it really is impossible to get PyCharm to indent in a black-compatible way, then we need to file a support ticket and have them fix it.

manhal-daaboul commented 3 years ago

added black support to pycharm, then reformatted all the files in this PR

sam-writer commented 3 years ago

A lot of these changes make it so replaCy doesn't work as well, or at all, with spaCy 2, right? The extra list nesting of the patterns is the main thing I'm thinking of.

I think one thing that is happening here is that we stop using spaCy's system for callbacks, and instead run them on our own, after the matcher. I think this isn't a big change, looking at spaCy's source. But it feels strange, to not use the system that is there. It seems like we want to invoke the callback sooner, because it calls SuggestionGenerator, which in turn calls a RefMatcher, and if we are calling RefMatcher36, it needs to be passed alignments, which are new to spaCy 3, and the 4th element of the tuple that is a match, id, start, end, alignments... but I still can't see why we have to not use the matcher? is it the call to if spacy_has_alignment_info(): on line 233? does that not work when it is in a callback?

Going to throw something out there: we decided a while ago to support both spaCy 2 and 3 in the same codebase because it seemed like it wouldn't be painful to do... it now seems like we may have been wrong? Maybe we should split replaCy at this point into two versions? and stop maintaining the v2 one as soon as we can move all our services over to spaCy 3?

Or maybe: do we need to be this drastic in the changes? Since replaCy can only ever support a single pattern, not a list of patterns, do we really need to change the format of our JSONs? or can we just add the extra List nesting before processing the dict?

manhal-daaboul commented 3 years ago

we switched from spacy callback because spacy callback doesn't pass alignment info to the callback function.

The current code still support both spacy 2 and 3. While tests only support spacy 3. We can still rewrite tests to work for both versions by adding extra list before processing the dict, as you suggested, but do we actually need this?

At this point we are still able to support both spacy versions. I do agree at some point later we need to stop doing this.

sam-writer commented 3 years ago

Oh, we need to support spaCy 2 for a while. For example, ERRANT only works with spaCy 2, and GEC also uses replaCy. So we need to support spaCy 2 until: we rewrite ERRANT, or spaCy 3 support is otherwise added (hard to imagine Chris doing that, but maaaybe).

So I'm not saying we drop support for spaCy 2. I was suggesting more that we do what spaCy does, and have 2 branches, a 2.0.x branch and a 3.0.x branch. Instead of having the 2/3 code mixed up in one (somewhat confusing) codebase, we have 2 simpler branches?

The obvious objection is "that is double the work" but I think it is not. If you look at the recent PRs, most work is adding stuff to v3, and tweaking default match hooks. If we call default match hooks done, then any match hook code is going somewhere else / in a file that can be the same across branches. and the v3 work would obviously only need to happen on the v3 branch.

sam-writer commented 3 years ago

Also, I don't want to block the feature downstream of this forever, so I guess I'm okay with merging this and postponing this conversation. I just don't want to keep this up forever, I feel like a one-time effort of separating out all the 3 code, at this point, is warranted and will make maintenance easier.

manhal-daaboul commented 3 years ago

splitting spacy 2 & 3 code https://github.com/writerai/replaCy/tree/spacy-v2.x https://github.com/writerai/replaCy/tree/spacy-v3.x