writer / replaCy

spaCy match and replace, maintaining conjugation
https://pypi.org/project/replacy/
MIT License

extract Scorer #85

Open sam-writer opened 4 years ago

sam-writer commented 4 years ago

KenLMScorer is fantastic. Just so useful. However, it isn't core to replaCy, so it should be a separately installable custom pipeline component. We'd still expect most people to use it: think of what en_core_web_sm is for spaCy, a separate installation that nonetheless shows up in all the docs.

After extraction, I think using our current pipeline should look like this:

import en_core_web_sm
from replacy import ReplaceMatcher
from replacy.components import MaxCountFilter
from replacy_kenlm_scorer import KenLMScorer  # proposed separate package
from spacy.util import filter_spans

replaCy = ReplaceMatcher(en_core_web_sm.load(), etc...)
# drop overlapping spans first, then score, then cap the number of suggestions
replaCy.add_pipe("span_filter", filter_spans, first=True)
replaCy.add_pipe("scorer", KenLMScorer(model_or_path), after="span_filter")
replaCy.add_pipe("max_count_filter", MaxCountFilter(defaults...), after="scorer")
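
To make the "separate component" idea concrete, here is a minimal sketch of what an extracted scorer could look like. Everything in it is an assumption for illustration: `SketchScorer` and `LanguageModel` are hypothetical names, and it ranks plain strings rather than spaCy spans so that it stays dependency-free. The only thing it expects from the model is a `score(sentence)` method returning a log-probability, which `kenlm.Model` provides.

```python
from typing import List, Protocol, Tuple


class LanguageModel(Protocol):
    """Anything exposing score(sentence) -> log-probability (kenlm.Model does)."""

    def score(self, sentence: str) -> float: ...


class SketchScorer:
    """Hypothetical stand-in for an extracted KenLMScorer; not the real class."""

    def __init__(self, model: LanguageModel):
        # The model is injected, so replaCy core never needs to import kenlm.
        self.model = model

    def __call__(self, candidates: List[str]) -> List[Tuple[str, float]]:
        # Score every candidate sentence, then sort the most likely
        # (highest log-probability) first.
        scored = [(text, self.model.score(text)) for text in candidates]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, `SketchScorer(kenlm.Model(path))` would be the KenLM-backed variant, and any other object with a compatible `score` method could be swapped in for testing.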

sam-writer commented 4 years ago

This component should ship with the biggest KenLM model we can fit in and still have PyPI allow it... but we could also have instructions that you can curl -O 'https://master.dl.sourceforge.net/project/openccg/data/gigaword4.5g.kenlm.bin' (or even wrap that download in a helper method, like so):

from replacy_kenlm_scorer import KenLMScorer

klm = KenLMScorer.download_gigaword()
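
A rough sketch of how such a download helper might work, assuming a cache-then-fetch approach. The function below is hypothetical: the default cache directory, the `fetch` parameter, and the fact that it returns a path (rather than a ready `KenLMScorer`) are all assumptions made for illustration. Only the URL comes from the comment above.

```python
import os
from typing import Callable
from urllib.request import urlretrieve

# URL quoted in the issue; the file is large, so it is fetched at most once.
GIGAWORD_URL = (
    "https://master.dl.sourceforge.net/project/openccg/data/"
    "gigaword4.5g.kenlm.bin"
)

# Assumed cache location; not something replaCy defines.
DEFAULT_CACHE_DIR = os.path.join(os.path.expanduser("~"), ".cache", "replacy")


def download_gigaword(
    cache_dir: str = DEFAULT_CACHE_DIR,
    fetch: Callable[[str, str], object] = urlretrieve,
) -> str:
    """Return a local path to the model, downloading it only if missing."""
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, os.path.basename(GIGAWORD_URL))
    if not os.path.exists(target):
        # Download to a temporary name so an interrupted transfer
        # is never mistaken for a complete model file.
        partial = target + ".part"
        fetch(GIGAWORD_URL, partial)
        os.replace(partial, target)
    return target
```

With something like this in place, the classmethod proposed above could be a thin wrapper that passes the returned path to the `KenLMScorer` constructor. Making `fetch` injectable keeps the helper testable without touching the network.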