wmo-im / pywcmp

pywcmp provides validation and quality assessment capabilities for the WMO WIS Core Metadata Profile (WCMP)
https://community.wmo.int/activity-areas/wis

KPI-2 and KPI-3 spellchecking #29

Closed tomkralidis closed 3 years ago

tomkralidis commented 3 years ago

Both KPI-2 and KPI-3 provide a point for passing a basic spellcheck.

pywcmp uses pyspellchecker to perform a basic spellcheck. In both the KPI-2 and KPI-3 implementations, we split the title or abstract on whitespace and send the resulting words to pyspellchecker, as shown below (a passing spellcheck returns an empty set).

>>> from spellchecker import SpellChecker
>>> s = SpellChecker()
>>> s.unknown('this is a test'.split())
set()
>>> s.unknown('this is a test (hi)'.split())
{'(hi)'}
>>> s.unknown('this is a test hi.'.split())
{'hi.'}
>>> s.unknown('this is a sentence. This is another sentence.'.split())
{'sentence.'}

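As we can see, splitting on whitespace leaves punctuation attached to words, so words in parentheses and words that end a sentence (followed by a period) are reported as unknown. These are false positives.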

We need to update the spellchecking strategy and then apply it by refactoring the KPI-2 and KPI-3 spellchecking into a single function, for reuse and consistency.
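One possible shape for that shared function is sketched below. It is not pywcmp's actual implementation. It assumes the fix is to extract words with a regular expression instead of splitting on whitespace. The names `WORD_RE`, `tokenize` and `find_misspellings` are hypothetical, and the checker is passed in as a callable so the same helper could serve both KPIs.

```python
import re
from typing import Callable, Iterable, List, Set

# Hypothetical names for illustration; none of these exist in pywcmp.
# Matches runs of letters, allowing internal apostrophes (e.g. "don't").
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def tokenize(text: str) -> List[str]:
    """Return the words in text with surrounding punctuation dropped."""
    return WORD_RE.findall(text)


def find_misspellings(
    text: str,
    unknown: Callable[[Iterable[str]], Set[str]],
) -> Set[str]:
    """Return the words in text that the `unknown` callable does not recognize.

    `unknown` is expected to behave like pyspellchecker's
    SpellChecker().unknown: take an iterable of words, return a set.
    """
    return set(unknown(tokenize(text)))


# Assumed usage with pyspellchecker (requires the package to be installed):
#   from spellchecker import SpellChecker
#   find_misspellings('this is a test (hi).', SpellChecker().unknown)
```

With this tokenizer, `(hi)` and `sentence.` reach the checker as `hi` and `sentence`. One trade-off is that hyphenated terms are split into their parts and tokens containing digits are ignored, which may or may not suit metadata titles and abstracts.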