wikimedia / revscoring

A generic, machine learning-based revision scoring system for MediaWiki
https://revscoring.readthedocs.io
MIT License

aspell installation for MacOS #498

Open paulkernfeld opened 4 years ago

paulkernfeld commented 4 years ago

While running the sample MacOS installation script for the aspell dictionaries, I hit this:

aspell-pt-0.50-2/português.alias: Can't create 'aspell-pt-0.50-2/português.alias'

This looks related to this bug, which was apparently fixed in aspell 0.51.1-5. However, I can't verify that the fix works, because the aspell dictionary for language pt is only available in versions 0.50-1 and 0.50-2. I think this is because pt hasn't been updated since 2003 and seems to have been replaced by separate pt_PT and pt_BR dictionaries.

I was able to install Portuguese that way, but it doesn't seem like revscoring is picking it up. Looking at revscoring/languages/portuguese.py, it seems like it's trying to use dictionaries from myspell rather than aspell. I'm not too familiar with the spell-checking software ecosystem, so I think I could benefit from some advice on which way to go. Should I be trying to install Portuguese for aspell, or should I try to use myspell or hunspell? And where should I get the dictionaries from? I did track down the myspell-pt-pt package for xenial... should I maybe just install the dictionary files directly from the Ubuntu package? That seems a little funky so I wanted to stop and ask for advice in case I'm making the problem overly complicated 😂

halfak commented 4 years ago

What error are you getting after you install the pt_PT and pt_BR packages? In theory aspell, myspell, and hunspell should all work with pyenchant.

paulkernfeld commented 4 years ago

Yeah, I probably should have posted the actual error 😁

Here's what I'm seeing:

(/Users/paul/repos/revscoring/venv) [15:44:2]% pytest
========================================================= test session starts =========================================================
platform darwin -- Python 3.7.9, pytest-6.0.2, py-1.9.0, pluggy-0.13.1
rootdir: /Users/paul/repos/revscoring, configfile: pytest.ini
plugins: cov-2.10.1
collected 294 items / 26 errors / 268 selected

=============================================================== ERRORS ================================================================

(Similar errors for many languages)

_________________________________________ ERROR collecting tests/languages/test_portuguese.py _________________________________________
ImportError while importing test module '/Users/paul/repos/revscoring/tests/languages/test_portuguese.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
revscoring/languages/features/dictionary/util.py:20: in load_dict
    return enchant.Dict(dict_name)
venv/lib/python3.7/site-packages/enchant/__init__.py:562: in __init__
    _EnchantObject.__init__(self)
venv/lib/python3.7/site-packages/enchant/__init__.py:168: in __init__
    self._init_this()
venv/lib/python3.7/site-packages/enchant/__init__.py:569: in _init_this
    this = self._broker._request_dict_data(self.tag)
venv/lib/python3.7/site-packages/enchant/__init__.py:310: in _request_dict_data
    self._raise_error(eStr % (tag,),DictNotFoundError)
venv/lib/python3.7/site-packages/enchant/__init__.py:258: in _raise_error
    raise eclass(default)
E   enchant.errors.DictNotFoundError: Dictionary for language 'pt_PT' could not be found

During handling of the above exception, another exception occurred:
venv/lib/python3.7/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/languages/test_portuguese.py:5: in <module>
    from revscoring.languages import portuguese
revscoring/languages/portuguese.py:8: in <module>
    load_dict('pt_PT', 'myspell-pt-pt'),
revscoring/languages/features/dictionary/util.py:24: in load_dict
    "Consider installing {1!r}").format(dict_name, target_package))
E   ImportError: No enchant-compatible dictionary found for 'pt_PT'.  Consider installing 'myspell-pt-pt'

(Similar errors for many languages)

========================================================== warnings summary ===========================================================
venv/lib/python3.7/site-packages/boto/plugin.py:40
  /Users/paul/repos/revscoring/venv/lib/python3.7/site-packages/boto/plugin.py:40: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
    import imp

venv/lib/python3.7/site-packages/scipy/sparse/sparsetools.py:21
  /Users/paul/repos/revscoring/venv/lib/python3.7/site-packages/scipy/sparse/sparsetools.py:21: DeprecationWarning: `scipy.sparse.sparsetools` is deprecated!
  scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.
    _deprecated()

-- Docs: https://docs.pytest.org/en/stable/warnings.html
======================================================= short test summary info =======================================================
ERROR tests/languages/test_arabic.py
ERROR tests/languages/test_basque.py
ERROR tests/languages/test_catalan.py
ERROR tests/languages/test_croatian.py
ERROR tests/languages/test_czech.py
ERROR tests/languages/test_dutch.py
ERROR tests/languages/test_estonian.py
ERROR tests/languages/test_galician.py
ERROR tests/languages/test_greek.py
ERROR tests/languages/test_hebrew.py
ERROR tests/languages/test_hungarian.py
ERROR tests/languages/test_icelandic.py
ERROR tests/languages/test_indonesian.py
ERROR tests/languages/test_italian.py
ERROR tests/languages/test_latvian.py
ERROR tests/languages/test_norwegian.py
ERROR tests/languages/test_persian.py
ERROR tests/languages/test_polish.py
ERROR tests/languages/test_portuguese.py
ERROR tests/languages/test_romanian.py
ERROR tests/languages/test_russian.py
ERROR tests/languages/test_serbian.py
ERROR tests/languages/test_spanish.py
ERROR tests/languages/test_swedish.py
ERROR tests/languages/test_ukrainian.py
ERROR tests/languages/test_vietnamese.py
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 26 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
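
For reference, the traceback above suggests that load_dict is a thin wrapper: it calls enchant.Dict(dict_name) and, on DictNotFoundError, raises an ImportError that names the second argument as a package to install. If so, 'myspell-pt-pt' is only a hint in the message, not a requirement that the dictionary come from myspell. Below is a minimal sketch of that behavior, reconstructed only from the traceback and not copied from the revscoring source. The dict_factory parameter and the stand-in exception class are assumptions, added so the example can run without enchant installed.

```python
try:
    import enchant
    from enchant.errors import DictNotFoundError
except ImportError:
    # Assumption for this sketch: if pyenchant (or the enchant C library)
    # is missing, fall back to a stand-in exception so the wrapper can
    # still be exercised with a fake dictionary factory.
    enchant = None

    class DictNotFoundError(Exception):
        pass


def load_dict(dict_name, target_package, dict_factory=None):
    """Return a dictionary object for dict_name or raise ImportError.

    dict_factory is a hypothetical hook (not in the traceback); by
    default the real enchant.Dict is used.
    """
    if dict_factory is None:
        if enchant is None:
            raise ImportError("pyenchant could not be imported")
        dict_factory = enchant.Dict
    try:
        return dict_factory(dict_name)
    except DictNotFoundError:
        # Message text taken from the traceback in this thread.
        raise ImportError(
            ("No enchant-compatible dictionary found for {0!r}.  "
             "Consider installing {1!r}").format(dict_name, target_package))
```

On this reading, any enchant provider that can serve the 'pt_PT' tag should satisfy the import, whichever spell-checker supplies the dictionary.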

My output of aspell dump dicts is:

am
ar
ast
az
be
be_BY
be_SU
bg
bn
br
ca
ca-general
ca-valencia
cs
csb
cy
da
de
de-alt
de_AT
de_CH
de_DE
el
en
en-variant_0
en-variant_1
en-variant_2
en-w_accents
en-wo_accents
en_AU
en_AU-variant_0
en_AU-variant_1
en_AU-w_accents
en_AU-wo_accents
en_CA
en_CA-variant_0
en_CA-variant_1
en_CA-w_accents
en_CA-wo_accents
en_GB
en_GB-ise
en_GB-ise-w_accents
en_GB-ise-wo_accents
en_GB-ize
en_GB-ize-w_accents
en_GB-ize-wo_accents
en_GB-variant_0
en_GB-variant_1
en_GB-w_accents
en_GB-wo_accents
en_US
en_US-variant_0
en_US-variant_1
en_US-w_accents
en_US-wo_accents
eo
es
et
fa
fa-common
fa-generic
fa-scientific
fi
fo
fr-40
fr
fr-60
fr-80
fr-lrg
fr-med
fr-sml
fr_CH-40
fr_CH-60
fr_CH
fr_CH-80
fr_CH-lrg
fr_CH-med
fr_CH-sml
fr_FR-40
fr_FR
fr_FR-60
fr_FR-80
fr_FR-lrg
fr_FR-med
fr_FR-sml
fy
ga
gd
gl
gl-minimos
gr
grc
gu
gv
he
hi
hil
hr
hsb
hu
hus
hy
ia
id
it
kn
ku
ky
la
lt
lv
mg
mi
mk
ml
mn
mr
ms
mt
nds
nl
nn
ny
or
pa
pl
pt_BR
pt_PT
qu
ro
ro-classic
ru
ru-ye
ru-yeyo
ru-yo
rw
sc
sk
sk_SK
sl
sr
sr-cyrl
sr-latn
srd
sv
sw
ta
te
tet
tk
tl
tn
tr
uk
uz
vi
wa
yi
zu

halfak commented 4 years ago

In a terminal, can you try:

$ python
>>> import enchant
>>> d = enchant.Dict("pt_PT")

That is roughly what we're doing in Python to get access to the dictionary. I agree that the output of aspell dump dicts suggests this should work. I bet the error will be informative.
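
If the bare error is not informative enough, it may also help to ask enchant which providers it has loaded and which provider, if any, serves a given tag. The sketch below assumes enchant.list_dicts() returns (tag, provider) pairs and that each provider has a name attribute, as in the pyenchant documentation. The helper names providers_by_tag and report are made up for this example.

```python
from collections import defaultdict


def providers_by_tag(dict_pairs):
    """Group (tag, provider_name) pairs into {tag: [provider_name, ...]}."""
    grouped = defaultdict(list)
    for tag, provider_name in dict_pairs:
        grouped[tag].append(provider_name)
    return dict(grouped)


def report(tag):
    """Print the enchant version, its providers, and who serves `tag`."""
    # Raises ImportError if pyenchant or the enchant C library is missing.
    import enchant

    print("pyenchant", enchant.__version__)
    for provider in enchant.Broker().describe():
        print("provider:", provider.name, "-", provider.desc)
    pairs = [(t, p.name) for t, p in enchant.list_dicts()]
    print(tag, "served by:", providers_by_tag(pairs).get(tag, []))


if __name__ == "__main__":
    try:
        report("pt_PT")
    except ImportError as exc:
        print("enchant is not importable:", exc)
```

An empty "served by" list alongside a provider list that lacks aspell would point at the enchant build rather than at the installed aspell dictionaries.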

paulkernfeld commented 4 years ago

Here's what I have for that:

(/Users/paul/repos/revscoring/venv) [15:47]% python
Python 3.7.9 (default, Aug 31 2020, 07:22:35)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import enchant
>>> d = enchant.Dict("pt_PT")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/paul/repos/revscoring/venv/lib/python3.7/site-packages/enchant/__init__.py", line 562, in __init__
    _EnchantObject.__init__(self)
  File "/Users/paul/repos/revscoring/venv/lib/python3.7/site-packages/enchant/__init__.py", line 168, in __init__
    self._init_this()
  File "/Users/paul/repos/revscoring/venv/lib/python3.7/site-packages/enchant/__init__.py", line 569, in _init_this
    this = self._broker._request_dict_data(self.tag)
  File "/Users/paul/repos/revscoring/venv/lib/python3.7/site-packages/enchant/__init__.py", line 310, in _request_dict_data
    self._raise_error(eStr % (tag,),DictNotFoundError)
  File "/Users/paul/repos/revscoring/venv/lib/python3.7/site-packages/enchant/__init__.py", line 258, in _raise_error
    raise eclass(default)
enchant.errors.DictNotFoundError: Dictionary for language 'pt_PT' could not be found

Simonmaignan commented 2 years ago

Hi @paulkernfeld, @halfak and @accraze, I am also a bit stuck and puzzled by this error.

I installed aspell from Homebrew, which gave me this version:

@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.8)

My aspell dump dicts output is the same list as the one given by @paulkernfeld (it contains pt_PT and pt_BR).

If I install the latest pyenchant version, 3.2.2 (using pip install pyenchant), I am able to load the 'pt_PT' enchant dict as @halfak described above.

However, when I install the pyenchant version that revscoring requires as a dependency (pyenchant 1.6.11), the pt_PT dict cannot be found. I don't know what the difference is between the two pyenchant versions, but it seems that version 1.6.11 doesn't look for the aspell dictionaries.

I cannot find any documentation on how the apt-get install myspell-pt-pt step used for the Ubuntu installation would translate to installing the same dictionary on MacOS. I tried installing hunspell, but it doesn't seem to include any pt dictionaries.

@halfak, @accraze, is there a reason why we don't use a more recent version of pyenchant as a dependency of revscoring?

halfak commented 2 years ago

Interesting! We could just bump the version of pyenchant. It's only for incidental reasons that we haven't done that already.