zverok / spylls

Pure Python spell-checker, (almost) full port of Hunspell
https://spylls.readthedocs.io
Mozilla Public License 2.0

.dic and .aff content by param. #19

alecuba16 opened 2 years ago

alecuba16 commented 2 years ago

Hello!

Would it be possible to populate the dictionary by submitting a LIST with the contents of .dic and .aff?

This is useful in the case of Spark UDFs, where it is easier to pass LIST variables than to copy .dic and .aff files from the driver node to the executors.

Btw, is there any way to implement stemming like the original Hunspell library? Or is there some alternative for stemming?

zverok commented 2 years ago

Would it be possible to populate the dictionary by submitting a LIST with the contents of .dic and .aff?

It is theoretically possible. You'll need to implement a wrapper around a list that satisfies two requirements (a sketch of such a wrapper follows the list):

  1. It is iterable, producing pairs of (line number, line).
  2. It has a method reset_encoding(encoding_name) which can be called in the middle of iteration and makes the following lines be read in a different encoding.
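
A minimal sketch of such a wrapper, assuming the list holds raw bytes lines (the decoding details below are illustrative, not spylls internals):

class MyReader:
    # List-backed reader: iterates as (line number, line) pairs and
    # honors reset_encoding() in the middle of iteration.
    def __init__(self, lines, encoding='UTF-8'):
        self.lines = lines
        self.encoding = encoding

    def __iter__(self):
        for num, line in enumerate(self.lines, start=1):
            if isinstance(line, bytes):
                line = line.decode(self.encoding)
            yield (num, line.rstrip('\r\n'))

    def reset_encoding(self, encoding):
        # Called when the .aff declares a different encoding (SET directive);
        # lines yielded after this point are decoded accordingly.
        self.encoding = encoding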

Once you have this, you can just:

import spylls.hunspell.readers

aff, context = spylls.hunspell.readers.read_aff(MyReader(aff_lines_list))
dic = spylls.hunspell.readers.read_dic(MyReader(dic_lines_list), aff=aff, context=context)
dictionary = spylls.hunspell.Dictionary(aff, dic)

Btw, is there any way to implement stemming like the original Hunspell library? Or is there some alternative for stemming?

Not that convenient, but:

import spylls.hunspell
from spylls.hunspell.algo.capitalization import Type as CapType

dic = spylls.hunspell.Dictionary.from_files('examples/en_US')
for form in dic.lookuper.affix_forms('kittens', captype=CapType.NO):
    print(form.stem)
# prints: "kitten"

alecuba16 commented 2 years ago

Thanks for the reply. About the first issue: I was able to populate the dictionary with the method you suggested. I have some problems with the encoding of special characters, but that's something I will address next week.

On the second issue, the stemming: I did a test with the code you provided, but it seems some import (or library version) problem is preventing the captype from being passed:

Expected zero arguments for construction of ClassDict (for spylls.hunspell.algo.capitalization.Type)

The complete code snippet is this (ignore the Spark UDF wrapper):

from pyspark.sql import *
import pyspark.sql.functions as F
import pyspark.sql.types as T
import spylls.hunspell
from spylls.hunspell.algo.capitalization import Type as CapType
from pyspark import SparkFiles

def pyspark_transform(spark, df):
    def hunspell(desc):
        if desc:
            dic = spylls.hunspell.Dictionary.from_zip(SparkFiles.get("es_ES.zip"))
            return [sug for sug in dic.lookuper.affix_forms(desc, captype=CapType.NO)]
        else:
            return [""]

    dic_path = "hdfs:///hunspell/es_ES.zip"
    spark.sparkContext.addFile(dic_path)

    udf_hunspell = F.udf(hunspell, T.ArrayType(T.StringType()))

    df = df.withColumn("result", udf_hunspell(F.col("desc")))

    return df
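
A side note: the snippet above rebuilds the dictionary for every row. A common pattern is to cache it per executor process; a minimal sketch, assuming the same zip as above:

_dic = None

def hunspell(desc):
    # Cache the Dictionary per executor process instead of rebuilding it
    # for every row (loading a dictionary is relatively expensive).
    global _dic
    if _dic is None:
        _dic = spylls.hunspell.Dictionary.from_zip(SparkFiles.get("es_ES.zip"))
    if desc:
        return [sug for sug in _dic.lookuper.affix_forms(desc, captype=CapType.NO)]
    return [""]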
zverok commented 2 years ago

Expected zero arguments for construction of ClassDict (for spylls.hunspell.algo.capitalization.Type)

That's very weird! Can you show the full backtrace of the error?

alecuba16 commented 2 years ago

Thanks for the fast reply.

The stacktrace shows a lot of Spark noise that is not informative, and the only Python-related message is the weird one above. But given your reply, it seems to be something related to the Spark environment. I have executed the code in a local Python instance, on the driver side of the Spark (PySpark) environment, and it works properly.

So I suppose there is something with the executors' Python versions and the imports of the hunspell library: it is not being imported, or is being imported as None.

I will check that and come back with the solution.

alecuba16 commented 2 years ago

I found the problem. As I suspected, the executors' Python instances weren't able to install the hunspell library, so the import was failing, producing a cascade of Scala<->Java errors (common in PySpark stacktraces) that was hiding the main problem. I had to log in to the cluster manager to find that error.

Summarizing: you were totally right, and your code can be integrated into a Spark UDF. Thanks!
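
For reference, one common way to make a pure-Python package like spylls importable on the executors is to ship it with the job (a sketch; the archive path is hypothetical, and pip-installing spylls on every node works as well):

# Make the spylls package importable inside executor processes
# (hypothetical archive location).
spark.sparkContext.addPyFile("hdfs:///hunspell/spylls.zip")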

alecuba16 commented 2 years ago

Victor, one final question about the stemming process: what is the procedure for stemming accented words like "específicos"? It seems that the affix_forms method requires non-accented words, am I right?

Thanks!

zverok commented 2 years ago

It should depend on the dictionary only (if the dictionary has accents, they should be properly processed); but with Unicode quirks you never know :)
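
A quick way to check against a concrete dictionary (a sketch: the es_ES path is hypothetical, and the output depends on the dictionary's contents and the encoding its .aff declares):

from spylls.hunspell import Dictionary
from spylls.hunspell.algo.capitalization import Type as CapType

# The .aff's SET directive determines how accented characters
# in the .dic are decoded.
dic = Dictionary.from_files('es_ES')
for form in dic.lookuper.affix_forms('específicos', captype=CapType.NO):
    print(form.stem)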