Open alecuba16 opened 2 years ago
Would it be possible to populate the dictionary by submitting a LIST with the contents of the .dic and .aff files?
It is theoretically possible. You'll need to implement a wrapper around a list that satisfies two requirements:
reset_encoding(encoding_name)
which works in the middle of iteration and makes the following lines be read in a different encoding. Once you have this, you can just:
aff, context = spylls.hunspell.readers.read_aff(MyReader(af_lines_list))
dic = spylls.hunspell.readers.read_dic(MyReader(dic_lines_list), aff=aff, context=context)
dictionary = spylls.hunspell.Dictionary(aff, dic)
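A minimal sketch of what such a list-backed reader could look like. The class name `ListReader` is hypothetical, and two details are assumptions not confirmed in this thread: that the reader is expected to yield `(line_number, text)` tuples, and that `latin-1` is a reasonable starting encoding. Check both against the readers shipped with spylls before relying on this.

```python
class ListReader:
    """Iterate over a list of lines, decoding bytes lazily.

    Decoding happens one line at a time, so reset_encoding() called in the
    middle of iteration only affects the lines that have not been read yet.
    """

    def __init__(self, lines, encoding="latin-1"):
        # Assumption: lines are raw bytes (or already-decoded str).
        self._lines = list(lines)
        self._encoding = encoding
        self._pos = 0

    def reset_encoding(self, encoding):
        # Switch the codec used for all subsequent lines.
        self._encoding = encoding

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos >= len(self._lines):
            raise StopIteration
        raw = self._lines[self._pos]
        self._pos += 1
        text = raw.decode(self._encoding) if isinstance(raw, bytes) else raw
        # Assumption: consumers expect (1-based line number, stripped text).
        return self._pos, text.strip()
```

Keeping the lines as bytes until they are read is what makes a mid-iteration encoding switch possible; a list of already-decoded strings cannot be re-decoded afterwards.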
By the way, is there any way to implement stemming like in the original hunspell library? Or is there some alternative for stemming?
Not that convenient, but
import spylls.hunspell
from spylls.hunspell.algo.capitalization import Type as CapType
dic = spylls.hunspell.Dictionary.from_files('examples/en_US')
for form in dic.lookuper.affix_forms('kittens', captype=CapType.NO):
    print(form.stem)
# prints: "kitten"
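If a word can be analysed in more than one way, the loop above may print the same stem several times. A small helper (hypothetical name `unique_stems`, not part of spylls) could collect the distinct stems in order of first appearance. It only assumes that each form object has a `.stem` attribute, as in the snippet above.

```python
def unique_stems(forms):
    """Return distinct stems from an iterable of forms, preserving order."""
    # dict.fromkeys keeps insertion order and drops repeated keys.
    return list(dict.fromkeys(form.stem for form in forms))
```

It would be called as `unique_stems(dic.lookuper.affix_forms('kittens', captype=CapType.NO))`.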
Thanks for the reply. Regarding the first issue, I was able to populate the dictionary with the method you suggested. I have some problems with the encoding of special characters, but that is something I will address next week.
Regarding the second issue, stemming: I tested the code you provided, but it seems that some import (or library version) is preventing captype from being passed:
Expected zero arguments for construction of ClassDict (for spylls.hunspell.algo.capitalization.Type)
The complete code snippet is this (ignore the Spark UDF wrapper):
from pyspark.sql import *
import pyspark.sql.functions as F
import pyspark.sql.types as T
import spylls
from spylls.hunspell.algo.capitalization import Type as CapType
from pyspark import SparkFiles
def pyspark_transform(spark, df):
    def hunspell(desc):
        if desc:
            dic = spylls.hunspell.Dictionary.from_zip(SparkFiles.get("es_ES.zip"))
            return [sug for sug in dic.lookuper.affix_forms(desc, captype=CapType.NO)]
        else:
            return [""]
    dic_path="hdfs:///hunspell/es_ES.zip"
    spark.sparkContext.addFile(dic_path)
    udf_hunspell = F.udf(hunspell, T.ArrayType(T.StringType()))
    df=df.withColumn("result",udf_hunspell(F.col("desc")))
    return df
Expected zero arguments for construction of ClassDict (for spylls.hunspell.algo.capitalization.Type)
That's very weird! Can you show a full backtrace of an error?
Thanks for the fast reply.
The stack trace shows a lot of Spark noise that is not informative, and the only Python-related message is the weird one. But judging from your message, it seems to be something related to the Spark environment. I executed the code in a local Python instance, on the driver side of the Spark (PySpark) environment, and it works properly.
So there is something about the Python versions on the executors and the imports of the hunspell library: I suppose it is not being imported, or is being imported as None.
I will check that and come back with the solution.
I found the problem. As I suspected, the executors' Python instances weren't able to install the hunspell library and the import was failing, producing a cascade of Scala<->Java errors (common in PySpark stack traces) that was hiding the main problem. I had to log in to the cluster manager to find that error.
In summary, you were totally right and your code can be integrated into a Spark UDF, thanks!
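For anyone hitting the same symptom, one way to surface a missing library on the executors without digging through cluster logs is to run a tiny import check there. The helper name `missing_modules` is hypothetical, and the Spark call in the trailing comment assumes an existing `SparkSession` named `spark`; this is a sketch, not the procedure used in this thread.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


# On a cluster, the same function could be shipped to the executors, e.g.:
#   spark.sparkContext.parallelize(range(4), 4) \
#        .map(lambda _: missing_modules(["spylls"])) \
#        .collect()
# Any non-empty list in the result points at an executor without the library.
```

Only top-level names such as `spylls` should be passed; for dotted names `find_spec` imports the parent package and raises if that parent is missing.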
Victor, one final question about the stemming process. What is the procedure for stemming accented words like "específicos"? It seems that the affix forms method requires unaccented words, am I right?
thanks!
It should depend on the dictionary only (if the dictionary has accents, they should be properly processed); but with Unicode quirks you never know :)
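One Unicode quirk worth ruling out (an assumption about a possible cause, not something confirmed in this thread) is normalization form: the same accented word can be stored either with precomposed characters (NFC) or as a base letter plus a combining accent (NFD), and the two do not compare equal. The sketch below uses only the standard library to show the difference and to normalize input before a lookup.

```python
import unicodedata

# "específicos" written with a precomposed i-acute (U+00ED).
composed = "espec\u00edficos"

# The NFD form splits the accented letter into "i" + combining acute (U+0301).
decomposed = unicodedata.normalize("NFD", composed)

print(composed == decomposed)            # the two strings differ
print(len(composed), len(decomposed))    # NFD is one code point longer

# Normalizing to NFC restores the precomposed spelling.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)
```

If the dictionary files use one form and the input text the other, normalizing the input to the dictionary's form before calling the lookup may help.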
Hello!
Would it be possible to populate the dictionary by submitting a LIST with the contents of the .dic and .aff files?
This is useful in the case of Spark UDFs, where it is easier to pass LIST variables than to copy the .dic and .aff files from the driver node to the executors.
By the way, is there any way to implement stemming like in the original hunspell library? Or is there some alternative for stemming?