yichen0831 / opencc-python

OpenCC made with Python
Apache License 2.0
532 stars, 66 forks

opencc-python Conversion Does Not Match OpenCC #1

Closed Hopkins1 closed 8 years ago

Hopkins1 commented 8 years ago

When running a conversion of "s2twp", the results for opencc-python do not always match those for OpenCC. For example:

OpenCC: "一干 " -> "一干 "
opencc-python: "一干 " -> "一幹 "

Note: It appears that the opencc-python conversion chain does not honor the "group" tag in the configuration file.

The chain is:
[TWVariantsRevPhrases.txt, TWVariantsRev.txt, TWPhrasesRev.txt, TSPhrases.txt, TSCharacters.txt]

The chain should be:
[[TWVariantsRevPhrases.txt, TWVariantsRev.txt], TWPhrasesRev.txt, [TSPhrases.txt, TSCharacters.txt]]
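To illustrate why flattening the groups can change the output, here is a minimal sketch with toy dictionaries. It is not the opencc-python or OpenCC code. The names `convert_pass` and `run_chain` are hypothetical, and the sketch assumes that a group means its dictionaries are consulted together in a single left-to-right pass, with the longest match winning and earlier dictionaries taking priority on ties.

```python
def convert_pass(text, dicts):
    """One left-to-right pass over text, consulting all dicts at each position.

    Assumption: the longest key found in any dict wins; on equal length,
    the dict listed first wins.
    """
    max_len = max((len(key) for d in dicts for key in d), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            hit = next((d[piece] for d in dicts if piece in d), None)
            if hit is not None:
                out.append(hit)
                i += length
                break
        else:
            # No dict matched at this position: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


def run_chain(text, chain):
    """Apply each chain step in order; a list step is treated as one group."""
    for step in chain:
        group = step if isinstance(step, list) else [step]
        text = convert_pass(text, group)
    return text


# Toy data (hypothetical, not taken from the OpenCC dictionary files).
phrases = {"ab": "XY"}
chars = {"a": "b", "X": "Q"}

flat = run_chain("ab", [phrases, chars])       # two passes: "ab" -> "XY" -> "QY"
grouped = run_chain("ab", [[phrases, chars]])  # one pass:   "ab" -> "XY"
print(flat, grouped)
```

In the flat chain, the second dictionary rewrites the output of the first, so text that a phrase dictionary has already converted gets converted again. With the group treated as one step, each position in the input is converted only once.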

I've made changes to example.py and opencc.py that appear to fix the problem. The new implementation is also ~6x faster. Because the changes are large, I've decided to attach the modified files rather than create a branch.

example.py.zip

opencc.py.zip

yichen0831 commented 8 years ago

Thanks for the improvements. As I have only a little time for the project, I will use your modifications for the update.

Hopkins1 commented 8 years ago

OK, sounds good. I tested the code on Python 2.7 and 3.5 and everything seemed OK.

If there is going to be a version update, I also see that the TWPhrasesIT.txt file in OpenCC has been updated. This could be a good chance to update TWPhrasesIT.txt and TWPhrases.txt in this project.