yanyiwu / cppjieba

C++ version of the "Jieba" Chinese word segmentation library
MIT License

Dictionaries in this repo vs dictionaries in https://github.com/fxsjy/jieba #188

Closed: sanikolaev closed this issue 1 month ago

sanikolaev commented 2 months ago

I want to thank the maintainers of this library for their hard work. We are currently integrating it into Manticore Search (related issue: https://github.com/manticoresoftware/manticoresearch/issues/931), and I have a question about the dictionaries. How are the dictionaries in this repo (https://github.com/yanyiwu/cppjieba/tree/master/dict) different from the ones in the Jieba repository (https://github.com/fxsjy/jieba/tree/master/extra_dict)?

The formats seem to be the same. How should one decide which dictionary to use?

Translation

Sorry, I don't speak Chinese, but I see that people in this repository usually communicate in Chinese, so I attached an automatic Chinese translation of the question above.

yanyiwu commented 1 month ago

The expected results are basically the same, so you can choose either one according to your preference.

sanikolaev commented 1 month ago

Thanks for your response.

> you can choose according to your preference

The issue is that I can't really prefer one over the other because I don't speak Chinese and haven't used either of these dictionaries before. I just noticed that one is about 5MB, while the other one is 8MB. That seems like a significant size difference for dictionaries. Does this mean that the larger dictionary will result in better segmentation?

yanyiwu commented 1 month ago

Yes. Generally speaking, a larger dictionary does lead to better word segmentation results.

sanikolaev commented 1 month ago

> larger dictionary does lead to better word segmentation results

Thank you. Closing the issue.
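
For anyone who wants to look past file size when choosing between the two dictionaries, here is a minimal sketch that compares their vocabularies. It assumes both files are UTF-8 text with one entry per line and the word as the first whitespace-separated field (frequency and part-of-speech tag may follow). The function names `load_words` and `compare_dicts` are made up for this example and are not part of cppjieba or jieba.

```python
def load_words(path):
    """Return the set of words found in a dictionary file.

    Assumption: one entry per line, with the word as the first
    whitespace-separated field; any later fields are ignored.
    """
    words = set()
    # "utf-8-sig" reads plain UTF-8 and also drops a leading BOM if present.
    with open(path, encoding="utf-8-sig") as f:
        for line in f:
            fields = line.split()
            if fields:  # skip blank lines
                words.add(fields[0])
    return words


def compare_dicts(path_a, path_b):
    """Summarize vocabulary size and overlap of two dictionary files."""
    a = load_words(path_a)
    b = load_words(path_b)
    return {
        "a_total": len(a),       # distinct words in the first file
        "b_total": len(b),       # distinct words in the second file
        "shared": len(a & b),    # words present in both files
        "only_a": len(a - b),    # words only in the first file
        "only_b": len(b - a),    # words only in the second file
    }
```

Calling `compare_dicts(path_a, path_b)` with the local paths of the two downloaded files gives the counts. A large `only_b` with a small `only_a` would suggest the second dictionary is roughly a superset of the first. The counts describe vocabulary coverage only and do not measure segmentation quality.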