Highlighting bug after tokenization #57

zero-hero-he / paoding

Automatically exported from code.google.com/p/paoding

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago


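   Background: the text contains "H1N1". It is tokenized with paoding, and the search results are displayed with highlighting.
   Problem: every "H1N1" in the original text becomes "H1NH1N1" after highlighting. Switching to Lucene's built-in StandardAnalyzer does not show this problem.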

Original issue reported on code.google.com by gengbo1...@gmail.com on 10 Mar 2010 at 7:41

GoogleCodeExporter commented 9 years ago

reno, could this be a problem in our tokenization?

Original comment by qieqie.wang on 18 Mar 2010 at 2:32

GoogleCodeExporter commented 9 years ago

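This is not a tokenization problem. Lucene's built-in highlighter cannot correctly handle overlapping tokens. For example, if ABC is tokenized as AB/BC, the highlighted output becomes ABBC.
StandardAnalyzer never produces overlapping tokens, so it does not hit this problem.
The root cause is that Lucene's highlighter was designed for Western languages, where tokens do not overlap.
The next paoding release will include a fixed version of the Lucene highlighter for Chinese highlighting, which will resolve this issue.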
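
The ABC => AB/BC case can be reproduced with a toy model. The Python sketch below is not Lucene's code; `naive_highlight` and `merged_highlight` are hypothetical names, and the character offsets are assumed for illustration. It only shows why wrapping each overlapping token separately repeats the shared character, and how merging overlapping ranges first avoids that.

```python
def naive_highlight(text, tokens):
    """Wrap every token separately, with no handling of overlaps."""
    out = []
    last_end = 0
    for start, end in tokens:
        # Copy untouched text between the previous token and this one.
        if start > last_end:
            out.append(text[last_end:start])
        out.append("<B>" + text[start:end] + "</B>")
        last_end = max(last_end, end)
    out.append(text[last_end:])
    return "".join(out)


def merged_highlight(text, tokens):
    """Merge overlapping token ranges, then wrap each merged range once."""
    merged = []
    for start, end in sorted(tokens):
        if merged and start < merged[-1][1]:
            # Overlaps the previous range: extend it instead of adding a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return naive_highlight(text, [tuple(r) for r in merged])


text = "ABC"
tokens = [(0, 2), (1, 3)]  # AB and BC share the character "B"
print(naive_highlight(text, tokens))   # <B>AB</B><B>BC</B>, i.e. ABBC without tags
print(merged_highlight(text, tokens))  # <B>ABC</B>
```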

Original comment by reno....@gmail.com on 18 Mar 2010 at 2:49

GoogleCodeExporter commented 9 years ago

Thank you very much. I hope the paoding packages that support Lucene 3.0 and highlighting will be made available soon.

Original comment by gengbo1...@gmail.com on 19 Mar 2010 at 8:24

GoogleCodeExporter commented 9 years ago

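Really looking forward to it. I have been waiting throughout 2010.
Highlighting has the following problems:
    1. Symbols are filtered out
    2. Overlaps appear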

Original comment by stt...@163.com on 10 Aug 2010 at 6:51

GoogleCodeExporter commented 9 years ago

Yes, these problems exist. reno tells me they have already been fixed in the svn repository; I just have not provided a packaged download yet.

Original comment by qieqie.wang on 10 Aug 2010 at 6:57

GoogleCodeExporter commented 9 years ago

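It seems that symbols are not filtered out after all, but filtered (stop) words are not highlighted, and the overlap problem still appears to be unresolved. For example:

Query: 关于做好220kV
Result: 关于做好220220kV增棠甲乙线增容改造及永和开发区迁改期间电网安全及电力供应工作的通知

Query: 关于组织收看《广州市学习实践科学发展观活动专题报告会》的通知
Result: 关于组织收看《广州市学习实践科学发展观活动专题广州市学习实践科学发展观活动专题报告会》的通知

Query: 220kv
Result: 关于印发《广州供电局110kV～220220kV高压设备SF6气体湿度带电测试工作管理规定》的通知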

Original comment by stt...@163.com on 10 Aug 2010 at 2:25

Attachments:

高亮重叠.doc

GoogleCodeExporter commented 9 years ago

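The highlighting problem with overlapping tokens has already been fixed in Lucene; see https://issues.apache.org/jira/browse/LUCENE-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12421332
There is still a problem, however, which I personally think is caused by the order in which paoding outputs its tokens.
Example:
The dictionary contains the words "因为", "为爱", "爱", "爱情", ...
doc=因为爱
keyword=因为为爱, tokenization result=因为 爱 为爱
The highlight result is: <B>因为</B><B>为爱</B>

If "爱" is removed from the dictionary, the highlight result is correct.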
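
The effect of token order can be sketched with a simplified model of a highlighter that groups overlapping tokens as they arrive and closes a group as soon as a token starts at or after the group's end. This is an assumption about the mechanism, not Lucene's or paoding's actual code; `group_highlight` is a hypothetical name, the offsets are derived by hand from doc=因为爱, and the query terms are assumed to be 因为 and 为爱.

```python
def group_highlight(text, tokens, query_terms):
    """Group overlapping tokens in arrival order and wrap each group's matched span."""
    out = []
    last_end = 0
    group_end = None  # end offset of the current group of overlapping tokens
    match = None      # (start, end) span covered by query matches in the group

    def flush():
        nonlocal last_end
        if match is None:
            return  # nothing to highlight; the text is copied later as a gap
        m_start, m_end = match
        if m_start > last_end:
            out.append(text[last_end:m_start])
        out.append("<B>" + text[m_start:m_end] + "</B>")
        last_end = max(last_end, m_end)

    for start, end in tokens:
        # A token starting at or after the group's end closes the group.
        if group_end is not None and start >= group_end:
            flush()
            group_end, match = None, None
        group_end = end if group_end is None else max(group_end, end)
        if text[start:end] in query_terms:
            match = (start, end) if match is None else (
                min(match[0], start), max(match[1], end))
    flush()
    out.append(text[last_end:])
    return "".join(out)


doc = "因为爱"
terms = {"因为", "为爱"}
reported_order = [(0, 2), (2, 3), (1, 3)]  # 因为, 爱, 为爱
print(group_highlight(doc, reported_order, terms))          # <B>因为</B><B>为爱</B>
print(group_highlight(doc, sorted(reported_order), terms))  # <B>因为爱</B>
print(group_highlight(doc, [(0, 2), (1, 3)], terms))        # <B>因为爱</B>
```

In this model, the token 爱 arrives before 为爱 and starts exactly where 因为 ends, so the first group is closed too early. Sorting the tokens by start offset, or removing 爱 from the token list, keeps 因为 and 为爱 in one group.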

Original comment by cn.yan...@gmail.com on 21 Apr 2011 at 3:12