zorojean / cx-extractor

Automatically exported from code.google.com/p/cx-extractor

A time-consuming step #4

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
source = links.matcher(source).replaceAll("");

Sample page: http://news.itxinwen.com/2013/0802/515691.shtml

On this page, this step alone takes more than 90 seconds.

Suggestion: could all tags simply be removed with source = source.replaceAll("<[^>]+>", ""); instead?
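A minimal sketch of that suggestion, for illustration only. The class name `TagStripper` and the method `stripTags` are hypothetical and not part of cx-extractor. It keeps the regex in a precompiled `Pattern`, since `String.replaceAll` compiles the expression again on every call. Note that this strips every tag and keeps the text between them, so it may not behave the same as the existing `links` step.

```java
import java.util.regex.Pattern;

public class TagStripper {
    // Hypothetical helper: compiled once and reused across calls.
    private static final Pattern TAGS = Pattern.compile("<[^>]+>");

    // Removes every tag but keeps the text between the tags.
    static String stripTags(String source) {
        return TAGS.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Hello <a href=\"x\">world</a></p>";
        System.out.println(stripTags(html)); // prints "Hello world"
    }
}
```

Because `[^>]+` can never run past the next `>`, each match attempt is bounded by the length of a single tag, which is probably why this form is cheap on large pages.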

Original issue reported on code.google.com by ywq1...@gmail.com on 2 Aug 2013 at 8:01

GoogleCodeExporter commented 8 years ago
private static Pattern links = Pattern.compile("<[^>]+>.*?</[aA]>");

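This pattern works better for cases such as <a>contents<a>.

The only drawback is that any passage of the main text that contains a hyperlink will be deleted as well.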
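The drawback can be seen on a small input. The sketch below applies the pattern from this comment to a one-line paragraph. The class name, the method names, and the second pattern (`ANCHOR_TAGS`) are assumptions added for comparison and are not from the project; the second pattern only removes the anchor tags themselves and leaves the link text in place.

```java
import java.util.regex.Pattern;

public class LinkPatternDemo {
    // Pattern quoted in the comment above.
    private static final Pattern LINKS = Pattern.compile("<[^>]+>.*?</[aA]>");

    // Hypothetical alternative (an assumption, not the project's code):
    // matches only opening and closing <a> tags, so the link text survives.
    private static final Pattern ANCHOR_TAGS = Pattern.compile("</?[aA]\\b[^>]*>");

    static String removeLinks(String source) {
        return LINKS.matcher(source).replaceAll("");
    }

    static String unwrapLinks(String source) {
        return ANCHOR_TAGS.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Body text with <a href=\"u\">a link</a> inside.</p>";
        // The match starts at the first tag (<p>), not at <a>, and runs to </a>.
        System.out.println(removeLinks(html)); // prints " inside.</p>"
        System.out.println(unwrapLinks(html)); // prints "<p>Body text with a link inside.</p>"
    }
}
```

Since `<[^>]+>` accepts any tag as the starting point, the first pattern removes everything from the nearest preceding tag on the same line up to `</a>`, not just the link itself.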

Original comment by ywq1...@gmail.com on 2 Aug 2013 at 9:57