xjl219 / boilerpipe

Automatically exported from code.google.com/p/boilerpipe

Documentation - How to output HTML extract fragment instead of text? #32

GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Fantastic tool. I've been wondering how to output the extracted HTML fragment
instead of plain text, similar to what the appspot app does.

Original issue reported on code.google.com by gyorgy.c...@gmail.com on 20 Nov 2011 at 3:47

GoogleCodeExporter commented 9 years ago
Same here. Can you please let us know whether this can be done from the
API?

Cheers, and congrats on the excellent work!

D.

Original comment by Dimitris...@gmail.com on 23 Nov 2011 at 11:18

GoogleCodeExporter commented 9 years ago
I searched a little more and found the solution:

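1) If your input is a string:
import java.io.StringReader;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.BoilerpipeExtractor;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLHighlighter;

// detailPageSourceCode holds the page's HTML as a String
final BoilerpipeExtractor extractor = CommonExtractors.DEFAULT_EXTRACTOR;
final HTMLHighlighter hh = HTMLHighlighter.newExtractingInstance();
InputSource is = new InputSource(new StringReader(detailPageSourceCode));
final TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();
extractor.process(doc); // classify the blocks as content or boilerplate
StringBuilder bf = new StringBuilder();
bf.append("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />");
bf.append(hh.process(doc, detailPageSourceCode)); // append the extracted HTML fragment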

2) If your input is a URL (taken from HTMLHighlighterDemo.java):

import java.io.PrintWriter;
import java.net.URL;
import de.l3s.boilerpipe.BoilerpipeExtractor;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.sax.HTMLHighlighter;

URL url = new URL(
        "http://research.microsoft.com/en-us/um/people/ryenw/hcir2010/challenge.html"
//      "http://boilerpipe-web.appspot.com/"
        );

// choose from a set of useful BoilerpipeExtractors...
final BoilerpipeExtractor extractor = CommonExtractors.ARTICLE_EXTRACTOR;
// choose the operation mode (i.e., highlighting or extraction)
final HTMLHighlighter hh = HTMLHighlighter.newExtractingInstance();

// write the extracted HTML to a file; <base> keeps relative links resolving
PrintWriter out = new PrintWriter("/tmp/highlighted.html", "UTF-8");
out.println("<base href=\"" + url + "\" >");
out.println("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />");
out.println(hh.process(url, extractor));
out.close();
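
Both cases prepend the same header tags to the extracted fragment, so that step could be pulled into a small helper. The following is only a sketch: the class name FragmentPage and the method wrap are hypothetical (not part of boilerpipe), and the fragment argument stands in for the String returned by hh.process(...). It writes the media type as text/html.

```java
class FragmentPage {

    // Hypothetical helper (not a boilerpipe API): prepends a <base> tag and a
    // Content-Type <meta> tag to an already extracted HTML fragment.
    static String wrap(String fragment, String baseUrl) {
        StringBuilder sb = new StringBuilder();
        if (baseUrl != null) {
            // Lets relative links and images in the fragment resolve
            // against the page the fragment was extracted from.
            sb.append("<base href=\"").append(baseUrl).append("\" >\n");
        }
        sb.append("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />\n");
        sb.append(fragment);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Placeholder fragment; in real use this would come from hh.process(...).
        String fragment = "<p>Extracted article text</p>";
        System.out.println(wrap(fragment, "http://example.com/article.html"));
    }
}
```

With a string input there is no source URL, so passing null as baseUrl skips the <base> tag, which matches case 1 above.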

Cheers

Original comment by Dimitris...@gmail.com on 24 Nov 2011 at 11:01

GoogleCodeExporter commented 9 years ago
That's the correct solution (= HTMLHighlighterDemo.java).

Original comment by ckkohl79 on 24 Nov 2011 at 5:44

GoogleCodeExporter commented 9 years ago
Can anyone tell me how to output JSON?

Original comment by waelmiladi on 26 Sep 2012 at 9:05