weblyzard / inscriptis

A Python-based HTML to text conversion library, command line client and Web service.
Apache License 2.0

exclude header & footer #79

Closed by hadifar 9 months ago

hadifar commented 10 months ago

Thanks for your awesome library. How can I exclude headers & footers (for every website)?

AlbertWeichselbraun commented 10 months ago

This is a non-trivial task that requires specialized tools, rather than an HTML to text conversion library.

You could either

  1. clean up the resulting text representation (which is easy if the headers/footers stay constant), or
  2. apply techniques such as boilerplate removal, as described in the following paper: Lang, Heinz-Peter, Wohlgenannt, Gerhard and Weichselbraun, Albert. (2012). “TextSweeper - A System for Content Extraction and Overview Page Detection”. International Conference on Information Resources Management (Conf-IRM), Vienna, Austria; http://eprints.weblyzard.com/55/1/lang2012-textSweeper.pdf
    1. For more complex use cases such as Web forums, you would use content extraction techniques such as HARVEST: Weichselbraun, Albert, Brasoveanu, Adrian M. P., Waldvogel, Roger and Odoni, Fabian. (2020). “Harvest - An Open Source Toolkit for Extracting Posts and Post Metadata from Web Forums”. 2020 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Melbourne, Australia; https://arxiv.org/pdf/2102.02240
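
The first option (cleaning up the text when headers/footers stay constant) could look roughly like the sketch below. It is not part of inscriptis. It assumes you have already converted at least two pages of the same site to text, for instance with `inscriptis.get_text(html)`, and that the header and footer appear as exactly identical lines on every page. The helper names `shared_leading_lines` and `strip_constant_header_footer` are hypothetical.

```python
def shared_leading_lines(pages):
    """Count how many leading lines are identical across all pages."""
    count = 0
    for lines in zip(*pages):
        if len(set(lines)) != 1:
            break
        count += 1
    return count


def strip_constant_header_footer(texts):
    """Drop lines that open or close every text in exactly the same way.

    `texts` are plain-text renderings of pages from one site
    (assumption: produced beforehand, e.g. by inscriptis.get_text).
    """
    if len(texts) < 2:
        # With a single page there is nothing to compare against.
        return list(texts)

    pages = [text.splitlines() for text in texts]

    # Shared header: identical lines at the top of every page.
    head = shared_leading_lines(pages)
    bodies = [page[head:] for page in pages]

    # Shared footer: identical lines at the bottom, found by reversing.
    tail = shared_leading_lines([body[::-1] for body in bodies])

    return ["\n".join(body[:len(body) - tail]) for body in bodies]


if __name__ == "__main__":
    sample = [
        "Home | About\nLogo\nArticle one text\nMore\n(c) Example",
        "Home | About\nLogo\nSecond article\n(c) Example",
    ]
    for cleaned in strip_constant_header_footer(sample):
        print(repr(cleaned))
```

This only handles headers and footers that are literally constant. Anything that varies per page (dates, breadcrumbs, "related articles" lists) needs the boilerplate-removal approaches from the second option.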