zytedata / html-text

MIT License

Text extraction from comments fails #3

Closed lopuhin closed 4 months ago

lopuhin commented 4 months ago

It's possible to pass lxml.html nodes to html_text.extract_text like this:

>>> import html_text, lxml.html
>>> html_text.extract_text(lxml.html.fragment_fromstring('<p>foo</p>'))
'foo'

But this fails if a node happens to be an HtmlComment:

>>> html_text.extract_text(lxml.html.fragment_fromstring('<!-- comment -->'))
...
AttributeError: 'HtmlComment' object has no attribute 'strip'

Probably it would be best to return an empty string in this situation.

The use case is walking an HTML tree and calling html_text.extract_text on each sub-element: it would be nice if that did not fail when one of the sub-elements is a comment.
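Until comments are handled inside the library, a caller-side guard is one way to get this behaviour. The following is only a minimal sketch: `text_or_empty` is a hypothetical helper name, not part of html_text, and its `extract` argument stands in for html_text.extract_text. The demo uses lxml's own `text_content()` as the extractor so that it runs with lxml alone.

```python
import lxml.html


def text_or_empty(node, extract):
    """Return '' for HTML comment nodes, otherwise delegate to extract(node).

    Hypothetical helper, not part of html_text. `extract` would typically
    be html_text.extract_text.
    """
    # Comments parsed by lxml.html are HtmlComment instances, not HtmlElement.
    if isinstance(node, lxml.html.HtmlComment):
        return ''
    return extract(node)


# Iterating over an lxml element yields its children, comments included.
tree = lxml.html.fragment_fromstring(
    '<div><p>foo</p><!-- comment --><p>bar</p></div>'
)
# Stand-in extractor for the demo; swap in html_text.extract_text in real use.
texts = [text_or_empty(child, lambda el: el.text_content()) for child in tree]
print(texts)
```

With html_text installed, the call would presumably look like `text_or_empty(child, html_text.extract_text)`. The guard only covers comments; processing instructions, if they can occur in the input, would need a similar check.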

kmike commented 4 months ago

What's interesting is that we do have tests for comment extraction, see https://github.com/zytedata/html-text/blob/c573a7e0e5c883161aac970bc9cc231e36e719d0/tests/test_html_text.py#L51. Let me look into this.

lopuhin commented 4 months ago

Thanks @kmike. I think the difference from the test is that the test passes a string, while here we pass an HTML comment element. The error happens because we don't handle the HtmlComment case, so hopefully only the input validation needs to change.