weblyzard / inscriptis

A python based HTML to text conversion library, command line client and Web service.
Apache License 2.0

Presentation of internal spans seems a bit odd #75

Closed by mikix 1 year ago

mikix commented 1 year ago
>>> import inscriptis
>>> print(inscriptis.get_text("fi<span>r</span>st"))
fi r st

>>> print(inscriptis.get_text("fi<b>r</b>st"))
first

I would expect those two examples to match (and look like the <b> example, where it's one word, as that's what a browser shows).

Inscriptis seems to work really well though! Thanks for this software.

mikix commented 1 year ago

Oh I see - I think this is just an indentation setting issue. Extended (the default) yields the result I saw. Standard/strict do not.

I guess I'll close this as user-misunderstanding. Thanks!

AlbertWeichselbraun commented 1 year ago

Just in case someone wonders how to implement the strict setting for the example above:

from inscriptis import get_text
from inscriptis.css_profiles import CSS_PROFILES
from inscriptis.model.config import ParserConfig

config = ParserConfig(css=CSS_PROFILES['strict'].copy())
text = get_text('fi<span>r</span>st', config)

print(text)