ysbaddaden opened 1 week ago
The lexer has been refactored (see `main`).
The lexer now also tries to report entities within text data, and parameter entities in the DTD, and lets the parser handle them. These are the "easy" cases.
Entities in attribute values (and parameter entities in `entitydecl` values) are much trickier. It feels like an attvalue should be a list of nodes (weird, but :shrug:).
EDIT: No. Only internal entities (including the predefined ones) are valid in attribute values. Maybe the lexer could still process them by itself :thinking:
Or the methods should yield the entity to the parser, so it can create a new lexer, parse the `entitydecl` value, save the parsed nodes as children of the `entitydecl` so we only parse once (libxml seems to do that), and finally replace the text content into the attvalue.
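A minimal sketch of the "parse once, then substitute" idea above, in Ruby rather than the project's actual language; `EntityTable` and `expand_attvalue` are invented names, and real parsing is stubbed out as a cached string lookup:

```ruby
# Hypothetical sketch: expand internal entity references inside an
# attribute value, parsing/expanding each entity's replacement text only
# once and caching it (roughly what the comment describes libxml doing by
# storing the parsed nodes as children of the entitydecl).

class EntityTable
  def initialize(decls)
    @decls = decls # name => raw replacement text from the DTD
    @cache = {}    # name => cached expansion (computed once)
  end

  def expand(name)
    # In a real parser this would lex/parse the replacement text; here it
    # is just a fetch, memoized so we only "parse" once.
    @cache[name] ||= @decls.fetch(name)
  end
end

PREDEFINED = {
  "lt" => "<", "gt" => ">", "amp" => "&", "apos" => "'", "quot" => '"'
}

# Replace &name; references inside an attribute value. Only internal
# (including predefined) entities are allowed here, per the EDIT above.
def expand_attvalue(value, entities)
  value.gsub(/&([A-Za-z][\w.-]*);/) do
    name = Regexp.last_match(1)
    PREDEFINED[name] || entities.expand(name)
  end
end
```

The cache is what makes repeated references cheap: the second `&title;` in the same document reuses the first expansion.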
The point is to transform the push-lexer into a pull-lexer, that is, to dumb it down a notch and push more responsibility onto the parsers. This may lead to duplication in each parser, at which point we'll see how to reduce it (if need be).
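To illustrate the push vs. pull distinction, here is a toy contrast in Ruby (class names and the trivial token regex are invented; the real lexer is far more involved):

```ruby
# Push style: the lexer owns the loop and yields every token to a block.
class PushLexer
  def initialize(input)
    @input = input
  end

  def tokenize(&block)
    @input.scan(/<[^>]*>|[^<]+/) { |tok| block.call(tok) }
  end
end

# Pull style: the parser owns the loop and asks for one token at a time,
# so it can stop, branch on lookahead, or hand the lexer to a sub-parser.
class PullLexer
  def initialize(input)
    @tokens = input.scan(/<[^>]*>|[^<]+/)
    @pos = 0
  end

  def next_token
    tok = @tokens[@pos]
    @pos += 1
    tok
  end
end
```

Both produce the same token stream; the difference is purely who drives, which is what lets a pull-style parser defer or delegate work per token.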
Most of the `#lex_*` methods should become public, and the `#tokenize_*(&)` methods should be extracted into the actual parsers. Instead there should be a bunch of individual methods. For example:

- `#lex_xmldecl?` that would try to parse `<?xml`;
- `#lex_doctype?` that would try to parse `<!DOCTYPE`;
- `#lex_content` to parse the next content token.

Not relying on `yield` should also help with the processing of entities; the parsers can decide to process them immediately, or create an `EntityReference` node and process them later (on demand).

We should also be able to initialize the lexer at an initial line/column so we can reuse it to process entities (that must be valid `#content`) instead of `#replacement_text`.
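The proposed API could look roughly like this Ruby sketch (the method names come from the proposal above, but the implementation is entirely invented and error handling is omitted): the predicate methods try a construct and return `false` without consuming on a mismatch, `#lex_content` hands back an `EntityReference` node instead of expanding eagerly, and the constructor accepts an initial line/column so an entity's content can be re-lexed with sensible positions.

```ruby
EntityReference = Struct.new(:name) # expanded later, on demand

class Lexer
  def initialize(input, line: 1, column: 1)
    @input = input
    @pos = 0
    @line = line     # initial position, so a lexer can be reused
    @column = column # to process an entity's content
  end

  attr_reader :line, :column

  # Try to lex an XML declaration; false (no input consumed) otherwise.
  def lex_xmldecl?
    consume?("<?xml")
  end

  # Try to lex a doctype declaration; false (no input consumed) otherwise.
  def lex_doctype?
    consume?("<!DOCTYPE")
  end

  # Lex the next content token: a tag, an EntityReference, or text.
  # Assumes well-formed input; a real lexer would report errors here.
  def lex_content
    rest = @input[@pos..]
    return nil if rest.nil? || rest.empty?

    token =
      if rest.start_with?("&") && (m = rest.match(/\A&(\w+);/))
        EntityReference.new(m[1])
      elsif rest.start_with?("<")
        rest[/\A<[^>]*>/]
      else
        rest[/\A[^<&]+/]
      end
    advance(token.is_a?(EntityReference) ? "&#{token.name};" : token)
    token
  end

  private

  def consume?(prefix)
    return false unless @input[@pos..].to_s.start_with?(prefix)
    advance(prefix)
    true
  end

  def advance(text)
    text.each_char do |ch|
      if ch == "\n"
        @line += 1
        @column = 1
      else
        @column += 1
      end
    end
    @pos += text.length
  end
end
```

The parser drives the loop: it calls `lex_xmldecl?`/`lex_doctype?` as optional prefixes, then pulls content tokens, and when it receives an `EntityReference` it can expand immediately or keep the node and expand on demand, spinning up a new `Lexer` at the reference's line/column.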