[feature request] make Token::SKIP also skip for parser

lublak commented 3 years ago

currently in some sigils there are always the "skipped" data inside:

$lexer->push('\s+', Token::SKIP);
$parser->sigil(0); // returns "(   "

BenHanson commented 1 year ago

@weltling Is this due to mixing UTF-8 with UTF-32? I imagine it would be better to use lexertl UTF-8 iterators to convert the UTF-32 back to UTF-8.

BenHanson commented 1 year ago

@lublak Could you provide a self contained test case for this? Then I can work on a fix.

Thanks

weltling commented 1 year ago

@BenHanson with regard to UTF-8 vs UTF-32 - the conversion currently relies on libstdc++, please see here lib/parle/cvt.hpp . I think Windows is not quite happy there, but that's probably the least of an issue. The conversion between the PHP engine using UTF-8 and lexertl/parsertl using UTF-32 is don at the corresponding places, can grep them by using the macros as in the header. But if you think moving away from the std functionality to the custom one would bring a better result, it's of course open to change.

With this issue, however, i'd second the request to @lublak for a proper reproducer, the provided data doesn't seem enough.

Thanks

BenHanson commented 1 year ago

@weltling Yes, the lexertl utf iterators are platform independent so we should definitely investigate switching over.

If UTF-8 is the preferred Unicode format for PHP, then we don't even need a UTF-32 build. We can discuss this further to decide how you want it to work.

weltling commented 1 year ago

AFAIR the reason to support UTF-32 is that otherwise some regex options are not available. At least these come in question:

https://www.php.net/manual/en/parle.regex.unicodecharclass.php

These seem to be provided standard C++ lib and are UTF-32 only, if i don't err.

Thanks

BenHanson commented 1 year ago

What you can't do is have a lexertl state_machine that takes 8 bit characters in order to use Unicode characters.

However, input text (including rules) can be UTF-8 and it's possible to convert those strings on the fly in C++ using the lexertl iterators.

I can write you a demo if it makes it clearer.

weltling commented 1 year ago

The conversions already happen at the corresponding places, as mentioned with regard to ./lib/parle/cvt.hpp currently using the functionality from <codecvt>. The conversion happens at runtime. It's just with the current design, to have UTF-32 support a special build is required. Apart from that, switching to lexertl iterators instead of codecvt is sure something that can be considered. With PHP side it is always UTF-8, whether lexertl/parsertl internally support UTF-32 or not.

UTF-8 variant is compiled by default for simplicity. Of course it is possible to support UTF-8 only or UTF-32 supporting build, it would probably require rewriting the internals to carry another type of C++ object inside. BTW same is actually with things like Parser vs. RParser and Lexer vs. RLexer clasess - it could be just one class name and be constructed from PHP, just at the time of creating this it seemed simpler to have different class names or different build. Unifying might be worth it or not, usability or performance wise.

Thanks

BenHanson commented 1 year ago

I've now got conversion of UTF-32 to UTF-8 up and running and working for parser dumping.

weltling commented 1 year ago

Very nice. The char32_t issue is what i've seen as well as a blocker for the UTF-32 part.

Thansk

weltling / parle

[feature request] make Token::SKIP also skip for parser #27