Open lublak opened 3 years ago
@weltling Is this due to mixing UTF-8 with UTF-32? I imagine it would be better to use lexertl UTF-8 iterators to convert the UTF-32 back to UTF-8.
@lublak Could you provide a self contained test case for this? Then I can work on a fix.
Thanks
@BenHanson with regard to UTF-8 vs UTF-32 - the conversion currently relies on libstdc++
, please see here lib/parle/cvt.hpp
. I think Windows is not quite happy there, but that's probably the least of an issue. The conversion between the PHP engine using UTF-8 and lexertl/parsertl using UTF-32 is don at the corresponding places, can grep them by using the macros as in the header. But if you think moving away from the std functionality to the custom one would bring a better result, it's of course open to change.
With this issue, however, i'd second the request to @lublak for a proper reproducer, the provided data doesn't seem enough.
Thanks
@weltling Yes, the lexertl utf iterators are platform independent so we should definitely investigate switching over.
If UTF-8 is the preferred Unicode format for PHP, then we don't even need a UTF-32 build. We can discuss this further to decide how you want it to work.
AFAIR the reason to support UTF-32
is that otherwise some regex options are not available. At least these come in question:
https://www.php.net/manual/en/parle.regex.unicodecharclass.php
These seem to be provided standard C++ lib and are UTF-32 only, if i don't err.
Thanks
What you can't do is have a lexertl state_machine that takes 8 bit characters in order to use Unicode characters.
However, input text (including rules) can be UTF-8 and it's possible to convert those strings on the fly in C++ using the lexertl iterators.
I can write you a demo if it makes it clearer.
The conversions already happen at the corresponding places, as mentioned with regard to ./lib/parle/cvt.hpp
currently using the functionality from <codecvt>
. The conversion happens at runtime. It's just with the current design, to have UTF-32 support a special build is required. Apart from that, switching to lexertl
iterators instead of codecvt
is sure something that can be considered. With PHP side it is always UTF-8, whether lexertl/parsertl internally support UTF-32 or not.
UTF-8 variant is compiled by default for simplicity. Of course it is possible to support UTF-8 only or UTF-32 supporting build, it would probably require rewriting the internals to carry another type of C++ object inside. BTW same is actually with things like Parser vs. RParser and Lexer vs. RLexer clasess - it could be just one class name and be constructed from PHP, just at the time of creating this it seemed simpler to have different class names or different build. Unifying might be worth it or not, usability or performance wise.
Thanks
I've now got conversion of UTF-32 to UTF-8 up and running and working for parser dumping.
Very nice. The char32_t
issue is what i've seen as well as a blocker for the UTF-32 part.
Thansk
currently in some sigils there are always the "skipped" data inside: