yhirose / cpp-peglib

A single file C++ header-only PEG (Parsing Expression Grammars) library
MIT License

Unable to parse "special" characters like £¤¥¦§ #265

Closed: iain-waugh closed this issue 1 year ago

iain-waugh commented 1 year ago

I have found that cpp-peglib does not seem to parse characters above the 7-bit ASCII range (character code 160 and above), such as those specified in the VHDL standard library, which can be found here.

This happens when I specify the characters manually (as per the standard):

other_special_character <- backslash backslash / [!$%^{}~¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿×÷]

or if I have defined character_literal as:

character_literal <- < "'" < . > "'" >

With the last example, peglint gives me this error:

test_16.3_package_standard.vhd:69:6: syntax error, expecting <character_literal>.

Is this something that is supported?

My PEG is here.

yhirose commented 1 year ago

@iain-waugh, thanks for the report, but I don't fully understand this situation. Could you make the smallest possible PEG grammar and input text, so that I can easily reproduce it? Thanks!

iain-waugh commented 1 year ago

Using this grammar:

# PEG for any thing like: 'x'
sample_peg <-  Spacing? character_literal+ EndOfFile
Spacing <- Space*
Space <- ' ' / '\t' / EndOfLine
EndOfLine <- '\r\n' / '\n' / '\r'
EndOfFile <- !.
~_     <- Spacing
%whitespace <- _

character_literal <- < "'" . "'" >

If you try to parse the attached file (literals.txt), you get this:

literals.txt:1:10: syntax error, expecting <character_literal>.

The literals.txt file is just:

'A' 'b' '<0xA0>' 'D'

(where <0xA0> stands for the single raw byte 0xA0). It can be tricky to create this file with some text editors; a hex dump of it shows the byte 0xA0 for the non-breaking space:

27 41 27 20 27 62 27 20 27 A0 27 20 27 44 27 0D

iain-waugh commented 1 year ago

It looks like the problem is that the file is saved as ANSI. When I re-create it as UTF-8 (literals2.txt), it works. The non-breaking space is then encoded as the two bytes C2 A0:

27 41 27 20 27 62 27 20 27 C2 A0 27 20 27 44 27
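
The difference between the two dumps can be checked with a small well-formedness test. The following is a minimal sketch, not cpp-peglib's own code: the function name `first_invalid_utf8` is hypothetical, and the check is simplified (it does not reject overlong forms or surrogates). It shows that a lone 0xA0 byte cannot start a UTF-8 sequence, while C2 A0 is a valid two-byte sequence.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of cpp-peglib): returns the 0-based offset of
// the first byte that does not begin a well-formed UTF-8 sequence, or
// std::string::npos if the whole input passes this simplified check.
std::size_t first_invalid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;
        if (c < 0x80)                len = 1;  // 0xxxxxxx: ASCII
        else if ((c & 0xE0) == 0xC0) len = 2;  // 110xxxxx: 2-byte lead
        else if ((c & 0xF0) == 0xE0) len = 3;  // 1110xxxx: 3-byte lead
        else if ((c & 0xF8) == 0xF0) len = 4;  // 11110xxx: 4-byte lead
        else return i;                         // stray continuation or invalid lead byte
        if (i + len > s.size()) return i;      // sequence truncated by end of input
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return i; // continuation must be 10xxxxxx
        }
        i += len;
    }
    return std::string::npos;
}
```

Applied to the 16 bytes of literals.txt, this returns offset 9 (the 0xA0 byte), which lines up with column 10 in the reported error. Applied to the bytes of literals2.txt, it returns `std::string::npos`.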

iain-waugh commented 1 year ago

VHDL standard libraries are in ANSI format, and they use these characters with codes 160 and above. Is this something you can fix in cpp-peglib? The workaround is to require that files containing these characters be supplied as UTF-8.
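
The workaround can also be applied in code, by re-encoding the input before handing it to the parser. This is a rough sketch under one loud assumption: that "ANSI" here means ISO 8859-1 (Latin-1), where each byte value equals its Unicode code point. Windows-1252 differs in the 0x80 to 0x9F range and would need a lookup table instead. The function name `latin1_to_utf8` is hypothetical and not part of cpp-peglib.

```cpp
#include <string>

// Hypothetical helper (not part of cpp-peglib): re-encode ISO 8859-1 bytes
// as UTF-8. ASSUMPTION: the input really is ISO 8859-1, so every byte maps
// directly to the code point U+0000..U+00FF of the same value.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte becomes two
    for (char ch : in) {
        unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            // ASCII bytes are unchanged in UTF-8.
            out.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF encode as two bytes: 110000xx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

With this, the byte 0xA0 from the first dump becomes C2 A0, matching the UTF-8 dump of literals2.txt. The converted string can then be passed to the parser in place of the raw file contents.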

yhirose commented 1 year ago

@iain-waugh I now understand what you mean. Unfortunately, cpp-peglib accepts only UTF-8 text. Since I only need to care about UTF-8 text, I am not planning to support other character encodings such as ANSI, Shift-JIS, and so on. See https://github.com/yhirose/cpp-peglib#unicode-support

The encoding-related functions are linked below. To accept ANSI text, they would need to be modified: https://github.com/yhirose/cpp-peglib/blob/master/peglib.h#L78-L207

Sorry that I can't give you much help...