xexyl / jparse

JSON parser, library and tools written in C

Bug: Some Unicode symbols encoded as \uHexHexHexHex in JSON strings are not decoded properly #13

Closed: lcn2 closed this issue 1 month ago

lcn2 commented 2 months ago

Is there an existing issue for this?

Describe the bug

Certain strings do not seem to be decoded properly by the jstrdecode(1) tool.

SUGGESTION: See GH-issuecomment-2350854096 for some updated commentary on this issue.

Some background

Consider this JSON document containing just a JSON string:

"œßåé"

When you enter that JSON document into https://jsonlint.com/ it shows that the JSON is valid.

Moreover, the jparse(1) tool also shows that the JSON document is valid:

$ jparse -J 3 -s '"œßåé"'
JSON tree[3]:   lvl: 0  type: JTYPE_STRING  len{p,c:q}: 8   value:  "\xc5\x93\xc3\x9f\xc3\xa5\xc3\xa9"
in parse_json(): JSON debug[1]: valid JSON

It is GOOD that the JSON parser is happy with this JSON document.

FYI: Here is what hexdump(1) shows for the string (we include the enclosing double quotes and trailing newline for consistency with the JSON file):

$ echo '"œßåé"' | hexdump -C
00000000  22 c5 93 c3 9f c3 a5 c3  a9 22 0a                 |"........".|
0000000b

Here is how jstrencode(1) processes the string:

$ jstrencode '"œßåé"'
\"œßåé\"

This is CORRECT: the double quotes are backslash-escaped (i.e., \") because within a JSON string, all double quotes MUST be backslashed.

Now if we use -Q, the enclosing double quotes are ignored:

$ jstrencode -Q '"œßåé"'
œßåé

If we consult https://codebeautify.org/json-encode-online and put "œßåé" on the input side, we see that the output side shows the same string.

Evidence of a decoding bug

Now, some JSON tools, such as jsp (see https://github.com/kjozsa/jsp), encode "œßåé" as:

$ echo '"œßåé"' | jsp --no-color --indent 4
"\u0153\u00df\u00e5\u00e9"

When we put the "\u0153\u00df\u00e5\u00e9" string into the input side of https://codebeautify.org/json-encode-online, the output side shows "œßåé".

However if we give the "\u0153\u00df\u00e5\u00e9" string to jstrdecode(1), we get something odd:

$ jstrdecode "\u0153\u00df\u00e5\u00e9"
S���

Using hexdump(1) we see the expected decoded output:

$ echo 'œßåé' | hexdump -C
00000000  c5 93 c3 9f c3 a5 c3 a9  0a                       |.........|
00000009

However, this is what jstrdecode(1) produces:

$ jstrdecode "\u0153\u00df\u00e5\u00e9" | hexdump -C
00000000  01 53 df e5 e9 0a                                 |.S....|
00000006

This suggests that the JSON string decoding may be incorrect.
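
For illustration, here is a minimal sketch (not jparse code) of what a decoder must do with a \uHexHexHexHex escape: the four hex digits name a Unicode code point, and the output must be the UTF-8 encoding of that code point. The buggy 01 53 output above looks like the two raw bytes of the hex value \u0153 were emitted instead of the UTF-8 bytes c5 93.

#include <stdio.h>

/* encode a code point below 0x10000 as UTF-8; return the byte count */
static int
utf8_encode(unsigned int cp, unsigned char buf[4])
{
    if (cp < 0x80) {                /* 1 byte: ASCII */
        buf[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {        /* 2 bytes */
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else {                        /* 3 bytes */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
}

int
main(void)
{
    unsigned int cps[4] = { 0x0153, 0x00DF, 0x00E5, 0x00E9 };   /* œ ß å é */
    unsigned char buf[4];
    int i, j, len;

    for (i = 0; i < 4; ++i) {
        len = utf8_encode(cps[i], buf);
        for (j = 0; j < len; ++j) {
            printf("%02x ", buf[j]);
        }
    }
    putchar('\n');      /* prints: c5 93 c3 9f c3 a5 c3 a9 */
    return 0;
}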

What you expect

We expect the output of

jstrdecode "\u0153\u00df\u00e5\u00e9"

to be the same as:

echo 'œßåé'

Environment

- OS:SO
- Device:eciveD
- Compiler:relipmoC

bug_report.sh output

n/a

Anything else?

Consider the result of using -v 3:

$ jstrdecode -v 3 "\u0153\u00df\u00e5\u00e9"
debug[1]: enclose in quotes: false
debug[1]: newline output: true
debug[1]: silence warnings: false
debug[1]: processing arg: 0: <\u0153\u00df\u00e5\u00e9>
debug[3]: arg length: 24
debug[3]: decode length: 5
S���

The debug messages look OK, but the output is not:

$ jstrdecode "\u0153\u00df\u00e5\u00e9" | hexdump -C
00000000  01 53 df e5 e9 0a                                 |.S....|
00000006

which is not the same as the expected decoded output:

$ echo 'œßåé' | hexdump -C
00000000  c5 93 c3 9f c3 a5 c3 a9  0a                       |.........|
00000009

So it appears that strings with various \u<hex><hex><hex><hex> escapes might not be decoded correctly.

UPDATE 0C

SUGGESTION: See GH-issuecomment-2350854096 for some updated commentary on this issue.

UPDATE 0D - todo list

It appears that the decoding issue itself is resolved. However, a few things have to be done before this can be marked as fixed.

Final todo:

I believe that with those done, as long as things look good, we can mark this as complete.

Side todo:

lcn2 commented 2 months ago

How and why we filed 3 bug reports

We found something that seemed odd regarding JSON string encoding and decoding. It relates to the bin tools from the "Other repo" that we are working on.

While investigating, we discovered the issue with the man page (issue #11). We would have simply fixed the man page, but while investigating further we found the issue with the quotes (issue #12).

The encode / decode behavior puzzled us. Strings with a wide variety of non-ASCII characters (like "àßç") seemed to be valid within JSON strings without the need to "encode" them as "\u" escapes. However other JSON implementations seemed to do just that.

It appears that among the multiple "bogons" of the JSON spec is that there is a "choice": one can use "é" or one can use "\uc3a9" (because é in UTF-8 is c3 a9). Both are valid ways to refer to the same thing. In fact, one can refer to the ASCII "a" as itself or as "\u0061" (ASCII "a" is 0x61). One cannot compare two JSON strings by a simple byte comparison. One MUST decode the JSON strings and then compare byte for byte (groan / sigh).
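
To make that concrete (a standalone illustration, not jparse code): the same decoded string can arrive as different raw JSON bytes, so a byte comparison of the undecoded strings is meaningless. This uses the \u00e9 spelling that https://codebeautify.org/json-encode-online produced earlier in this issue.

#include <stdio.h>
#include <string.h>

int
main(void)
{
    /* two JSON spellings of the same one-character string: é */
    const char *raw1 = "\"\xc3\xa9\"";      /* "é" as literal UTF-8  */
    const char *raw2 = "\"\\u00e9\"";       /* "\u00e9" as an escape */

    /* the undecoded JSON bytes differ ... */
    printf("raw match: %s\n",
           strcmp(raw1, raw2) == 0 ? "yes" : "no");     /* prints: no */

    /* ... yet both MUST decode to the same 2 UTF-8 bytes: c3 a9 */
    return 0;
}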

So that is how we started trying to use jstrencode(1) and jstrdecode(1) to play around encoding and decoding. That is how we discovered what seems to be a decode issue (issue #13).

It is a good thing we did discover these issues now. They do block a fix/improvement to how the "other repo" will be able to render web pages that contain non-ASCII characters (such as in an author's name or an award title). Those happen to work because the JSON so-called spec chose to render the symbols. Had someone produced a string using the equivalent "\u" escapes, things would have been broken (at least in terms of how the web pages would have looked).

Anyway we wanted you to know why we were looking into JSON string encoding/decoding, and why we filed these issues.

Now we will return to the bin tools (we are nearly done, but for these issues and the need for another tool) so that we can return to the submit server, get the screen snapshots, and complete one of the TODOs needed for the Great Fork Merge.

lcn2 commented 2 months ago

a proposal for 2 new options to jstrencode

How should the jstrencode(1) and jstrdecode(1) tools deal with these two ways to represent a symbol?

By default, jstrencode(1) would encode symbols as themselves unless they were one of the symbols that must be backslash-escaped (such as the double-quote, the newline, the NUL byte, etc.).

With jstrencode -a, every symbol must be encoded as \u<hex><hex><hex><hex> according to the 2-byte encoded hex value. Here -a stands for "all".

NOTE: With -a, the double-quote, instead of being backslash-escaped as \", would be encoded as \u0022 (the double-quote symbol has the value 0x22). And lower-case a would be encoded as \u0061 (a has the value 0x61).

With jstrencode -8, any symbol with a value > 127 MUST be encoded as \u<hex><hex><hex><hex> according to the 2-byte encoded hex value. Here -8 stands for "encode if the value requires all 8 bits (i.e., is > 127)".

Of course, the jstrdecode(1) tool will always decode \u<hex><hex><hex><hex> back into the original symbol, so -a and -8 do not apply to that tool.
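
To make the proposal concrete, here is a hypothetical sketch (all names invented, not jparse code) of the per-byte selection logic the two options imply. Note that it encodes each byte of a UTF-8 sequence separately; whether \u should name bytes or code points is exactly the question raised in UPDATE 0 below.

#include <stdbool.h>
#include <stdio.h>

/* hypothetical helper: emit one byte under the proposed -a / -8 rules */
static void
emit_byte(unsigned char c, bool enc_all, bool enc_high)
{
    if (enc_all || (enc_high && c > 127)) {
        printf("\\u%04x", (unsigned int)c);     /* e.g. 'a' -> \u0061 */
    } else if (c == '"') {
        fputs("\\\"", stdout);                  /* default escaping */
    } else {
        putchar(c);
    }
}

int
main(void)
{
    unsigned char in[] = "a\"\xc3\x9f";         /* a, double-quote, UTF-8 ß */
    size_t i;

    for (i = 0; in[i] != '\0'; ++i) {
        emit_byte(in[i], true, false);          /* -a: \u0061\u0022\u00c3\u009f */
    }
    putchar('\n');
    return 0;
}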

UPDATE 0

Do we need to encode at 2-byte UTF-8 or ???

On the one hand, ß is UTF-8 c3 9f, so \uc39f.

On the other hand, ß has a byte value of 0xdf, so \u00df. We DO have examples of external applications encoding characters this way, so at a minimum jstrdecode(1) needs to be able to decode \u00df back to ß.

Which is needed for JSON encoding?

UPDATE 1

According to RFC 4627 Section 3:

3.  Encoding

   JSON text SHALL be encoded in Unicode.  The default encoding is
   UTF-8.

   Since the first two characters of a JSON text will always be ASCII
   characters [RFC0020], it is possible to determine whether an octet
   stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
   at the pattern of nulls in the first four octets.

           00 00 00 xx  UTF-32BE
           00 xx 00 xx  UTF-16BE
           xx 00 00 00  UTF-32LE
           xx 00 xx 00  UTF-16LE
           xx xx xx xx  UTF-8

Does that provide the vital clue?
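
For reference, the quoted pattern test is purely mechanical. A sketch of it (illustration only, assuming at least 4 octets of input):

#include <stdio.h>

/* classify a JSON text per the RFC 4627 NUL-octet patterns above */
static const char *
json_encoding_guess(const unsigned char b[4])
{
    if (b[0] == 0 && b[1] == 0 && b[2] == 0) return "UTF-32BE";
    if (b[0] == 0 && b[2] == 0)              return "UTF-16BE";
    if (b[1] == 0 && b[2] == 0 && b[3] == 0) return "UTF-32LE";
    if (b[1] == 0 && b[3] == 0)              return "UTF-16LE";
    return "UTF-8";
}

int
main(void)
{
    const unsigned char sample[4] = { '"', 0xc5, 0x93, 0xc3 };  /* "œ... */
    printf("%s\n", json_encoding_guess(sample));                /* UTF-8 */
    return 0;
}

Note that this classifies a whole JSON text; it does not by itself say how a \uHexHexHexHex escape inside a string should be decoded.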

xexyl commented 2 months ago

Environment

- OS:SO
- Device:eciveD
- Compiler:relipmoC

:-)

BTW: did you ever notice that the word device is in deceive if you rearrange letters a bit?

xexyl commented 2 months ago

I am waking up more now. The AC might be helping there. Unfortunately I am away from the computer for a little bit.

Your use of Eszett did make me check. It seems that's missing from the UTF-8 map in the mkiocccentry repo.

If I can determine the right \u value I can add it when I'm back, but since I was here I thought I might finish writing my comment (which I believe is almost done, though I do have a bit more to say when I'm back).

Some of these bug reports can be addressed easily, though the new options will of course take more time.

UPDATE 0

Oh! You do have it .. sharp S, but you don't have it as "ss", which it should be. I'll fix that.

UPDATE 1

Eszett fixed in https://github.com/ioccc-src/mkiocccentry/pull/971/commits/f1cb9e64225594a6a7d4696988bfa86806865cdd.

xexyl commented 2 months ago

Okay now back to more serious details.

Caveat

I am very tired so I might be missing something with the thought below. If so I'll either see it later when more alert/awake/clear-headed or maybe what I write will make you think of something.

With that out of the way...

First things first (actually this is the second thing :-) or third if you count the above comment first)

It seems like it might be a good idea to figure out what has to map to what. And since there are so many characters, the question I have is whether we need a table like struct encode jenc[] found in json_parse.c. But maybe it's more like the table struct utf8_posix_map hmap[] in mkiocccentry's soup/utf8_posix_map.c. Or perhaps it's kind of a combination of the two?

I am not sure how this might go, though, due to the way the jenc table works, in particular it holding 8-bit values.

But if we need to encode the Eszett (again as per your example) then what should it be encoded to?

We have found external applications that do encode ß as \u00df. So clearly jstrdecode(1) needs to convert \u00df back into ß. So at a minimum this needs to happen.

Will other external applications encode ß as \uc39f, and if so will jstrdecode(1) need to convert \uc39f back into ß as well?

At a QUICK look at json_parse.c it seems like the reason jparse.c works is due to the function json_conv_string(). The function that calls that is parse_json_string() and it does so like:

str = json_conv_string(string, len, true);

and the boolean indicates:

 *      quote   true ==>  ignore JSON double quotes: both ptr[0] & ptr[len-1]
 *                        must be '"'

Of course it's entirely possible that at this quick glance I am parsing :-) this wrong :-( but I wonder if this might be something to look at. If this is how the parser does it (parse_json_string() is in jparse.y, so it seems likely), then the question is whether that is why jparse gets it right. If it is, maybe it's something to consider? Do we need (in this case at least) json_decode()? I'd say it's not as simple as replacing it, given what that function also does, but the question is: what is different here?

Again this is assuming that I'm not missing something obvious at my quick glance.

UPDATE 0

I was thinking of the reverse step, so the above is not going to work even if it were as simple as swapping it (with some modifications) ... which it almost certainly would not be. Even so, it's interesting that that function does seem to work okay while the decode one does not.

xexyl commented 2 months ago

Ah .. but of course it's the reverse. The parser does the opposite. So it definitely is not so simple. I wonder if something similar can be done though? I'm not sure. I'm hoping I can read all three bug reports more carefully soon. But I also wonder about the other repo's is_available.sh script we were talking about me writing. I might do that sometime soon. But I do have other things I have to do today too.

xexyl commented 2 months ago

QUESTION - priority wrt the IOCCC

As I have other things I have to do including over at mkiocccentry but also things unrelated to programming, I just wanted to ask what the priority of this issue is versus the IOCCC.

As a programmer, and I'm sure you are very familiar with this, this is an issue that I do not like having, but I also know there is such a thing as prioritising so I am asking this here. I hope to look at the is_available.sh script sometime soon but I have those other things I have to do too so I'm not sure what I'll get done today.

xexyl commented 2 months ago

Another funny thing I noticed: reading from stdin seems wrong. Look:

$ jstrdecode 
foo bar
Warning: json_decode: found non-\-escaped char: 0x0a
Warning: jstrdecode_stream: error while encoding stdin buffer

... when I hit ctrl-d obviously. But look:

$ echo foo bar | jstrdecode 
Warning: json_decode: found non-\-escaped char: 0x0a
Warning: jstrdecode_stream: error while encoding stdin buffer

What am I missing? I don't recall this behaviour at all. In fact in the man page there's an example where this works.

xexyl commented 2 months ago

My thinking is that I should resolve the jstrdecode problems brought up (new option included) and commit and then we can worry about this one.

But I must leave for now possibly for the day.

xexyl commented 2 months ago

My thinking is that I should resolve the jstrdecode problems brought up (new option included) and commit and then we can worry about this one.

Actually I wonder if this should be looked at now .. might require a change in how the above is done. But maybe not. I don't know how to even begin to address it however.

xexyl commented 2 months ago

I'm afraid that that bug has to be fixed first in order to properly test this. Trouble is at this time I'm not sure where to begin with it other than checking that warning message (of course). But what that'll show is another matter.

xexyl commented 2 months ago

The problematic code seems to be in json_parse.c:

            /*
             * disallow characters that should have been escaped
             */
            switch (c) {
            case '\b':  /*fallthrough*/
            case '\t':  /*fallthrough*/
            case '\n':  /*fallthrough*/
            case '\f':  /*fallthrough*/
            case '\r':
                /* error - clear allocated length */
                if (retlen != NULL) {
                    *retlen = 0;
                }
                warn(__func__, "found non-\\-escaped char: 0x%02x", (uint8_t)c);
                return NULL;
                break;

But this is coming from sending EOF.

xexyl commented 2 months ago

I am guessing that it is from read_all() not removing the EOF, or else the decode function needs to stop at EOF. But of course, what if the string has an EOF in the middle? I guess one could check that it's EOF at the end of the string, but as I did not write those functions and I am way too tired to be doing this, I will have to stop for now and probably not continue until tomorrow.

By all means please advise on what you think there. Of course since there is very possibly a bug in the decode function anyway ...

xexyl commented 2 months ago

This diff:

-    buf = json_decode(input, inputlen, &bufsiz, NULL);
+    buf = json_decode(input, inputlen-1, &bufsiz, NULL);

solves the problem with the EOF being seen at the end by the decode function. But whether it will cause more problems or not, I do not know. What do you think?
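
A hedged alternative to the unconditional inputlen-1 (a sketch only, reusing the input and inputlen names from the diff): strip a single trailing newline only when one is present, so a stream that does not end in a newline keeps its real final byte. Whether trimming is even the right model, as opposed to requiring the newline to be encoded, is questioned later in this thread.

#include <stddef.h>

/* trim one trailing newline, if any, and return the new length */
static size_t
trim_trailing_newline(const char *input, size_t inputlen)
{
    if (inputlen > 0 && input[inputlen - 1] == '\n') {
        --inputlen;
    }
    return inputlen;
}

/*
 * hypothetical use at the call site shown in the diff above:
 *
 *      buf = json_decode(input, trim_trailing_newline(input, inputlen),
 *                        &bufsiz, NULL);
 */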

xexyl commented 2 months ago

Returning tomorrow, hopefully with better sleep. Unfortunately (and not at all surprisingly) the excessive heat alert was changed from ending at 2000 hours tonight to 2000 hours tomorrow, so that's not fun, but hopefully it's not much longer than that. Not that the alert ending means it won't still be scorching, but any lower is better.

xexyl commented 2 months ago

UPDATE 0a

One issue may be the difference between the UTF-8 encoding and the 8-bit value. The ß symbol in UTF-8 is c3 9f. But as a 1-byte value ß is 0xdf.

What is needed here?

Good question.

We have found external applications that do encode ß as \u00df. So clearly jstrdecode(1) needs to convert \u00df back into ß. So at a minimum this needs to happen.

I guess the table would help there? Not looking at it right now and just replying with this because I had started it earlier and I wanted to have it done before I shut down.

Will other external applications encode ß as \uc39f, and if so will jstrdecode(1) need to convert \uc39f back into ß as well?

That would be quite unfortunate. What do you think we should do there?

lcn2 commented 2 months ago

QUESTION - priority wrt the IOCCC

As I have other things I have to do including over at mkiocccentry but also things unrelated to programming, I just wanted to ask what the priority of this issue is versus the IOCCC.

As a programmer, and I'm sure you are very familiar with this, this is an issue that I do not like having, but I also know there is such a thing as prioritising so I am asking this here. I hope to look at the is_available.sh script sometime soon but I have those other things I have to do too so I'm not sure what I'll get done today.

Issue impacts

Issue #13 could be said to impact progress on the Great Fork Merge, as it potentially impacts the non-ASCII characters on the web site in 4 places. There are "hack-a-rounds" where one might replace the non-ASCII characters with ASCII quasi-equivalents, but at a significant loss of looking nice.

Issue #13 would potentially impact new authors with non-ASCII characters for which there isn't a good ASCII quasi-equivalent. One could say that it impacts the I in IOCCC.

Solving issue #13 might require a fix to some jparse JSON string decoding related code, not just to the jstrdecode(1) tool. That would certainly involve a new release of the mkiocccentry repo: something that should happen before we ask the public to try out the repo code (which is post Great Fork Merge and pre opening of IOCCC28).

Current non-ASCII character impact

Here are the 4 places where non-ASCII characters are found in JSON. There are "hack-a-rounds" where one might replace the non-ASCII characters with ASCII quasi-equivalents, but at a significant loss of looking nice.

OK, there is a workaround to change the fancy quote character to an ASCII single quote.

One could change "lámatyávë" into ASCII "lamatyave", but that would be a shame to do so.

One could change "Anton Älgmyr" into ASCII "Anton Algmyr", but that would be a shame to do so.

One could change "Ondřej Adamovský" into ASCII "Ondrej Adamovsky"", but that would be a shame to do so.

Of course, the above 4 places then cause these web pages to be impacted as a result of the above 4 issues:

The 4 "hack-a-rounds" noted above could address the above web pages, but that would be a shame to do so.

Priority

If there is a fix for this decoding problem, then it should be applied before the Great Fork Merge. If that is not possible, then we could force the 4 "hack-a-rounds" noted above for now, and then potentially have to fix it before we ask for a public trial of this repo. We might have to put up a "KNOWN BUG" notice for this repo, however.

Our suggestion is that you look into the issue enough to determine the type of solution needed and to assess the time/complexity of the fix. Depending on what you find, we might opt to fix things before the Great Fork Merge, or before the public trial of this repo, or before IOCCC28, or ...

p.s.

We could have made a fix to issue #11 and issue #12, but we didn't want to complicate your looking into issue #13, so we stayed out of all 3 rather than potentially impacting your efforts on the more important issue #13.

lcn2 commented 2 months ago

At a QUICK look at json_parse.c it seems like the reason jparse.c works is due to the function json_conv_string(). The function that calls that is parse_json_string() and it does so like:

str = json_conv_string(string, len, true);

and the boolean indicates:

 *      quote   true ==>  ignore JSON double quotes: both ptr[0] & ptr[len-1]
 *                        must be '"'

Of course it's entirely possible that at this quick glance I am parsing :-) this wrong :-( but I wonder if this might be something to look at. If this is how the parser does it (parse_json_string() is in jparse.y, so it seems likely), then the question is whether that is why jparse gets it right. If it is, maybe it's something to consider? Do we need (in this case at least) json_decode()? I'd say it's not as simple as replacing it, given what that function also does, but the question is: what is different here?

Again this is assuming that I'm not missing something obvious at my quick glance.

UPDATE 0

I was thinking of the reverse step, so the above is not going to work even if it were as simple as swapping it (with some modifications) ... which it almost certainly would not be. Even so, it's interesting that that function does seem to work okay while the decode one does not.

The json_decode() function is almost certainly the place the fix will impact.

lcn2 commented 2 months ago

My thinking is that I should resolve the jstrdecode problems brought up (new option included) and commit and then we can worry about this one.

Actually I wonder if this should be looked at now .. might require a change in how the above is done. But maybe not. I don't know how to even begin to address it however.

A potential change to the API is a reason to fix it sooner rather than later. Hopefully it is just a fix to json_decode(), which would be less impactful.

lcn2 commented 2 months ago

Another funny thing I noticed: reading from stdin seems wrong. Look:

$ jstrdecode 
foo bar
Warning: json_decode: found non-\-escaped char: 0x0a
Warning: jstrdecode_stream: error while encoding stdin buffer

... when I hit ctrl-d obviously. But look:

$ echo foo bar | jstrdecode 
Warning: json_decode: found non-\-escaped char: 0x0a
Warning: jstrdecode_stream: error while encoding stdin buffer

What am I missing? I don't recall this behaviour at all. In fact in the man page there's an example where this works.

In JSON encoded strings, a "naked" newline (\n or 0x0a) is NOT allowed. Such characters must be encoded as "\n".

FYI, this works:

echo -n foo bar | jstrdecode

This also works:

printf "%s %s\\\n" foo bar | jstrdecode

As does this using gnu sed:

echo foo bar | gsed -z 's/\n/\\n/g' | jstrdecode

To see what is happening, try with -Q:

$ echo foo bar | gsed -z 's/\n/\\n/g' | jstrdecode -Q
"foo bar
"

UPDATE 0

No to GH-issuecomment-2339232202 and No to GH-issuecomment-2339246212.

Our guess is that it is NOT related to some EOF or the input being one character too long, but rather that a naked newline is NOT allowed within a JSON encoded string. One MUST encode it with these 2 ASCII characters: \n

Or are we missing something?

lcn2 commented 2 months ago

UPDATE 0a

One issue may be the difference between the UTF-8 encoding and the 8-bit value. The ß symbol in UTF-8 is c3 9f. But as a 1-byte value ß is 0xdf. What is needed here?

Good question.

We have found external applications that do encode ß as \u00df. So clearly jstrdecode(1) needs to convert \u00df back into ß. So at a minimum this needs to happen.

I guess the table would help there? Not looking at it right now and just replying with this because I had started it earlier and I wanted to have it done before I shut down.

Will other external applications encode ß as \uc39f, and if so will jstrdecode(1) need to convert \uc39f back into ß as well?

That would be quite unfortunate. What do you think we should do there?

We do not know. You will have to try and research the problem.

SUGGESTION

The problem seems to be on the "JSON string decoding" side. Perhaps the code that converts the 6 ASCII characters: \u[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f] into one or more bytes is broken?
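
As a sketch of the conversion that SUGGESTION describes (not jparse code): parse the 4 hex digits into a code point, then UTF-8 encode that code point (as in the utf8_encode() sketch earlier in this issue).

#include <ctype.h>
#include <stdio.h>

/* parse the 4 hex digits just past a "\u"; return the code point or -1 */
static int
parse_u_escape(const char *p)
{
    int cp = 0;
    int i;

    for (i = 0; i < 4; ++i) {
        if (!isxdigit((unsigned char)p[i])) {
            return -1;          /* malformed escape */
        }
        cp <<= 4;
        if (isdigit((unsigned char)p[i])) {
            cp += p[i] - '0';
        } else {
            cp += tolower((unsigned char)p[i]) - 'a' + 10;
        }
    }
    return cp;
}

int
main(void)
{
    /* "\u0153" yields code point 0x0153, which UTF-8 encodes as c5 93 */
    printf("%04x\n", (unsigned int)parse_u_escape("0153"));
    return 0;
}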

lcn2 commented 2 months ago

We believe we have addressed all of the current questions that still need answering at this time. If we've missed something or something else needs to be clarified, please ask again.

xexyl commented 2 months ago

Ahhh right. The newline! Of course. With the hex value that should be obvious.

But the problem there is: how does one enter EOF when no arg is specified? The input will end up with a newline even if one does something like...

jstrdecode 
foo^D

will show something like:

$ ./jstrdecode -v 5
debug[1]: enclose in quotes: false
debug[1]: newline output: true
debug[1]: silence warnings: false
debug[1]: about to encode all data on input stream
foo^D

Or is this a bug in macOS Terminal?

xexyl commented 2 months ago

UPDATE 0a

One issue may be the difference between the UTF-8 encoding and the 8-bit value. The ß symbol in UTF-8 is c3 9f. But as a 1-byte value ß is 0xdf. What is needed here?

Good question.

We have found external applications that do encode ß as \u00df. So clearly jstrdecode(1) needs to convert \u00df back into ß. So at a minimum this needs to happen.

I guess the table would help there? Not looking at it right now and just replying with this because I had started it earlier and I wanted to have it done before I shut down.

Will other external applications encode ß as \uc39f, and if so will jstrdecode(1) need to convert \uc39f back into ß as well?

That would be quite unfortunate. What do you think we should do there?

We do not know. You will have to try and research the problem.

SUGGESTION

The problem seems to be on the "JSON string decoding" side. Perhaps the code that converts the 6 ASCII characters: \u[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f] into one or more bytes is broken?

Are you referring to the jenc[] table used by both json_decode() and json_encode()? At a glance it seems that is indeed it, but the question is what needs to change. I think that if we're looking at having more than one value for each (or some) char(s), then we might need to expand the table. What do you think?

xexyl commented 2 months ago

Ahhh right. The newline! Of course. With the hex value that should be obvious.

But the problem there is: how does one enter EOF when no arg is specified? The input will end up with a newline even if one does something like...

jstrdecode 
foo^D

will show something like:

$ ./jstrdecode -v 5
debug[1]: enclose in quotes: false
debug[1]: newline output: true
debug[1]: silence warnings: false
debug[1]: about to encode all data on input stream
foo^D

Or is this a bug in macOS Terminal?

Something tells me it might be a bug in macOS Terminal ... which would be unfortunate. But I think that because I know I used to be able to do this; I did so for the man page and all was good. So the question is how to get past this one.

As for how to handle the quotes on stdin versus args when some arg can be '-': the answer might be a dynamic array of strings that are appended as we go and then, at the end, everything is printed. I have a little bit of time but I am not sure if I can complete that task before I have to go. Later today hopefully I'll have more time and energy.

xexyl commented 2 months ago

I think I solved this bug .. except that I cannot test typing EOF at the command line.

xexyl commented 2 months ago

Before I commit

(... or until I do more testing if you don't get back to me) I want to make sure I have used the dyn_array facility correctly, as this is the first time I have done so. It appears to be okay, but since this is my first time using the facility I want to run it by you to make sure that I didn't miss something. The test example code suggests I have it right, but it's entirely possible I missed something (I guess).

To create a dynamic array for a char * I should do:

    array = dyn_array_create(sizeof(char *), JSON_CHUNK, JSON_CHUNK, true);

QUESTION

Should I use JSON_CHUNK or just CHUNK? I think JSON_CHUNK, since it's JSON related.

Next, to append each value I should do:

if (dyn_array_append_value(array, &buf)) {
    dbg(DBG_LOW, "moved data after appending buf: %s", buf);
}

and to get the length of the array I should do:

array_len = dyn_array_tell(array);

and to then write each element I can do something like:

   for (idx = 0; idx < array_len; ++idx) {
        /*
         * write starting escaped quote if requested
         */
        if (esc_quotes) {
            fprint(stdout, "%s", "\\\"");
        }
        buf = dyn_array_value(array, char *, idx);
        bufsiz = (uintmax_t)strlen(buf);
        errno = 0;              /* pre-clear errno for warnp() */
        outputlen = fwrite(buf, 1, bufsiz, stdout);
        if (outputlen != bufsiz) {
            warnp(__func__, "error: write of %ju bytes of arg: %jd returned: %ju",
                            (uintmax_t)bufsiz, idx, (uintmax_t)outputlen);
            success = false;
        }

        /*
         * write ending escaped quote if requested
         */
        if (esc_quotes) {
            fprint(stdout, "%s", "\\\"");
        }
    }

and finally after everything to free the array:

    /*  
     * free dynamic array
     */ 
    if (array != NULL) {
        dyn_array_free(array);
        array = NULL;
    }

Have I missed anything? If not then this bug at least is fixed, though I know there are other issues.

I know that the append function requires a pointer to a pointer, so I had to do &buf, not just buf. Not doing that will segfault. I can run this through valgrind and verify things are okay, but you might want something more done. But of course that could come later I suppose.
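
To illustrate why it must be &buf (a toy model only, not the dyn_array facility itself): appending copies elm_size bytes from whatever address is passed, so for char * elements the copy must read from the pointer's own address.

#include <stdio.h>
#include <string.h>

/* toy model of an append: copy elm_size bytes from value_p into a slot */
static void
toy_append(unsigned char *storage, size_t idx, size_t elm_size,
           const void *value_p)
{
    memcpy(storage + idx * elm_size, value_p, elm_size);
}

int
main(void)
{
    unsigned char storage[4 * sizeof(char *)];
    char *buf = "hello";
    char *out = NULL;

    toy_append(storage, 0, sizeof(char *), &buf);   /* &buf: copy the pointer */
    memcpy(&out, storage, sizeof(char *));
    printf("%s\n", out);                            /* prints: hello */

    /*
     * Passing buf instead would copy the first sizeof(char *) BYTES OF
     * THE STRING into the slot; dereferencing those bytes as a pointer
     * later is what produces the segfault.
     */
    return 0;
}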

UPDATE 0

Okay there is one mistake above I can see. The optind reference. That obviously has to be removed or done differently.

UPDATE 1

Fixed in the above code. As long as arg 0 can stand for no arg specified (read from stdin only), that should be fine.

xexyl commented 2 months ago

And I figured out the bug in Terminal! I have to send EOF twice, not once! Once I do that I get:

$ jstrdecode -v 5 -Qe foo - bar
debug[1]: enclose in quotes: true
debug[1]: newline output: true
debug[1]: silence warnings: false
debug[1]: escaped quotes: true
debug[1]: processing arg: 0: <foo>
debug[3]: arg length: 3
debug[3]: decode length: 3
debug[1]: about to encode all data on input stream
foodebug[5]: fread returned: 3 normal EOF reading stream at: 3 bytes
debug[3]: stream read length: 3
debug[3]: decode length: 3
debug[1]: processing arg: 2: <bar>
debug[3]: arg length: 3
debug[3]: decode length: 3
"\"foo\"\"foo\"\"bar\""

So that's good too.

I should have known too: I had to do that with my Enigma machine ...

xexyl commented 2 months ago

However, valgrind shows memory leaks not yet plugged:

$ valgrind --leak-check=full ./jstrdecode -Q -e foo bar -
==952059== Memcheck, a memory error detector
==952059== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==952059== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==952059== Command: ./jstrdecode -Q -e foo bar -
==952059== 
foo"\"foo\"\"bar\"\"foo\""
==952059== 
==952059== HEAP SUMMARY:
==952059==     in use at exit: 63 bytes in 4 blocks
==952059==   total heap usage: 9 allocs, 5 frees, 133,487 bytes allocated
==952059== 
==952059== 10 bytes in 2 blocks are definitely lost in loss record 2 of 3
==952059==    at 0x484480F: malloc (vg_replace_malloc.c:442)
==952059==    by 0x4057E9: json_decode.part.0 (json_parse.c:993)
==952059==    by 0x406CF7: json_decode (json_parse.c:823)
==952059==    by 0x406CF7: json_decode_str (json_parse.c:1215)
==952059==    by 0x402B41: main (jstrdecode.c:273)
==952059== 
==952059== 48 bytes in 1 blocks are definitely lost in loss record 3 of 3
==952059==    at 0x484BF70: calloc (vg_replace_malloc.c:1595)
==952059==    by 0x41D220: dyn_array_create (dyn_array.c:672)
==952059==    by 0x4105D4: read_all (util.c:2205)
==952059==    by 0x40300A: jstrdecode_stream (jstrdecode.c:114)
==952059==    by 0x402A90: main (jstrdecode.c:255)
==952059== 
==952059== LEAK SUMMARY:
==952059==    definitely lost: 58 bytes in 3 blocks
==952059==    indirectly lost: 0 bytes in 0 blocks
==952059==      possibly lost: 0 bytes in 0 blocks
==952059==    still reachable: 5 bytes in 1 blocks
==952059==         suppressed: 0 bytes in 0 blocks
==952059== Reachable blocks (those to which a pointer was found) are not shown.
==952059== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==952059== 
==952059== For lists of detected and suppressed errors, rerun with: -s
==952059== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)

But besides that it does seem good. I'll let this sit for a bit. I have to leave soon anyway.

UPDATE 0

Of course, in the valgrind output the program's output looks incorrect, but that is just the way the two output streams are merged together.

xexyl commented 2 months ago

As for the issue of the non-ASCII chars being decoded/encoded wrong: it is true that this might require updating another function (one we both thought of), but I think that if the above seems okay then I will commit it, and then we can at least have the new option added AND the fixed -Q option as well. Then we can focus on the other more complicated issue.

I will be busy doing other things in about 20 minutes for a while and when back I'll see how I feel. I hope to at least look at the other issues here but I do have some other things I must take care of, I'm afraid. That being said it's quite likely I'll have more time later today, more than usual, so if I can focus my eyes that will be good.

xexyl commented 2 months ago

Actually ... the above change for the option and the quotes breaks jstr_test.sh. It's bizarre too. At first thought it seems like it shouldn't. I'll have to investigate this further.

xexyl commented 2 months ago

Actually ... the above change for the option and the quotes breaks jstr_test.sh. It's bizarre too. At first thought it seems like it shouldn't. I'll have to investigate this further.

Okay it looks like the output of:

./jstrencode -v 5 -n < ./jstrdecode | ./jstrdecode -v 5 -n > foo

in the modified version does not show the full file. I am not sure why at this point but I guess (at a quick thought) it has to do with the change in how the writing to stdout is done and how the args are parsed too.

Unfortunately I have other things I must do so I'll do those things now and then hopefully later I can look into this more. Maybe I can look at the other issues too.

xexyl commented 2 months ago

I have a feeling that the problem has to do with the way that the dynamic array facility is being used.

I have however updated the encode version to use it too.

But the problem obviously has to be worked out before a commit.

I must go afk for a while and when back I hope to look at this in more detail, but as noted elsewhere I do have other things I need to do today besides the usual things.

xexyl commented 2 months ago

FYI these tests fail:

jstr.txt

xexyl commented 2 months ago

Actually I'm noticing something extremely strange... with debug output. For instance:

debug[1]: enclose in quotes: falsedebug[1]: skip quotes: false
debug[1]: newline output: false
debug[1]:
debug[1]: newline output: falsesilence warnings: false

debug[1]: silence warnings: falsedebug[1]: encoding from stdin
debug[1]: about to encode all data on input stream

which, although it happens before the array processing, makes me think I might be using it wrong on my first go at it, since why else would the output be messed up? However, it's curious to note that it's only on test #2 that this fails, which is the command:

./jstrencode -v 5 -n < ./jstrdecode | ./jstrdecode -v 5 -n > ./test_jparse/jstr_test.out

Now #1 and #2 failed but #3 did not. So something funny is going on. I don't recall if this happened yesterday when I only had a modification to the decode one. I wonder.

Hmm, I wonder if I should try it on another clone. Okay: it's not a problem with the library. Obviously it's time for a break, but that's perfect timing anyway as I have other things I have to do.

Anyway more details that might or might not (given above) be of use:

jstr.txt

lcn2 commented 2 months ago

QUESTION

Do you still need a reply to GH-issuecomment-2340820041?

And do you still need a reply to GH-issuecomment-2341165850?

We don't know if recent activity has reduced their need or not.

Let us know and we will respond accordingly.

.. we are about to return to active status as per GH-issuecomment-2342694050.

xexyl commented 2 months ago

QUESTION

Do you still need a reply to GH-issuecomment-2340820041?

Something tells me it is relevant but if the answer is 'yes' then that's all that's necessary I would guess.

lcn2 commented 2 months ago

QUESTION

Do you still need a reply to GH-issuecomment-2340820041?

Something tells me it is relevant but if the answer is 'yes' then that's all that's necessary I would guess.

The iPhone forced us to compose in several passes, so you replied too quickly.

Please re-read and reply again. Sorry (tm Canada 🇨🇦)

xexyl commented 2 months ago

QUESTION

Do you still need a reply to GH-issuecomment-2340820041?

As noted earlier I think it probably does need to be discussed, but perhaps the question is in what way. Maybe it's whether we need to expand it to support more than one encoding of different characters - but I do not know.

And do you still need a reply to GH-issuecomment-2341165850?

The problem was that NUL bytes can be in the strings, so strlen() would not work right. This meant that the dynamic array facility could not be used. Instead I used a linked list with a simple struct for the jstr tools. The good thing is I think I have a good idea of how to use the facility now, at least the basics.
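
For the record, a sketch of that approach (names invented here, not the actual jstr tools code): each node carries an explicit byte count so embedded NUL bytes survive, and output uses fwrite(3) with that count rather than anything strlen(3)-based.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct jstr_node {
    char *buf;                  /* decoded bytes; may contain NULs */
    size_t len;                 /* byte count, NOT strlen(buf) */
    struct jstr_node *next;
};

/* allocate a node holding a copy of len bytes (NUL-safe via memcpy) */
static struct jstr_node *
jstr_node_new(const char *bytes, size_t len)
{
    struct jstr_node *n = calloc(1, sizeof(*n));

    if (n == NULL) {
        return NULL;
    }
    n->buf = malloc(len);
    if (n->buf == NULL) {
        free(n);
        return NULL;
    }
    memcpy(n->buf, bytes, len);
    n->len = len;
    return n;
}

/* write every node using its recorded length, never strlen() */
static void
jstr_list_write(const struct jstr_node *head, FILE *out)
{
    while (head != NULL) {
        (void)fwrite(head->buf, 1, head->len, out);
        head = head->next;
    }
}

int
main(void)
{
    struct jstr_node *n = jstr_node_new("a\0b", 3);     /* embedded NUL */

    if (n != NULL) {
        jstr_list_write(n, stdout);     /* writes all 3 bytes */
        free(n->buf);
        free(n);
    }
    return 0;
}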

So unless you think I was missing something there, you no longer need to concern yourself with this comment.

We don't know if recent activity has reduced their need or not.

Let us know and we will respond accordingly.

Thanks for asking. One problem was resolved (as the edit above shows) but the general encoding/decoding bug or bugs have to be discussed.

.. we are about to return to active status as per GH-issuecomment-2342694050.

Thanks.

xexyl commented 2 months ago

QUESTION

Do you still need a reply to GH-issuecomment-2340820041?

Something tells me it is relevant but if the answer is 'yes' then that's all that's necessary I would guess.

The iPhone forced us to compose in several passes, so you replied too quickly.

Please re-read and reply again. Sorry (tm Canada 🇨🇦)

Not a problem. It does not help that the GitHub app UI, when replying to a comment or making a post or a comment that is not a reply, covers everything up.

xexyl commented 2 months ago

I just took a look at the log file again and I noticed something interesting. In the test cases that failed, the buffer size changes drastically in the pipeline. For example, in test #3, which passes:

cat $SRC_SET | ./jstrencode -v 5 -n | ./jstrdecode -v 5 -n > ./test_jparse/jstr_test.out
debug[1]: skip quotes: false
debug[1]: newline output: false
debug[1]: silence warnings: false
debug[1]: encoding from stdin
debug[1]: about to encode all data on input stream
debug[1]: enclose in quotes: false
debug[1]: newline output: false
debug[1]: silence warnings: false
debug[1]: about to encode all data on input stream
debug[5]: fread returned: 39941 normal EOF reading stream at: 367621 bytes
debug[3]: stream read length: 367621
debug[3]: encode length: 392295
debug[3]: array length: 1
debug[3]: processing array index: 0
debug[3]: bufsiz: 392295
debug[5]: fread returned: 64615 normal EOF reading stream at: 392295 bytes
debug[3]: stream read length: 392295
debug[3]: decode length: 367621
debug[3]: array length: 1
debug[3]: processing array index: 0
debug[3]: bufsiz: 367621
./test_jparse/jstr_test.sh: test #3 passed

but when it fails, say from the command line:

./jstrencode -v 5 -n < ./jstrencode | ./jstrdecode -v 5 -n > ./test_jparse/jstr_test.out

it goes from 606340 to 5! The same thing happens with the other one that fails: the value drops down to 5.

Wonder what is going on. Perhaps I need to print out that buffer next.

xexyl commented 2 months ago

In the case where it is 5, the data is non-ASCII followed by ctrl-L. Interesting. I wonder what could be causing that.

Well again probably not something I can worry about today but maybe I can ponder it.

UPDATE 0

It does make me think though that maybe the array usage is incorrect. Given that I have not used it before and it's quite extensive that is quite possible.

UPDATE 1

Though jnum_gen.c might suggest otherwise. It seems about right unless I'm missing something obvious for some reason.

xexyl commented 2 months ago

I just had a thought. How does dyn_array manage NUL bytes in strings? It would make sense for strlen() to report 5 if it reached a NUL byte at that point. I don't know if there are NUL bytes in the program, but if so it might be causing the problem. If that is it, I have no idea how to solve it though.

BTW: the bytes in the output file that are failing are the first five bytes in the jstrencode/jstrdecode programs. Maybe that is what is causing the problem?
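
A quick standalone demonstration of that suspicion (not the tools' code): strlen(3) stops at the first NUL byte, so a buffer whose sixth byte is NUL reports a length of 5, matching the drop to 5 seen above.

#include <stdio.h>
#include <string.h>

int
main(void)
{
    /* ten bytes with a NUL in the sixth position */
    const char buf[10] = { 'a', 'b', 'c', 'd', 'e', '\0', 'f', 'g', 'h', 'i' };

    printf("strlen: %zu of %zu bytes\n", strlen(buf), sizeof(buf));
    /* prints: strlen: 5 of 10 bytes */
    return 0;
}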

xexyl commented 2 months ago

Aha! There IS a NUL byte in the string! Just tried the is_string() function and it confirms it.

So we need a way to save the number of bytes of each element. Is elm_size what we need? I'll try it out ...

UPDATE 0

Actually I'm not sure if that's so easy. Is there a function that can do that?

UPDATE 1

Okay it's for the array itself as I feared.

Maybe a linked list or something else is needed for this?

UPDATE 2

Doing other things for the rest of the day but maybe this thought and test I just had and made will help figure out how to solve the problem.

xexyl commented 2 months ago

With a linked list I fixed the NUL byte problem! I have to do some cleanup, which includes freeing the list before the program ends, but it at least works .. though with one little issue that might have to be resolved. For instance, if I do:

./jstrdecode -
foo^D

it shows:

foofoo

which of course is wrong. But make test works.

Unfortunately I have other things to do too.

UPDATE 0

Actually .. that problem exists in the old way too so that's probably okay - though not ideal; it's likely because of the terminal.

xexyl commented 2 months ago

Okay, the linked list in each is freed (did this a while ago). Going through last checks before a commit. make release seems good, it seems clean, and the -Q and -e options now work for jstrdecode.

It does not solve all the problems you discovered, of course, but it does solve numerous problems and it gets past the problem of the NUL byte too.

xexyl commented 2 months ago

I have edited the comment for you to eyeball later, where you asked about two comments in particular. I hope this helps.

xexyl commented 2 months ago

Just did a couple of commits, including a bug fix to pr_jparse_test. Now working on syncing the jparse repo into the mkiocccentry repo. Once done with that I have to do other things for the rest of the day, but at least -Q and -e should work now. We can discuss the other things later.

xexyl commented 2 months ago

BTW: if you have a proposed fix to the json_encode() or json_decode() functions PLEASE open a pull request. I guess we're going to have to discuss the details first though.

Back to what I am working on before I leave for the day.

xexyl commented 2 months ago

With commit c6cc9bd17045c62dea36460ab610ee69fccb69e1 I have added an extra sanity check for the jencchk() function.

Previously it assumed that JSON_BYTE_VALUES is 256 and then went on to verify that TBLLEN(jenc) == JSON_BYTE_VALUES under that assumption. Now it first verifies that the macro is 256.
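
In toy form, the shape of that check is something like this (a stand-in sketch; the real jencchk() code, table, and error handling differ):

#include <assert.h>

#define JSON_BYTE_VALUES (256)                  /* as described above */
#define TBLLEN(tbl) (sizeof(tbl) / sizeof((tbl)[0]))

static const int jenc[JSON_BYTE_VALUES];        /* toy stand-in for the real table */

int
main(void)
{
    assert(JSON_BYTE_VALUES == 256);            /* first verify the assumption */
    assert(TBLLEN(jenc) == JSON_BYTE_VALUES);   /* then the original check */
    return 0;
}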

I synced this to the mkiocccentry repo too.

Leaving for the day ... back tomorrow in some form.