xexyl / jparse

JSON parser, library and tools written in C
3 stars 3 forks source link

Bug: the concept of `jstrencode(1)` appears to be both very wrong and very buggy #28

Open xexyl opened 1 week ago

xexyl commented 1 week ago

Is there an existing issue for this?

Describe the bug

Based on JavaScript encoding of JSON, there appear to be multiple problems with our tool.

First of all, it appears that the \ being converted to a \\ is wrong. I will give examples in the what we should expect section.

The next problem is that code points should be converted to unicode symbols just like with jstrdecode(1). This is how it is with JavaScript too: both encode and decode need to do this.

Another problem is that the \ escape chars should not be done the way we have it. See what to expect for examples.

What you expect

In order of the problems above, here are examples of what jstrencode(1) does and what JavaScript does.

If we have the JavaScript:

const json_to_encode = {
          name: "\u0f0ff\bo"
        };

it SHOULD (i.e. this is what JavaScript does) encode to:

{"name":"༏f\bo"}

but our tool converts the string to:

$  jstrencode '"\u0f0ff\bo"'
\"\\u0f0ff\\bo\"

Notice the double \ before the b! Now on the subject of the escaped quotes in the beginning see the anything else section.

Now as far as the code points go, javascript of:

const json_to_encode = {
          name: "\u0f0f"
        };

converts to:

{"name":"༏"}

but our tool does this:

$ jstrencode '{"name": "\u0f0f"}'
{\"name\": \"\\u0f0f\"}

or as just a string:

$ jstrencode '"\u0f0f"'
\"\\u0f0f\"

Now if the \uxxxx was converted to a unicode symbol it might just be that the \" surrounding the output would be the difference but I am not sure of this.

The third problem appears to be even worse. There might be other cases where something like this happens but anyway the JavaScript:

const json_to_encode = {
          name: "\u0f0ff\co"
        };

... turns into:

{"name":"༏fco"}

Notice how the \c has the \ silently removed and the c is by itself (or rather it's after the unicode symbol and before the o).

But our tool does something extremely wrong. First as a string by itself:

$ jstrencode '"\u0f0ff\co"'

0;xexyl@xexyz:~$

.. or as the json with {}:

$ jstrencode {"name":"\u0f0ff\co"}

0;xexyl@xexyz:~$

As can be seen the encoding concept we have appears to be totally wrong.

Environment

jparse_bug_report.sh output

n/a

Anything else?

As for the escaped " surrounding the string. I guess it depends on how this tool will be used but even so it would appear to be the wrong default.

As for the doubling of \ it also depends on how we need to use it.

But clearly the \ of non-valid escape chars seems to be wrong.

Of course it could be that it's because I am tired or because I do not really know JavaScript but it might not be. I think it probably isn't either of those. But of course depending on what our tool needs to do that deviates from a normal json encoder ....

xexyl commented 1 week ago

One thing that should be done and which I will do is to rename the utf8encode() function to not suggest encoding or decoding but rather unicode. This way there is no confusion in the decode function.

xexyl commented 1 week ago

Something I do not understand, looking at the so-called json spec and the JavaScript output, is that for the non valid \- characters, it would appear that the \ should just be part of the string. But then in JavaScript it looks like the \ is removed and the character is there.

I presume that this is not understanding the standard correctly (if we can call something about it 'correct' :-) ) or because I am tired and not reading the grammar right. I rather suspect it's a bit of both.

lcn2 commented 1 week ago

We are in the process of making some minor edits to jstrdecode(1) (and friends). We will apply these same edits to the mkiocccmetry repo. This in order to fix some minor issues with make quick_www and make www in the "other (temp-test-ioccc) repo".

We do not expect these edits to me major, just a few tweaks needed.

Once the make quick_www and make www work well, we will issue a PR with respect to this repo as well as an commit update to the mkiocccmetry repo and finally a commit to the "other (temp-test-ioccc) repo".

Please, @xexyl, stand by / try to withhold any edits and PRs for short term. We will advise when we are done.

xexyl commented 1 week ago

We are in the process of making some minor edits to jstrdecode(1) (and friends). We will apply these same edits to the mkiocccmetry repo. This in order to fix some minor issues with make quick_www and make www in the "other (temp-test-ioccc) repo".

We do not expect these edits to me major, just a few tweaks needed.

Once the make quick_www and make www work well, we will issue a PR with respect to this repo as well as an commit update to the mkiocccmetry repo and finally a commit to the "other (temp-test-ioccc) repo".

Please, @xexyl, stand by / try to withhold any edits and PRs for short term. We will advise when we are done.

Thank you! I won't make any edits. I am done for the day and given that (as I brought up in the other thread) I don't even know what you have in mind for the functionality I probably can't do much anyway.

lcn2 commented 1 week ago

We have a solution that appears to work well for make www. We are doing to dinner and then we will come back a re-test. If all does well we will issue a PR for this repo as well as commits for the other repos.

lcn2 commented 1 week ago

We have a solution that appears to work well for make www. We are doing to dinner and then we will come back a re-test. If all does well we will issue a PR for this repo as well as commits for the other repos.

UPDATE 0

When we return after dinner, we plan to update the documentation as per our minor edits.

lcn2 commented 1 week ago

NOTE: In all of these example, we do NOT have a trailing newline.

Regarding a string with a trailing backslash, and here our examples, consider this string:

foo\

When encoded by jstrencode(1), would produce, by default, the following JSON string:

"foo\\"

All good. But let's now mess with things. 🤭

What should happen if we were to give this to jstrdecode(1):

"foo\"

In the above, we have 4 characters foo\ enclosed with double quotes.

What should happen? An ERROR should be thrown.

Depending on how you want to slice the problem, one could consider the string to end with \" and thus lacks the necessary trailing end double quote, OR the string does have enclosing double quotes but has a dangling backslash.

Either way, the JSON encoded string is invalid and the resulting JSON is not valid.

It's your choice as to what error you'd prefer to throw.

We recommend such a test case be added to the "invalid JSON" test suite.

lcn2 commented 1 week ago

We have a solution that appears to work well for make www. We are doing to dinner and then we will come back a re-test. If all does well we will issue a PR for this repo as well as commits for the other repos.

UPDATE 0

When we return after dinner, we plan to update the documentation as per our minor edits.

See PR #29

UPDATE 0

With pr #29 and release 1.6.13 2024-11-16 of the mkiocccentry repo and commit c47f3abbe47cc63c33a5f4b99ac6ca35cb773dfe of the temp-test-ioccc repo we have now entered a soft code freeze for the mkiocccentry repo.

UPDATE 1a

Please see, @xexyl, GH-issuecomment-2480929883.

Please also see GH-issuecomment-2481266359.

xexyl commented 1 week ago

Good things above. I'll reply now. Unfortunately did not have time to address anything before the mkiocccentry sync but these things are not really important for there I guess. I don't think you addressed everything I brought up in this issue but we can discuss those later.

xexyl commented 1 week ago

NOTE: In all of these example, we do NOT have a trailing newline.

Regarding a string with a trailing backslash, and here our examples, consider this string:

foo\

When encoded by jstrencode(1), would produce, by default, the following JSON string:

"foo\\"

All good. But let's now mess with things. 🤭

What should happen if we were to give this to jstrdecode(1):

"foo\"

In the above, we have 4 characters foo\ enclosed with double quotes.

What should happen? An ERROR should be thrown.

Depending on how you want to slice the problem, one could consider the string to end with \" and thus lacks the necessary trailing end double quote, OR the string does have enclosing double quotes but has a dangling backslash.

Either way, the JSON encoded string is invalid and the resulting JSON is not valid.

It's your choice as to what error you'd prefer to throw.

We recommend such a test case be added to the "invalid JSON" test suite.

Other than this issue there are others as well.

It appears to me that we would have to keep track of each character in the decoding code. That would complicate matters greatly. We deliberately designed the regexp in the parser/lexer to not worry about certain details since we could take care of them in the code outside the parser/scanner. But the ending quotes would be easily solved by passing the strings through the parser for validation.

Some might say this would be overly complicated but it might be the cleaner way since otherwise we would have to add even more code to the decode functions (and maybe encode too?). It would be far easier to use the parser to do this and after all, isn't that the point of it - to verify it's valid and to provide a way to traverse it? We do not have to traverse it but it using the parse function would sure simplify these tools when it comes to errors.

What do you think?

xexyl commented 1 week ago

Unsure but it might also help with some of the other problems in this issue. Not sure. Just a thought. I'll look back at this thread and the OT one tomorrow. Got behind in things I have to do today so doing those things now.

lcn2 commented 1 week ago

NOTE: In all of these example, we do NOT have a trailing newline. Regarding a string with a trailing backslash, and here our examples, consider this string:

foo\

When encoded by jstrencode(1), would produce, by default, the following JSON string:

"foo\\"

All good. But let's now mess with things. 🤭 What should happen if we were to give this to jstrdecode(1):

"foo\"

In the above, we have 4 characters foo\ enclosed with double quotes. What should happen? An ERROR should be thrown. Depending on how you want to slice the problem, one could consider the string to end with \" and thus lacks the necessary trailing end double quote, OR the string does have enclosing double quotes but has a dangling backslash. Either way, the JSON encoded string is invalid and the resulting JSON is not valid. It's your choice as to what error you'd prefer to throw. We recommend such a test case be added to the "invalid JSON" test suite.

Other than this issue there are others as well.

It appears to me that we would have to keep track of each character in the decoding code. That would complicate matters greatly. We deliberately designed the regexp in the parser/lexer to not worry about certain details since we could take care of them in the code outside the parser/scanner. But the ending quotes would be easily solved by passing the strings through the parser for validation.

Some might say this would be overly complicated but it might be the cleaner way since otherwise we would have to add even more code to the decode functions (and maybe encode too?). It would be far easier to use the parser to do this and after all, isn't that the point of it - to verify it's valid and to provide a way to traverse it? We do not have to traverse it but it using the parse function would sure simplify these tools when it comes to errors.

What do you think?

We do NOT think that internal string formatting, such as trialing backslashes and Unicode/UTF-8, belong in the bison/flex scanner/parser. We strongly recommend against involving the bison/flex scanner/parser in details such as determine if a JSON strings properly encoded.

lcn2 commented 1 week ago

We need to move on and focus on the Great Fork Merge and getting ready for the IOCCC28.

After the IOCCC28 closes, and the judges go on a vacation, and we address calc v3 fork, we can return to considering improvements to this JSON parser.

UPDATE 0

Sorry (tm Canada 🇨🇦).

xexyl commented 1 week ago

I just meant if we were to invoke it it would be simple to solve these problems. If not though how do you propose to solve the problems?

My point was that the parser would flag these as errors and the problem would be solved. We would not have to update the scanner/parser at all.

But if you can think of something better or easier please let me know.

xexyl commented 1 week ago

We need to move on and focus on the Great Fork Merge and getting ready for the IOCCC28.

After the IOCCC28 closes, and the judges go on a vacation, and we address calc v3 fork, we can return to considering improvements to this JSON parser.

UPDATE 0

Sorry (tm Canada 🇨🇦).

I'm looking forward to it so no problem. I also have other things to concern myself with as well.

These do not seem critical, certainly, even if it'd be better if it was resolved. The example you gave is easy to see what is happening but solving it is more complicated than figuring out what it is.

Anyway best wishes. We can (as I think I said already) talk about this another time after the next IOCCC.

lcn2 commented 1 week ago

I just meant if we were to invoke it it would be simple to solve these problems. If not though how do you propose to solve the problems?

My point was that the parser would flag these as errors and the problem would be solved. We would not have to update the scanner/parser at all.

But if you can think of something better or easier please let me know.

Let the routines that form the parse tree call functions that detect what the scanner/parser thinks of as a string, and if the JSON string is not properly encoded (for whatever reason), raise an error and not return the parse tree. That way, the scanner/parser is agnostic as to the internal content of things such as strings. Meanwhile from the point of an application calling the JSON scanner/parser, a bogus encoding of a JSON string is still flagged as an error.

We hope this helps.

UPDATE 0

Some typos fixed.

UPDATE 1

Time to move on for us as we need to focus on the two significant open Great Fork Merge TODOs.

xexyl commented 1 week ago

I just meant if we were to invoke it it would be simple to solve these problems. If not though how do you propose to solve the problems? My point was that the parser would flag these as errors and the problem would be solved. We would not have to update the scanner/parser at all. But if you can think of something better or easier please let me know.

Let the routines that form the parse tree call functions that detect what the scanner/parser thinks of as a string, and if the JSON string is not properly encoded (for whatever reason), raise an error and not return the parse tree. That way, the scanner/parser is agnostic as to the internal content of things such as strings. Meanwhile from the point of an application calling the JSON scanner/parser, a bogus encoding of a JSON string is still flagged as an error.

That, as I read it, is exactly what I was getting at. Using the parser can determine if the string is properly encoded. If it's not an error is raised and the decoding function is not called. If it is valid JSON it can be called.

Also as an extra sanity check we could invoke the JSON parser on the encode tool after encoding, to verify that it's valid JSON.

I'll do that but I have some other things in my mind to do with jparse. First one I think will improve error messages when working with strings and not files.

We hope this helps.

Yes it does.

UPDATE 0

Some typos fixed.

Ever notice that TYPOS is an anagram for SPOT + Y? And TYPO is an anagram for POT+Y? :-)

UPDATE 1

Time to move on for us as we need to focus on the two significant open Great Fork Merge TODOs.

Great!

Today I have a blood draw, unfortunately. Not that vampires really bother me but I'd rather not go out. Also this place might insist I sit down and I really don't want to esp as I always change clothes after sitting down in another's chair and our washer is not working right (hopefully it can be repaired Wednesday). So I'm doing everything I can to not change outer clothes until it's fixed.

Hope you have a great day! In a while I will probably be afk a bit but I do hope to do a number of things here today. I also have other things to do too of course.

Hope you slept well or you're having a nice sleep! I also hope that if you're back home that you had a nice holiday and you're safe and sound and well!

THANK YOU for the feedback! It won't solve all the issues I raised in this thread but it will solve one that you came up with (I did not even think of the example you gave but it's a great one to fix).

xexyl commented 1 week ago

That new functionality is added .. just running make test. If all is good I'll then work on the jstrdecode issue. But I might be going afk before that: not sure.

xexyl commented 1 week ago

Next thing actually is fixing the man page ... for newer functions and strings. But I have to go afk now for a while.

Once I've done the man pages I'll then return to jstrdecode. Already added the new functionality (as in committed) of the string parsing wrt better error messages.

xexyl commented 1 week ago

I ended up updating the library in a significant way. I noticed that some function names are ambiguous. Thus the names were renamed. However given that these are internal functions it should not matter. Namely:

-extern struct json *parse_json_string(char const *string, size_t len);
-extern struct json *parse_json_bool(char const *string);
-extern struct json *parse_json_null(char const *string);
-extern struct json *parse_json_number(char const *string);
-extern struct json *parse_json_array(struct json *elements);
-extern struct json *parse_json_member(struct json *name, struct json *value);
+extern struct json *json_parse_string(char const *string, size_t len);
+extern struct json *json_parse_bool(char const *string);
+extern struct json *json_parse_null(char const *string);
+extern struct json *json_parse_number(char const *string);
+extern struct json *json_parse_array(struct json *elements);
+extern struct json *json_parse_member(struct json *name, struct json *value);

They were too similar to the parse_json() functions in name.

I also updated the jparse.3 man page.

Unfortunately I also accidentally ODed on a medication so I am feeling quite awful. This should not happened but unfortunate circumstances led to this. I had a blood draw but that was postponed until Friday. I'll be fine though.

I hope to look at the parse call in jstrdecode but I'm not sure if I'll get to that today. If I don't today I won't until Wednesday at least most likely.

However I think once it is done it'll be worth it. It won't solve all problems here but it will solve some.

Of course none of this needs to be synced to mkiocccentry right now .. can wait until after IOCCC28 unless of course you need the jstrdecode bug fixes applied for the website.

I hope you're doing well with the Great Fork Merge! I do have other things I have to do and given what happened it seems that I'll have to start doing that soon. Though I would think the adding the parser to jstrdecode won't take much effort: though it'll also require a new option, namely -J. It might also be good to have an option that says to NOT use the parser. I can think of a few cases like the jstr_test.sh file where this might be useful but I'll have to wait and see .. ideally would be good to fix that script too.

xexyl commented 1 week ago

Functionality added .. have to fix jstr_test.sh. If all is good I'll commit and that'll be all I do today - well after updating documentation I'll commit.

xexyl commented 1 week ago

Hmm .. strange. The input '"foo\"' did cause an error. Still the parser being there is a good thing so I'll commit soon. Then I must go.

UPDATE 0

Well okay it only warned but now it shows an error.

xexyl commented 1 week ago

Oh great .. now there is a linker error in linux:

/usr/bin/ld: ../libjparse.a(json_parse.o): in function `json_process_floating':
/home/xexyl/code/jparse/json_parse.c:3083: undefined reference to `floorl'
/usr/bin/ld: /home/xexyl/code/jparse/json_parse.c:3111: undefined reference to `floor'
/usr/bin/ld: /home/xexyl/code/jparse/json_parse.c:3133: undefined reference to `floorl'
xexyl commented 1 week ago

Nothing should have changed here .. no idea how to fix. Feature test macros do not work :(

xexyl commented 1 week ago

This problem has been going on for some time it seems. But I don't know why. Not good.

xexyl commented 1 week ago

My concern is what if it's something that's a blocker for people in linux using it for mkiocccentry? I'll open a bug report and then I must leave.

xexyl commented 1 week ago

Interesting .. even adding -lm to LDFLAGS and I do not see it in the compile line. Hmm....

xexyl commented 1 week ago

Got it. It was util_test in test_jparse. Okay that's a relief. Will fix it but that means it'll have to be updated in mkiocccentry. Sorry :(

xexyl commented 1 week ago

Oh great .. now there is a linker error in linux:

/usr/bin/ld: ../libjparse.a(json_parse.o): in function `json_process_floating':
/home/xexyl/code/jparse/json_parse.c:3083: undefined reference to `floorl'
/usr/bin/ld: /home/xexyl/code/jparse/json_parse.c:3111: undefined reference to `floor'
/usr/bin/ld: /home/xexyl/code/jparse/json_parse.c:3133: undefined reference to `floorl'

With commit ef45fa81f1cfb263fd4d0b9835681ac424ea2dac this horrible mess should be prevented in the future. This caused the problems in mkiocccentry and another problem here too. The -lm was not even needed it turns out, all because of the makefile.local file I had on the server!

If you wish me to remove that functionality from dyn_array and dbg I can do that but not until tomorrow. I know mkiocccentry is in code freeze status so that can happen later. I just hope I can fix my clone .. but might have to wait until another commit is pushed to it as it seems that it might want to merge (for reasons beyond my comprehension) commits that were closed.

Good luck with the Great Fork Merge and good day! It's going to be a difficult, long day and that's why I'm stopping now .. working on other things for the time being but probably a bad idea to work here until tomorrow.

lcn2 commented 1 week ago

TIME TO PAUSE

We urgently need to FOCUS on IOCCC28 matters and this will have to respectfully delay for months responding to comments about this repo. Unless we do that, IOCCC28 is not going to start this calendar year ... and we don't want to deal with the consequences of that!