zserge / jsmn

Jsmn is a world fastest JSON parser/tokenizer. This is the official repo replacing the old one at Bitbucket
MIT License
3.64k stars 778 forks source link

RFC 8259, bitwise checks, next siblings, doxygen comments, and more #197

Open themobiusproject opened 4 years ago

themobiusproject commented 4 years ago

This PR does a number of things, but the main points are:

Code-wise, the opening and closing of containers, colons, and commas have been pulled out of jsmn_parse and put into their own functions (jsmn_parse_ prefix with container_open, container_close, colon, and comma respectively). jsmn_parse is extremely clean. Overall, the code is quite verbose but most of it reduces to bit checks and comparisons.

For post processing, every token has multiple flags set.

These base flags are combined for complex types such as:

Primitives are further flagged:

This and a lot of validation is done with an enum that can be packed into two bytes. With the enum, vanilla jsmntok_t, parent, and sibling, the final size of the jsmntok_t can be reduced to 12 bytes, or 3 ints, per token.

I have also added over 300 validation tests courtesy of JSONTestSuite. The tests are set up in three groups: the first are the tests that should pass, second are the tests that shouldn't pass, and third are the tests that may pass based on the json parser implementation.

This PR should resolve #118, resolve #131, resolve #137, resolve #158, resolve #163, resolve #177, resolve #178, and resolve #179. I believe this can also resolve #164 by being able to traverse the tokens by sibling.

This should be able to close PR #93, PR #102, PR #108, PR #166, PR #168, PR #172, PR #192, and PR #198.

I also plan on getting utf8 into here with a flag to enable or disable it as to avoid jsmn ballooning in size for small devices.

A strong set of permissive rules would be welcome to further refine my rules.

atomsnc commented 4 years ago

Hello,

There seems to be an issue in this PR where a string like char s[] = "{\"array\":[1,2]}"; fails, when an array is inside an object.

It can be fixed by changing if (type & (JSMN_OBJECT | JSMN_INSD_OBJ)) to if (type & (JSMN_OBJECT | JSMN_INSD_OBJ) && !(type & JSMN_ARRAY)) in jsmn_parse_comma

themobiusproject commented 3 years ago

@atomsnc - Sorry, I saw your post and then life hit me for a bit.

Very good catch. I don't know how I wasn't testing for that case previously. I think your fix fixes the symptom but not the cause, so I solved the problem a little further up by ignoring JSMN_INSD_OBJ altogether.

Thank you, and I am also pleased that someone could get through my code.

pt300 commented 3 years ago

A massive pr

pt300 commented 3 years ago

I really like how strict mode is the default here with possibility of enabling specific relaxations. Especially ones that help cut corners just to squeeze extra performance by reducing conformance. I also think introduction of the sibling feature is a good idea, helps a lot when traversing a more complex JSON.

Myself I mostly was thinking about trying to fix issues with JSMN without introducing additional fields to parser's state. But honesty, by just saving what's expected in the future should just result in more performant code, as bitwise checks ought to be faster and cleaner than deriving what should come next from current tokens and input string.

I think I will try to mix your PR with @dominickpastore to deal with certain parts that just personally felt ugly to me. Throughout the next week or two I will try to battle with git in an attempt to do it properly and push the results onto a separate branch.

themobiusproject commented 3 years ago

@pt300 I appreciate the comments.

A massive pr

It certainly is.

I swiped the next_sibling idea from #68 and shorter tokens and unsigned ints from #93 years ago when I first found jsmn.

Myself I mostly was thinking about trying to fix issues with JSMN without introducing additional fields to parser's state. But honesty, by just saving what's expected in the future should just result in more performant code, as bitwise checks ought to be faster and cleaner than deriving what should come next from current tokens and input string.

I made sure to keep the original types and added new types to be able to check for after parsing, and these led into adding extra states into the same enum for deeper checks by knowing what is legal as the next types. I worked hard to keep everything in a 32-bit uint.

I like the enums and bitwise checks because it is calculated at compile time but makes the code easier to read in places. expected = JSMN_PRI_INTEGER | JSMN_PRI_DECIMAL | JSMN_PRI_EXPONENT | JSMN_CLOSE; lets you be able to compare the code to the json rfc standard easily and it is all compile time code.

I think I will try to mix your PR with @dominickpastore to deal with certain parts that just personally felt ugly to me. Throughout the next week or two I will try to battle with git in an attempt to do it properly and push the results onto a separate branch.

I look forward to it. I hope you can feel the spirit of jsmn's current release in my PR because it was a great starting point and I tried to keep as much of it as I could while make some significant changes. I am curious though; what parts feel ugly to you? I know I can write some incredibly dense code that isn't the easiest to follow.

pt300 commented 3 years ago

After the last sweep through your code it's been mostly few smaller things like the use of defined() in preprocessor macros instead of #ifdef and #ifndef. There was something else but I cannot remember at this time. I will be looking thoroughly once more and will let you know if I find any issues. Also, just a reminder that I am no expert. Feel free to say so if what I say feels wrong.

atomsnc commented 3 years ago

@themobiusproject I feel like a worthwhile addition would be to have a coalesce strategy in the utils for open tokens when the tokens array cannot be large.

For tokens which have token.end != JSMN_NEG, they can either by dropped or moved out with a callback.

next_sibling will need to be computed again.

Something like

int coalesce_open_tokens(jsmntok_t *tokens, jsmn_parser *parser) {
    jsmnint_t i, index = 0;

    for (i = 0; i < parser->count; i++) {
        if (tokens[i].end == JSMN_NEG) {
            tokens[index++] = tokens[i];            
        }
#if defined(JSMN_NEXT_SIBLING)
        else {
            /* Relink siblings if any */
            for (jsmnint_t j = 0; j < i; j++) {
                if (tokens[j].next_sibling == i) {
                    tokens[j].next_sibling = tokens[i].next_sibling;
                    break;
                }
            }
        }
#endif
    }
    /* Return 0 if all the tokens are being worked on. We truly cannot add more. */ 
    if (index == parser->count) return 0;
    parser->count = index;
    parser->toknext = index;
    return 1;
}

Above code is untested.

themobiusproject commented 3 years ago

@pt300

After the last sweep through your code it's been mostly few smaller things like the use of defined() in preprocessor macros instead of #ifdef and #ifndef. There was something else but I cannot remember at this time. I will be looking thoroughly once more and will let you know if I find any issues. Also, just a reminder that I am no expert. Feel free to say so if what I say feels wrong.

I actually also prefer #ifdef and #ifndef, but the reason I went with #if defined() and #if !defined() is because of the occasional instance of #if defined() && defined() which looked horrible when surrounded by #if{,n}defs.

@atomsnc

I feel like a worthwhile addition would be to have a coalesce strategy in the utils for open tokens when the tokens array cannot be large.

Could you better describe this use case? I am trying to figure out where it fits in so I can better understand the code snippet.

atomsnc commented 3 years ago

@themobiusproject When the tokens array for storing tokens while parsing cannot be too big (especially in embedded devices). In that case when the array gets full, user can evict a few completed tokens to parse a bit more.

pt300 commented 3 years ago

I have picked most of the changes you've done in this PR and those are now visible in experimental branch. For now I have left out the utilities and the example utilising them.

Definitely there are more changes coming to what is on the branch right. Mostly cleaning up the code and actually making sure there's no space for nasty surprises. Then, it will go back to adding features again.

holgerschurig commented 2 years ago

@pt300 @themobiusproject I just checked out the experimental branch that contains this patch. However, now example/simple.c doesn't parse the JSON in it anymore. This here still works:

{"user": "johndoe", "admin": false, "uid": 1000, "groups": ["users"]}

but this here won't parse anymore:

{"user": "johndoe", "admin": false, "uid": 1000, "groups": ["users", "wheel", "audio", "video"]}

This are the warnings during compilation and the (invalid) result of the run:

holger@holger:~/jsmn$ gcc -Wall -Wextra -oa example/simple.c 
example/simple.c: In function ‘jsoneq’:
example/simple.c:16:50: warning: comparison of integer expressions of different signedness: ‘int’ and ‘jsmnint’ {aka ‘unsigned int’} [-Wsign-compare]
   16 |   if (tok->type == JSMN_STRING && (int)strlen(s) == tok->end - tok->start &&
      |                                                  ^~
example/simple.c: In function ‘main’:
example/simple.c:66:21: warning: comparison of integer expressions of different signedness: ‘int’ and ‘jsmnint’ {aka ‘unsigned int’} [-Wsign-compare]
   66 |       for (j = 0; j < t[i + 1].size; j++) {
      |                     ^
holger@holger:~/jsmn$ ./a
Failed to parse JSON: -3

I also tried then PRs 194-196 on top of master. And with that, the example/simple.c still runs.

themobiusproject commented 2 years ago

@holgerschurig I just added the json from the simple.c to my tests and it parses. You can check 0560339. When @pt300 merged my code, he grabbed it before 9f7e9de1 which did leave an error in parsing. That commit needs to be applied to the experimental branch and it should fix the problem.

As for the files in the example, I only ever touch explode.c because I wrote that one along with the jsmn_explodeJSON function associated with it. You can copy JSON_STRING from simple.c (or your unescaped version in your previous comment) into library.json in the root directory and run explode from the examples folder, it will parse and show more information than you will probably want to see.

pt300 commented 2 years ago

Sorry again for the inactivity.

This looks odd, as I remember the tests were passing. I will make sure to look into this tomorrow and put the patch in.

I also really wanted to simplify some parts of the parsing, such as reparsing some tokens from the beginning instead of trying to continue parsing. Should cut down complexity without major performance hit + make the code easier to maintain and more difficult to break. But, you can see how well it's going. Hope you can forgive me.

pt300 commented 2 years ago

I have applied the changes, but there seem to be still an issue with parsing. I will have to revisit it some other time.