zserge / jsmn

Jsmn is a world fastest JSON parser/tokenizer. This is the official repo replacing the old one at Bitbucket
MIT License
3.64k stars 778 forks source link

Comply with RFC 8259 in strict mode #194

Open dominickpastore opened 4 years ago

dominickpastore commented 4 years ago

This PR brings strict mode in compliance with RFC 8259 (as well as some other changes):

At the same time, this led to some significant changes to non-strict mode (consistent with comments I previously made on issue #154). In summary, non-strict mode is now limited to the following three deviations from RFC 8259: 1) Primitives can be made of any character that doesn't have special meaning in JSON (whitespace and any of {}[],:"), 2) Strings can contain any character and invalid backslash escapes, and 3) Primitives can be object keys, not just strings. Reasoning for these changes is below. (But see also comment later.)

Unenclosed objects, missing values, and missing commas

Non-strict mode previously allowed the following examples to parse:

"server": "127.0.0.1"
"port": 80
"message": "Hello world"
{"a":1 "b":2}
{
  "a":1,
  "b"
}

I believe this was intentional; however, this relaxed validation carried over into strict mode in several places (see issues #52, #131, #158). I added a simple state machine to the parser to fix the issues in strict mode, but fixing strict mode without affecting non-strict mode would have been difficult. As a result, non-strict mode no longer allows the above examples. I can think of ways we could make it work, but the added complexity didn't seem worth it.

Arrays and objects as keys

Non-strict mode no longer allows arrays and objects as object keys. This eliminates issue #193, but breaks backward compatibility for some non-RFC-8259-compliant JSON. Here is the reasoning:

Conclusion

These are some pretty significant changes to jsmn, but seeing the number of issues about invalid input being accepted, I hope they are well received. I have some ideas for future improvements also, but I will discuss that on the Roadmap issue (#159).

dominickpastore commented 4 years ago

Pull request is in draft mode for now as I need to make a couple changes to fix parsing multiple JSON objects at once and improve validation for primitives, but I wanted to go ahead and get it out there in case there is any potential discussion.

dominickpastore commented 4 years ago

The remaining changes are now complete, and I've added test cases for the new functionality. Marking as "Ready to review."

themobiusproject commented 4 years ago

This is pretty close to what I have come up with myself. I went about the problem from two sides, one being to extend jsmn_t, the second adding an extra field to the jsmn_parser.

My jsmn_t ends up being:

typedef enum {
  JSMN_UNDEFINED = 0x00,
  JSMN_OBJECT    = 0x01,    //!< Object
  JSMN_ARRAY     = 0x02,    //!< Array
  JSMN_STRING    = 0x04,    //!< String
  JSMN_PRIMITIVE = 0x08,    //!< Other primitive: number, boolean (true/false) or null

#ifdef JSMN_STRICT
  JSMN_ANY_TYPE  = 0x0F,    //!< (OBJECT | ARRAY | STRING | PRIMITIVE)

  JSMN_CLOSE     = 0x10,    //!< Close OBJECT '}' or ARRAY ']'
  JSMN_DELIMITER = 0x20,    //!< Colon ':' after KEY, Comma ',' after VALUE

  JSMN_KEY       = 0x40,    //!< is a key
  JSMN_VALUE     = 0x80,    //!< is a value
#endif
} jsmntype_t;

and the jsmn_parser ends up as

typedef struct jsmn_parser {
  jsmnint_t pos;            //!< offset in the JSON string
  jsmnint_t toknext;        //!< next token to allocate
  jsmnint_t toksuper;       //!< superior token node, e.g. parent object or array
#ifdef JSMN_STRICT
  jsmnenumtype_t expected;  //!< Expected jsmn type(s)
#endif
} jsmn_parser;

With this, after the opening of an object '{' or an array '[' I can set the set the next item that is expected:

if (token->type & JSMN_OBJECT) {
  parser->expected = (JSMN_STRING | JSMN_CLOSE);
} else {
  parser->expected = (JSMN_ANY_TYPE | JSMN_CLOSE);
}

and then check parser->expected against what the next item actually is

if ((parser->expected & JSMN_PRIMITIVE) == 0) {
  // error handling
}
// continue with parsing

This allows for bitwise checks for error handling and while being able to put in strong rules for what can follow what.

themobiusproject commented 4 years ago

I also like leaving the combinations (minus JSMN_ALL_TYPE) implicit, but I could also extend jsmntype_t to include the explicit classes you have, though I would suggest letting bitwise logic set the values for you.

typedef enum {
  JSMN_UNDEFINED = 0x00,
  JSMN_OBJECT    = 0x01,    //!< Object
  JSMN_ARRAY     = 0x02,    //!< Array
  JSMN_STRING    = 0x04,    //!< String
  JSMN_PRIMITIVE = 0x08,    //!< Other primitive: number, boolean (true/false) or null

#ifdef JSMN_STRICT
  JSMN_ANY_TYPE  = 0x0F,    //!< (OBJECT | ARRAY | STRING | PRIMITIVE)

  JSMN_CLOSE     = 0x10,    //!< Close OBJECT '}' or ARRAY ']'
  JSMN_DELIMITER = 0x20,    //!< Colon ':' after KEY, Comma ',' after VALUE

  JSMN_KEY       = 0x40,    //!< is a key
  JSMN_VALUE     = 0x80,    //!< is a value

  JSMN_OBJECT_KEY    = (JSMN_KEY | JSMN_OBJECT),
  JSMN_ARRAY_KEY     = (JSMN_KEY | JSMN_ARRAY),
  JSMN_STRING_KEY    = (JSMN_KEY | JSMN_STRING),
  JSMN_PRIMITIVE_KEY = (JSMN_KEY | JSMN_PRIMITIVE),
#endif
} jsmntype_t;
dominickpastore commented 4 years ago

This is pretty close to what I have come up with myself. I went about the problem from two sides, one being to extend jsmn_t, the second adding an extra field to the jsmn_parser.

My jsmn_t ends up being:

typedef enum {
  JSMN_UNDEFINED = 0x00,
  JSMN_OBJECT    = 0x01,    //!< Object
  JSMN_ARRAY     = 0x02,    //!< Array
  JSMN_STRING    = 0x04,    //!< String
  JSMN_PRIMITIVE = 0x08,    //!< Other primitive: number, boolean (true/false) or null

#ifdef JSMN_STRICT
  JSMN_ANY_TYPE  = 0x0F,    //!< (OBJECT | ARRAY | STRING | PRIMITIVE)

  JSMN_CLOSE     = 0x10,    //!< Close OBJECT '}' or ARRAY ']'
  JSMN_DELIMITER = 0x20,    //!< Colon ':' after KEY, Comma ',' after VALUE

  JSMN_KEY       = 0x40,    //!< is a key
  JSMN_VALUE     = 0x80,    //!< is a value
#endif
} jsmntype_t;

and the jsmn_parser ends up as

typedef struct jsmn_parser {
  jsmnint_t pos;            //!< offset in the JSON string
  jsmnint_t toknext;        //!< next token to allocate
  jsmnint_t toksuper;       //!< superior token node, e.g. parent object or array
#ifdef JSMN_STRICT
  jsmnenumtype_t expected;  //!< Expected jsmn type(s)
#endif
} jsmn_parser;

With this, after the opening of an object '{' or an array '[' I can set the set the next item that is expected:

if (token->type & JSMN_OBJECT) {
  parser->expected = (JSMN_STRING | JSMN_CLOSE);
} else {
  parser->expected = (JSMN_ANY_TYPE | JSMN_CLOSE);
}

and then check parser->expected against what the next item actually is

if ((parser->expected & JSMN_PRIMITIVE) == 0) {
  // error handling
}
// continue with parsing

This allows for bitwise checks for error handling and while being able to put in strong rules for what can follow what.

That is an interesting approach. Do you plan to share the implementation in a pull request?

dominickpastore commented 4 years ago

I also like leaving the combinations (minus JSMN_ALL_TYPE) implicit, but I could also extend jsmntype_t to include the explicit classes you have, though I would suggest letting bitwise logic set the values for you.

typedef enum {
  JSMN_UNDEFINED = 0x00,
  JSMN_OBJECT    = 0x01,    //!< Object
  JSMN_ARRAY     = 0x02,    //!< Array
  JSMN_STRING    = 0x04,    //!< String
  JSMN_PRIMITIVE = 0x08,    //!< Other primitive: number, boolean (true/false) or null

#ifdef JSMN_STRICT
  JSMN_ANY_TYPE  = 0x0F,    //!< (OBJECT | ARRAY | STRING | PRIMITIVE)

  JSMN_CLOSE     = 0x10,    //!< Close OBJECT '}' or ARRAY ']'
  JSMN_DELIMITER = 0x20,    //!< Colon ':' after KEY, Comma ',' after VALUE

  JSMN_KEY       = 0x40,    //!< is a key
  JSMN_VALUE     = 0x80,    //!< is a value

  JSMN_OBJECT_KEY    = (JSMN_KEY | JSMN_OBJECT),
  JSMN_ARRAY_KEY     = (JSMN_KEY | JSMN_ARRAY),
  JSMN_STRING_KEY    = (JSMN_KEY | JSMN_STRING),
  JSMN_PRIMITIVE_KEY = (JSMN_KEY | JSMN_PRIMITIVE),
#endif
} jsmntype_t;

It seems you have already noticed, but for the benefit of other readers, I use a new jsmnstate_t enum in the same way as your extended jsmntype_t:

typedef enum {
  JSMN_STATE_ROOT = 0x0,          /* At root */
  JSMN_STATE_OBJ_NEW = 0x16,      /* Expecting object key or } */
  JSMN_STATE_OBJ_KEY = 0x12,      /* Expecting object key */
  JSMN_STATE_OBJ_COLON = 0x13,    /* Expecting object colon */
  JSMN_STATE_OBJ_VAL = 0x10,      /* Expecting object value */
  JSMN_STATE_OBJ_COMMA = 0x15,    /* Expecting object comma or } */
  JSMN_STATE_ARRAY_NEW = 0x24,    /* Expecting array item or ] */
  JSMN_STATE_ARRAY_ITEM = 0x20,   /* Expecting array item */
  JSMN_STATE_ARRAY_COMMA = 0x25   /* Expecting array comma or ] */
} jsmnstate_t;

I probably could have used explicit constants for individual bits to make the code more readable and maintainable (in fact, I probably should add it still):

typedef enum {
  JSMN_DELIMITER = 0x1,         /* Expecting colon if JSMN_KEY, comma if not */
  JSMN_KEY = 0x2,               /* Expecting an object key */
  JSMN_CAN_CLOSE = 0x4,         /* Object or array can be closed */

  JSMN_CONTAINER_OBJ = 0x10,    /* Inside an object */
  JSMN_CONTAINER_ARRAY = 0x20,  /* Inside an array */

  JSMN_STATE_ROOT = 0x0,        /* At root */
  JSMN_STATE_OBJ_NEW = (JSMN_CONTAINER_OBJ | JSMN_KEY | JSMN_CAN_CLOSE),
  JSMN_STATE_OBJ_KEY = (JSMN_CONTAINER_OBJ | JSMN_KEY),
  JSMN_STATE_OBJ_COLON = (JSMN_CONTAINER_OBJ | JSMN_KEY | JSMN_DELIMITER),
  JSMN_STATE_OBJ_VAL = (JSMN_CONTAINER_OBJ),
  JSMN_STATE_OBJ_COMMA = (JSMN_CONTAINER_OBJ | JSMN_DELIMITER | JSMN_CAN_CLOSE),
  JSMN_STATE_ARRAY_NEW = (JSMN_CONTAINER_ARRAY | JSMN_CAN_CLOSE),
  JSMN_STATE_ARRAY_ITEM = (JSMN_CONTAINER_ARRAY),
  JSMN_STATE_ARRAY_COMMA = (JSMN_CONTAINER_ARRAY | JSMN_DELIMITER | JSMN_CAN_CLOSE)
} jsmnstate_t;

The reason for the explicit combinations, though, is those are the actual values that the parser state can have at any given time.

dominickpastore commented 4 years ago

One other thing I forgot to put in the description: calling jsmn_parse() with tokens set to NULL skips a lot of the validation that would otherwise happen, for the sake of performance. (The assumption being that such a call would almost always be followed by a call where tokens has been allocated, so no need to do all the validations twice.)

I will add an extra bullet point to the description for this.

dominickpastore commented 4 years ago

@themobiusproject I made some improvements based on your suggestion to use bitwise logic for the enum values. I think it helps readability a lot.

dominickpastore commented 4 years ago

I've added a branch with a fourth non-strict extension in my own fork: valueless-keys. It adds the ability to parse objects with missing values, e.g.:

{
  "a": 1,
  "b",
  "c": 3
}

I already have a bunch of pull requests pending here, so I'm not going to submit another one right now, but I thought I should at least mention it since this was originally allowed under non-strict mode.

themobiusproject commented 4 years ago

Let me finish up cleaning up a few bits so all of the previous tests pass again and I will at least put in a link to my fork. I have a lot of changes beyond this (cmake, visual studio, xcode projects, doxygen comments, tests in cmocka, extra json.{c,h} for wrapping jsmn) that don't necessarily go with just this issue and would have to pull the correct pieces for a clean pull request.

themobiusproject commented 4 years ago

@dominickpastore I'm forked over at https://github.com/themobiusproject/jsmn. I am going to start moving pieces to their own branches to try some pull requests.

master, right now, only has two tests left as FIXME and they are non-strict tests which I haven't been looking into. I cleared out some of the original strict type checks as the parser->expected versions covered them.

I also like your idea of setting a flag to avoid strict checks a second time if it has already been done once to find the number of tokens.

The reason for the explicit combinations, though, is those are the actual values that the parser state can have at any given time.

Also, regardless of how long the enum is, it won't touch change compile size or performance, so having each and every option does seem like a good idea; I'll just ignore looking at it in the code.

dominickpastore commented 4 years ago

@dominickpastore I'm forked over at https://github.com/themobiusproject/jsmn. I am going to start moving pieces to their own branches to try some pull requests.

Just want to clarify: if you weren't already planning to create pull requests, don't feel like you have to just because I brought it up. I was just curious.

themobiusproject commented 4 years ago

I was looking at it previously because I like what I have come up with and would be happy to share it with others, especially through the official repo. Since this pushed me to finish the bugs that I had, I was going to get back to it.

noodlecollie commented 3 years ago

I have a quick question regarding the improvements to strict mode: do these changes also fix the following?

{ 
  "array":
  [ 
    "missing a comma"
    "between these items"
  ]
}

Even with strict mode enabled, this results in the same tokens as if a comma had been present between the two items. jsonlint.com rejects the JSON above as not being valid. If it's fixed in this PR then great; otherwise, I'll make a separate issue.

dominickpastore commented 3 years ago

@x6herbius, this PR should fix that. That said, I'm not sure if this repository has been abandoned. The maintainers seem to have disappeared. Thus, while I had some other ideas for improvements and even submitted PRs for two of them (#195, #196), I've given up on making further contributions.

I do have a couple more changes I made on my own fork (see valueless-keys and unsigned-types branches there), but I don't plan to submit PRs for them unless these existing PRs get accepted.

PR #197 may also be worth checking out. They've made similar sorts of improvements (including proper RFC compliance, I believe) through a different style of implementation.

pt300 commented 3 years ago

@dominickpastore I'm still here. I haven't done much as it's been difficult for me due to how big the changes feel. While JSMN is not in the best shape at the moment, it is being dependent on by quite some people and I honestly fear breaking anything just because I did not analyse the changes properly.

To give this repo a bit of life I will try to make a new experimental branch and possibly merge some of the PRs there. Maybe that way it will be possible to get better feedback from community about the changes. And I hope you won't feel like your contributions are a waste.

dominickpastore commented 3 years ago

@pt300 That's good to hear. I was a bit concerned when @themobiusproject and I added some discussion to some of the roadmap issues (#159 and #154) and it seemed to go unnoticed.

I don't want to assume my PRs are the best for the project, either (especially now that there is also PR #197 for a lot of the same stuff). I made the changes because they were what I needed. But they did align with some parts of the roadmap, so I figured I'd submit them to help kickstart the discussion again.

Other thing I would say: This is just my opinion, but it sounds like you are putting too much pressure on yourself. I'd bet, chances are, as long as JSMN can parse standard JSON, 90% of people will be happy. For the rest, it's up to them to speak up if they have a problem with the roadmap. And even if they don't speak up, nothing is forcing them to stop using the current version--that's the beauty of open source.

themobiusproject commented 3 years ago

@dominickpastore @pt300 I am still hanging around here as well.

Both @dominickpastore's PR #194, #195, and #196 and my PR #197 are all valid options. I had once tried to break my PR into smaller changes but continued having chicken and egg problems so all of the changes are lumped into one. I also will comment, @pt300, dominick's and my changes won't merge together as we have both made many fundamental changes that won't jive together (though I know I have swiped some ideas from dominick's code to fix issues in mine).

I certainly don't feel like my contributions have been a waste. There was someone who posted in my PR with a problem that he was having so it seems like someone has been able to benefit what I have done.

alexanderankin commented 3 years ago

it may be useful to note which branch is the most up to date and recommended for new projects, on whatever repository/fork, while issues of backwards compatibility and breaking get resolved.