Roadmap - Githubissues

zserge / jsmn

Jsmn is a world fastest JSON parser/tokenizer. This is the official repo replacing the old one at Bitbucket

MIT License

3.62k stars 774 forks source link

Roadmap #159

Open pt300 opened 5 years ago

pt300 commented 5 years ago

Here I'd like present my suggestions for what I believe could be done to improve quality of JSMN. Everyone should feel free to comment on this and give own suggestions. Also, I will need opinion from @zserge on this.

Split strict and non-strict parsing code Idea behind it is that this would make it easier to maintain both versions without having to overcomplicate the code and to eliminate risk of breaking strict version when trying to improve non-strict one
Make strict mode the default Considering that JSMN is mostly aiming at parsing JSON strings and not it's abbreviations the library should do it by default.
Enable parent links by default Parent links feature enables a great speed boost at a quite low memory cost. I believe a lot of people aren't using this library on really memory constrained platforms. But if there's a need for it the user should just disable parent links. It could be probably nice to rename the define to something like JSMN_LOW_MEMORY.
Strict mode should follow RFC-8259 There are a few standards with slight differences for JSON. I believe it should be explicitly said that we would follow RFC-8259 as THE standard for our JSON parsing.
Maybe create a package for Arduino Considering the fact JSMN aim at platforms that are similar to what Arduino offers I believe it would be nice to either offer this library as is to Arduino community or separately write a wrapper to make access to this library more C++ like.
Bit flags for token types #108 Small change that would make it easier to write parsing code in certain places.

Utilities

Token array traversing Using JSMN in a safe manner can be difficult. Because of that I think it would be nice to write an optional header file with utilities to easily and safely traverse JSMN's output. These should work at least for the strict mode.
Values decoding At this point JSMN made the decision of not parsing the values. It could be left this way but then I believe it would be great to optionally offer utilities that could take for example a string from a token and decode it into UTF-8 or a widechar string.

ahogen commented 5 years ago

I've been using JSMN in a somewhat memory constrained application. The 16-bit micro has 64kB internal flash, but due to other required libraries, I didn't have a lot of room to spare. So keeping a low-memory footprint in mind is important to me. JSMN_LOW_MEMORY sounds great.

Utilities

Values decoding ...

Some platforms I've used don't have the non-standard itoa() functions. So when I'm generating JSON replies, I've had to create a local_itoa() sort of function, optimized to only support bases 2 - 16. This is more of a values encoding idea, but maybe something to discuss?

xdevelnet commented 5 years ago

So, first of all I'd like you to thank for such nicely done piece of free software. I really appreciated of it.

And like ahogen, I also use this parser.

Ok, since you're planning to make significant changes to this project, I'd like to post here some thoughts about this:

At this point JSMN made the decision of not parsing the values

For my humble opinion you should at least check if this is even printable character in case of regular strings. If in double quotes pair suddenly happens binary garbage - that should not be parsed anyway, welp.

From other point of view, this project is not JSON validator, but JSON parser. Maybe it's better to throw away any validation since current validation is not even close to what should be and focus on performance and low memory usage. Then, recommend users to use separate modules for JSON validation, like this: http://www.json.org/JSON_checker/JSON_checker.c

pt300 commented 5 years ago

Well, for the JSON to be parsable it has to be valid so as long as the parser is never wrong then if we are able to parse the thing then we are basically validating it. There's no way out of it. In JSMN values aren't really fully parsed and so they are not validated.

How I think it could work is that the utilities for parsing JSON values as seen by JSMN will have to adhere to RFC-8259 anyway and so they will be able to tell you whether there is a problem with value itself. For example if token is of type string but it contains for example a byte of value 0x05, then the value parser should return an error as this is invalid according to the standard.

pt300 commented 5 years ago

Also I'm feeling the a lot of pressure here as it seems like the original creator of this project either abandoned it or is too busy/tired to work on it and beside him I am the only person that can do anything with this repository. I have ambition of keeping JSMN as great as it is when it comes to parsing speed and memory usage but make it better at actually parsing JSON. Right now in my opinion there are quite a bit parsing bugs in JSMN which mostly come from the fact of how it's built.

Right now I have an idea of how JSMN could be slightly modified in a sense of idea of how the parsing happens but it would mean somewhat of a large code changes. And so I thought that why not as well just think of what else could be changed and just make JSMN the best thing out there that's easy to use.

xdevelnet commented 5 years ago

In JSMN values aren't really fully parsed and so they are not validated.

Currently, if \n happens in the middle of " " pair, JSMN still will parse it. BTW it could be fixed, so I'd like to ask: are you accepting PR's or you're already started developing an update for this project so there is not reason to even open PR?

Also I'm feeling the a lot of pressure here

Dunno. It's free software, people are free to modify it for their own needs. For my humble opinion, from human philosophy PoV: If nothing was given then nothing should be expected. This thought partially described here: https://github.com/do-not-theme/do-not-theme.github.io/issues/3#issuecomment-496958706

I have ambition of keeping JSMN as great as it is when it comes to parsing speed and memory usage but make it better at actually parsing JSON.`

Hmmm... It reminds me an old joke about having a fast, reliable, strong, and cheap vehicle at same time. I have no idea if you can keep the same performance with all required validation.

pt300 commented 5 years ago

Haven't really done any work myself yet. So far it's all planning and getting the courage to do any changes to this repository. After all I'm not @zserge and this is basically official repository of JSMN. I was given responsibility I haven't exactly asked for :p

Feel free to create PR. Although I think the way library works should be discussed to make it as close to perfect without weird trade-offs. But you are right, there are limits to what can be achieved here.

zserge commented 4 years ago

@pt300 I really like the suggested list of changes. Strict mode by default, compatibility with RFC8259, Arduino package and bitflags - all sounds good.

But I want to share my vision of the 'non-strict mode'. Basically, it's not a feature of JSMN, but rather a side-effect of "what would happen if we #undef some of the parser code". This eliminates certain sanity checks, and parses invalid but nicely looking JSON-ish text, with unquoted strings and the absence of the root object. This has never been a priority for JSMN (in my mind), and I don't think it deserves a separate implementation. If, by undef-ing some code form the RFC-compatible strict implementation we can achieve some syntax sugar for JSON - fine, if not - I'd rather forget about non-strict mode at all.

zserge commented 4 years ago

Additionally, I would rather invest into advanced documented supersets of JSON, like JSON5, rather than in the undocumented, barely known non-strict mode.

pt300 commented 4 years ago

What about the ability of estimation of token array length? Considering it works only for specific cases I think it's not a really good feature to keep around, especially that it seems to add more complexity to the code than it's worth.

CaCtus491 commented 4 years ago

In my case I need to open a JSON object from an embedded file system based storage that may potentially be too large (up to ~64kb) to fit into the available RAM (~4 - 8kB), pull certain configuration values from it, then I'm done. I'm not particularly concerned about speed. Allocating RAM space for the token array is doable.

I currently have JSMN implemented and working well, and have been getting by though minimising the size of the JSON string (remove whitespace, ensure that only required JSON elements are present, etc) and maximising the heap space available to fit larger objects; however we're starting to feel the pinch as other areas of the code grow and start to demand more RAM.

I have been toying with the idea of abstracting the access to the underlying original JSON string so that it can still be backed by a raw array of bytes as it currently is, or alternatively a paged based solution that will only require a portion of the JSON string to be in RAM at once.

I'm unsure if this is even possible with the structure of JSMN, and even if it is, it is probably too much of a change to be considered based the 'light weight and fast' ethos (which I very much appreciate).

If I do get around to implementing it, I will certainly make the changes available.

chrismerck commented 4 years ago

Make strict mode the default / RFC-8259

Non-strict mode is important to the extent that JSON implementations differ in the wild (or, at least, to the extent that the "few standards" differ.)

a paged based solution that will only require a portion of the JSON string to be in RAM at once... if this is even possible with the structure of JSMN

JSMN is simple enough to already be "complete" --- so you should consider forking and modifying as you need, manually merging bugfixes if/when bugs are ever discovered.

Something I'd like to see is much clearer documentation of the functions, expectations in error cases, etc. Currently this requires code inspection & experimentation.

vector76 commented 4 years ago

Thank you for your work on jsmn. On my private copy I've added some functionality for easier traversal, in particular all tokens have a "first_child" and a "next_sibling". I added a very simple function to populate these values after parsing (speed is not important to me). And then I can iterate over children of a node by:

for (int child = tokbuf[node].first_child; child != -1; child = tokbuf[child].next_sibling) {
   ...
}

I am not asking anything from the jsmn project but if you would like to see the code I would be happy to share (not that it's difficult).

dominickpastore commented 4 years ago

Strict mode should follow RFC-8259 There are a few standards with slight differences for JSON. I believe it should be explicitly said that we would follow RFC-8259 as THE standard for our JSON parsing.

We needed this for a project I'm on. I went ahead and implemented it. See PR #194. The downside is there are some sacrifices made to non-strict mode (see the PR for details on that).

But I want to share my vision of the 'non-strict mode'. Basically, it's not a feature of JSMN, but rather a side-effect of "what would happen if we #undef some of the parser code". This eliminates certain sanity checks, and parses invalid but nicely looking JSON-ish text, with unquoted strings and the absence of the root object. This has never been a priority for JSMN (in my mind), and I don't think it deserves a separate implementation. If, by undef-ing some code form the RFC-compatible strict implementation we can achieve some syntax sugar for JSON - fine, if not - I'd rather forget about non-strict mode at all.

Additionally, I would rather invest into advanced documented supersets of JSON, like JSON5, rather than in the undocumented, barely known non-strict mode.

@zserge This seems reasonable to me. (And is why I did not worry about sacrificing some non-strict functionality for RFC 8259 compliance in the PR above.)

The rest of the proposed changes sound very nice. If we do add extra utilities, it would be nice to see them in either a separate header or able to be enabled/disabled with a #define.

I'd also like to propose a few additional improvements:

Single-object mode Currently, the parser consumes the entire input string. It parses multiple objects if necessary, returning JSMN_ERROR_PART if the last object is incomplete, even if there are complete objects before it. Sometimes, it would be useful to parse one object at a time. For example, if parsing a stream of JSON objects, there would be no need for the input buffer to end on object boundaries, as long as at least one complete object is present.

A new macro, JSMN_SINGLE, could be created that would cause jsmn to parse one object at a time and ignore the rest of the input.
More granularity for non-strict macros Some applications may have use for certain non-strict features but won't want all of them. For example, someone might want non-strict primitive validation for performance reasons (especially if PR #194 is accepted), but not non-string keys.

Separate macros could enable the various non-strict features, and they would all be enabled by default unless JSMN_STRICT is defined.
Improved docs The README and jsmn website are vague about a lot of the details. For example, the following are not mentioned at all:
- size for an object refers to key/value pairs, not the total number of tokens
- Object keys have size 1 and object values have the key as the parent, not the object
- The end field in a token is the index of the first character after the token, not the last character in the token
- If there is more than one object in the input, they all will be parsed.
These all make sense, but aren't documented. Some well-explained examples would be useful too, especially for traversing arrays and objects.

I plan to work on these in the near future and will create additional pull requests as they are ready, in case there is interest.

Edit: I've submitted pull requests for single-object mode and more granularity for non-strict macros. I'd be interested in updating the README to add more detail once some of the decisions here are finalized.

themobiusproject commented 4 years ago

From @pt300's original comments:

Make strict mode the default

Strict mode should follow RFC-8259

Strict mode following RFC-8259 should definitely be the default and any flags should extend this functionality.

Enable parent links by default

Letting parents be the rule rather than the exception would reduce the flags that end users need to use by default to increase speed and the cost of a small amount of memory with JSMN_LOW_MEMORY being available for smaller devices.

Bit flags for token types

This was the original reason for my fork of jsmn years ago and has become power behind my verification of RFC-8259 as well as rules for extending the base.

Token array traversing

This was the second reason behind my original fork; I needed a faster way to get from one sibling to the next. This, combined with parent, makes going to the next and previous sibling very efficient.

Down to @dominickpastore's additions

Single-object mode

I both agree and disagree with you on this point. It is true that the current implementation parses multiple json objects, but I believe that this should be the exception rather than the rule. Testing against the rules from nst/JSONTestSuite, multiple json objects in a buffer should produce a negative result. So instead of your JSMN_SINGLE macro, I propose JSMN_MULTIPLE or JSMN_MULTIPLE_JSON (for better understanding at first glance).

More granularity for non-strict macros

This would definitely be helpful. I do like your implementation of JSMN_PERMISSIVE_STRINGS, JSMN_PERMISSIVE_PRIMITIVES, and JSMN_PERMISSIVE_KEYS, but then I think that JSMN_PERMISSIVE should be the overarching macro rather than JSMN_NON_STRICT which is conforms more to the current JSMN_STRICT rather than the direction we are heading.

Improved docs

This also needs to be done. When I first started with the examples to try to figure out how to implement jsmn I was lost and it took more trial and error than it should have. I think we should also have a better example implementation. I have an explode function that shows all of the information you can get out of the parser which would show end users all of the variables as well as how to access them that may be useful in the examples. (I should get a PR just for this).

size for an object refers to key/value pairs, not the total number of tokens

First I propose changing size to children. I believe this would help clarify the field. Second, I see a key/value pair as one item, or child, of an object. The value of the key really doesn't have direct relationship with the object, it's only direct relative is the key. As such, I think the current implementation of the object's child being the key and the key's child being the value is the correct implementation.

Along these lines, I think we should also extend jsmnerr_t to include more descriptive errors to better help end users debug the json string. Definitely not as far as I've extended jsmntype_t, but a couple more errors beyond JSMN_ERROR_INVAL and JSMN_ERROR_PART would help clear things up.

I have made some of these changes in PR #197 and I intend to implement a few more. UTF-8 support is high on my list with more granularity following close behind (though the latter would be much faster to implement).

dominickpastore commented 4 years ago

A few comments:

Token array traversing

This was the second reason behind my original fork; I needed a faster way to get from one sibling to the next. This, combined with parent, makes going to the next and previous sibling very efficient.

Adding a field in jsmntok_t seems reasonable, only extra comment I have is this should probably be one of the exclusions for JSMN_LOW_MEMORY.

Single-object mode

I both agree and disagree with you on this point. It is true that the current implementation parses multiple json objects, but I believe that this should be the exception rather than the rule. Testing against the rules from nst/JSONTestSuite, multiple json objects in a buffer should produce a negative result. So instead of your JSMN_SINGLE macro, I propose JSMN_MULTIPLE or JSMN_MULTIPLE_JSON (for better understanding at first glance).

I would have preferred single-parsing to be the default as well, but I am hesitant to change this behavior since it would break backward compatibility. That might sound a bit silly when both you and I have already introduced major pull requests that break backward compatibility, but here is my ideology:

Strict mode should comply with RFC 8259. This necessarily requires breaking backward compatibility for strict mode. But once there, the RFC is law, and compatibility must be preserved.
For non-strict mode: 1) We should define non-strict mode precisely, i.e. describe each deviation from RFC 8259. We can't hope to preserve compatibility when we don't really know what is "compatible." 2) Breaking backward compatibility to get there is okay. 3) Once we define the exact behavior, those features should be protected from future backward incompatible changes.

This would mean that non-strict deviations should be few, limited in scope, and be useful, since they would become fully-supported features. (After all, if users of the library can't rely on a feature to continue working in future versions, it loses most of its utility. Why even bother?)
Breaking backward compatibility for intended behavior that was neither allowed nor prohibited by RFC 8259 should be avoided if at all possible. "Intended behavior" is a bit hard to define since the documentation for jsmn is fairly sparse, but if a behavior is documented in either the README or a test case, I would consider it "intended."

This last one is the category multiple-parsing falls into. Granted, there was only one test case I saw that applies to it, which is test_issue_27. According to that test, "{ \"name\" : \"Jack\", \"age\" : 27 } { \"name\" : \"Anna\", " should yield JSMN_ERROR_PART, but that is enough to confirm multiple-parsing is intended behavior (jsmn considers the input valid but incomplete).

As for RFC compliance of multiple-parsing: RFC 8259 only defines what makes up a single valid JSON text. The context surrounding it is not discussed at all. Thus, it is totally up to jsmn whether it should parse one object at a time or multiple, if it should care whether there is non-JSON garbage following valid objects, etc. (That does not mean JSONTestSuite is incorrect, it's just operating on different assumptions. If the input is supposed to be a single JSON text, input containing multiple objects is, of course, not that.)

More granularity for non-strict macros

This would definitely be helpful. I do like your implementation of JSMN_PERMISSIVE_STRINGS, JSMN_PERMISSIVE_PRIMITIVES, and JSMN_PERMISSIVE_KEYS, but then I think that JSMN_PERMISSIVE should be the overarching macro rather than JSMN_NON_STRICT which is conforms more to the current JSMN_STRICT rather than the direction we are heading.

I used JSMN_NON_STRICT for the main macro because we've all been calling it non-strict mode already, but I think either one is fine as long as it's documented. If there's demand, I can update my pull request.

size for an object refers to key/value pairs, not the total number of tokens

First I propose changing size to children. I believe this would help clarify the field. Second, I see a key/value pair as one item, or child, of an object. The value of the key really doesn't have direct relationship with the object, it's only direct relative is the key. As such, I think the current implementation of the object's child being the key and the key's child being the value is the correct implementation.

I agree that this behavior is a good idea (see discussion under "Arrays and objects as keys" on PR #194), my point was just that it should be documented. Changing the name feels like a step too far though--see notes above on backward compatibility.

themobiusproject commented 4 years ago

I would have preferred single-parsing to be the default as well, but I am hesitant to change this behavior since it would break backward compatibility.

My thought is that since we are changing the default functionality from non-strict to strict that we were breaking backwards compatibility already. If we are trying to follow RFC 8259 as strictly as possible (pun slightly intended), I think we should go all out and follow it as closely as we can. If the JSMN_NON_STRICT flag enables the permissive functionality that people are currently use to, this would just be one of the functions that they lose by default. JSMN_MULTIPLE_JSON would be one of the modular macros that enables functionality beyond the default.

As for test_issue27, I could argue that single primitives weren't intended for strict parsing either.

int test_simple_json(void) {
  check(parse("\"Hello world!\"", 1, 1, JSMN_STRING, "Hello world!", 0));
  check(parse("42", 1, 1, JSMN_PRIMITIVE, "42"));
  check(parse("true", 1, 1, JSMN_PRIMITIVE, "true"));
  return 0;
}

Both the "42" and the "true" return JSMN_ERROR_PART in strict mode. I think that the ability to parse multiple json strings was an unintended "feature" that ended up become the default mode.

For JSMN_NON_STRICT vs JSMN_PERMISSIVE, both work equally well, I agree. Personal preference that really isn't set it stone.

First I propose changing size to children. ...

I agree that this behavior is a good idea (see discussion under "Arrays and objects as keys" on PR #194), my point was just that it should be documented. Changing the name feels like a step too far though--see notes above on backward compatibility.

You're probably right on this one. The other option would be to start with something like the following to have the possibility to phase it out.

typedef struct jsmntok {
  jsmntype_t type;
  int start;
  int end;
  union {
    int size __attribute__((deprecated("use children instead")));
    int children;
  }
#ifdef JSMN_PARENT_LINKS
  int parent;
#endif
} jsmntok_t;

For non-strict mode: 1) We should define non-strict mode precisely, i.e. describe each deviation from RFC 8259. We can't hope to preserve compatibility when we don't really know what is "compatible." 2) Breaking backward compatibility to get there is okay. 3) Once we define the exact behavior, those features should be protected from future backward incompatible changes.

This would mean that non-strict deviations should be few, limited in scope, and be useful, since they would become fully-supported features. (After all, if users of the library can't rely on a feature to continue working in future versions, it loses most of its utility. Why even bother?)

Absolutely. Let us get into specifics over in issue #154.

dominickpastore commented 4 years ago

I would have preferred single-parsing to be the default as well, but I am hesitant to change this behavior since it would break backward compatibility.

My thought is that since we are changing the default functionality from non-strict to strict that we were breaking backwards compatibility already.

That is a fair point.

If we are trying to follow RFC 8259 as strictly as possible (pun slightly intended), I think we should go all out and follow it as closely as we can. If the JSMN_NON_STRICT flag enables the permissive functionality that people are currently use to, this would just be one of the functions that they lose by default. JSMN_MULTIPLE_JSON would be one of the modular macros that enables functionality beyond the default.

As for test_issue27, I could argue that single primitives weren't intended for strict parsing either.
int test_simple_json(void) {
  check(parse("\"Hello world!\"", 1, 1, JSMN_STRING, "Hello world!", 0));
  check(parse("42", 1, 1, JSMN_PRIMITIVE, "42"));
  check(parse("true", 1, 1, JSMN_PRIMITIVE, "true"));
  return 0;
}
Both the "42" and the "true" return JSMN_ERROR_PART in strict mode. I think that the ability to parse multiple json strings was an unintended "feature" that ended up become the default mode.

Okay, I think there are two issues here: 1) Whether multiple-parsing violates RFC 8259 and 2) if it should be the default in jsmn.

For (1): RFC 8259 ultimately just specifies a grammar. Something like {"a": 1, "b": 2} [1, 2, 3] does not follow the grammar, thus it is not a JSON object. But jsmn doesn't parse it as a JSON object. It parses it as two JSON objects, each of which conform to the RFC. While that is perhaps surprising (It's certainly not documented, and I was surprised to learn it behaved this way), it does not violate RFC 8259. The RFC doesn't say JSON objects must always be in a string by themselves, or JSON objects must not touch, or a list of JSON objects must look like \. This behavior should be documented, but there is no conflict with RFC 8259.

For (2): My original argument was that, as a behavior RFC 8259 is neutral on, we should preserve it in the interest of backward compatibility. As you point out, backward compatibility is right out the window when we make strict mode the default. As long as we are changing the defaults for some configuration macros, there's little reason not to make single-parsing the default if that is more useful. So, that becomes the question: is single-parsing a more useful default? I would prefer to have wider input than just @themobiusproject and me on that, but I haven't seen much other activity in a while, so that might be wishful thinking.

In fact, I see three options for this:

Multiple parsing. This is the current behavior. If multiple objects are present in the input string, all will be parsed. If the last object is incomplete, JSMN_ERROR_PART is returned for the whole thing.
Exact single parsing. The input string must contain precisely one RFC 8259 JSON object. If there is anything more present, the result is JSMN_ERROR_INVAL. If I understand, this is the default you are advocating for.
Liberal single parsing. The parser consumes input, returning successfully once it consumes a complete JSON object. Everything left in the string is ignored. If we are changing the default, this is the default I would advocate for: it allows for parsing either single objects or a stream of objects. If you check if there is any unconsumed input after the parse, you can ensure the input was precisely one object. Otherwise, you can get stream semantics: if jsmn_parse() returns successful, simply call it again to get the next object. (This is very close to the way I implemented single parsing in PR #196, although as writing this, I realize I should make a few tweaks to make it the most flexible.)

I am biased toward (3) being the default because the flexibility seems very useful (in fact, for my use case, I have a stream of JSON-formatted messages that need to be processed), but I can see the argument for (2): it's perhaps the least "surprising" way for jsmn to behave.

Also, while one of them has to be the default, I think it is reasonable to have all three be selectable options.

First I propose changing size to children. ...

I agree that this behavior is a good idea (see discussion under "Arrays and objects as keys" on PR #194), my point was just that it should be documented. Changing the name feels like a step too far though--see notes above on backward compatibility.

You're probably right on this one. The other option would be to start with something like the following to have the possibility to phase it out.
typedef struct jsmntok {
  jsmntype_t type;
  int start;
  int end;
  union {
    int size __attribute__((deprecated("use children instead")));
    int children;
  }
#ifdef JSMN_PARENT_LINKS
  int parent;
#endif
} jsmntok_t;

Unfortunately, anonymous unions aren't allowed under C89, or even C99 :( That said, I still don't think it's a major issue as long as size is well-documented.

Token array traversing

This was the second reason behind my original fork; I needed a faster way to get from one sibling to the next. This, combined with parent, makes going to the next and previous sibling very efficient.

Adding a field in jsmntok_t seems reasonable, only extra comment I have is this should probably be one of the exclusions for JSMN_LOW_MEMORY.

I want to add a little more follow up on this: Parent links nearly universally improve parsing speed, so if memory constraints aren't a concern, you basically always want them. Sibling links aren't as universal of a benefit: they don't improve parsing at all and only benefit applications with the right use case (needing to traverse arrays or objects but not needing to visit every token within).

I bring this up because sibling links seem to be a common request, but I think it's partly due to perceived difficulty of implementing array/object traversal. Without arrays/objects as keys and with the current semantics for size, it's actually fairly straightforward:

/* Assume token is a pointer to the current jsmntok_t */
size_t remaining;
for (remaining = 1; remaining > 0; remaining--) {
    remaining += token->size;
    token++;
}
/* token now points to the next sibling */

Of course, the performance benefits are not to be ignored, and I suspect the majority of applications will benefit from them, so sibling links should likely be the default. I just wanted to mention that there are non-memory-constrained applications that would have reason to turn off sibling links.

themobiusproject commented 4 years ago

In fact, I see three options for this:

Multiple parsing. This is the current behavior. If multiple objects are present in the input string, all will be parsed. If the last object is incomplete, JSMN_ERROR_PART is returned for the whole thing.

I think this point is what I disliked about multiple parsing by default; if a later json object fails then everything fails.

Exact single parsing. The input string must contain precisely one RFC 8259 JSON object. If there is anything more present, the result is JSMN_ERROR_INVAL. If I understand, this is the default you are advocating for.

Advocating for? Pretty close to that. If given the choice between 1 and 2, I would choose 2.

Liberal single parsing. The parser consumes input, returning successfully once it consumes a complete JSON object. Everything left in the string is ignored. If we are changing the default, this is the default I would advocate for: it allows for parsing either single objects or a stream of objects. If you check if there is any unconsumed input after the parse, you can ensure the input was precisely one object. Otherwise, you can get stream semantics: if jsmn_parse() returns successful, simply call it again to get the next object. (This is very close to the way I implemented single parsing in PR #196, although as writing this, I realize I should make a few tweaks to make it the most flexible.)

But then you add option 3...

Figuring out if there is more to the buffer that was passed in is simple (comparing the jsmn_parse len argument against tokens[0].end). I think my only aversion to this is the idea that in a stream two primitives next to each other require whitespace between each other where nothing else has that requirement. This isn't a coding issue, just a personal ocd type issue. That personal issue notwithstanding, option 3 would certainly be my preference over options 1 and 2.

Unfortunately, anonymous unions aren't allowed under C89, or even C99 :( That said, I still don't think it's a major issue as long as size is well-documented.

I recognize that anonymous unions are an extension of compilers rather than part of a standard, but I had to throw in one more option before allowing children to pass into the nether.

/* Assume token is a pointer to the current jsmntok_t */
size_t remaining;
for (remaining = 1; remaining > 0; remaining--) {
    remaining += token->size;
    token++;
}
/* token now points to the next sibling */

My goodness, that is much more elegant solution than I came up with. (Heading back to my code to make some changes.)

I would still want to include a next_sibling option (whether default or not though I would likely recommend it), but beyond that I think we should, in a separate optional file, include functions to help use the tokens that jsmn_parse provides. The examples that are currently given are lacking in that department.

To this point, I have an explode function, which I mentioned earlier, that visually helps people understand the data in the tokens and how to utilize them. I wrote it for a coworker to show how to use the tokens. A short example of the output of this (which uses the supplied library.json) is as follows:

    Token |      Type |    Start |      End |   Length | Children |   Parent |  Sibling |     |
----------+-----------+----------+----------+----------+----------+----------+----------+-----+-
        0 |    Object |        0 |      367 |      367 |        8 |       -1 |       -1 | VAL |  {
        1 |    String |        5 |        9 |        4 |        1 |        0 |        3 | KEY |    "name" :
        2 |    String |       13 |       17 |        4 |        0 |        1 |       -1 | VAL |      "jsmn",
        3 |    String |       23 |       31 |        8 |        1 |        0 |        5 | KEY |    "keywords" :
        4 |    String |       35 |       39 |        4 |        0 |        3 |       -1 | VAL |      "json",
        5 |    String |       45 |       56 |       11 |        1 |        0 |        7 | KEY |    "description" :
        6 |    String |       60 |      171 |      111 |        0 |        5 |       -1 | VAL |      "Minimalistic JSON parser/tokenizer in C. It can be easily integrated into resource-limited or embedded projects",
        7 |    String |      177 |      187 |       10 |        1 |        0 |       13 | KEY |    "repository" :
        8 |    Object |      192 |      264 |       72 |        2 |        7 |       -1 | VAL |      {
        9 |    String |      199 |      203 |        4 |        1 |        8 |       11 | KEY |        "type" :
       10 |    String |      207 |      210 |        3 |        0 |        9 |       -1 | VAL |          "git",
       11 |    String |      218 |      221 |        3 |        1 |        8 |       -1 | KEY |        "url" :
       12 |    String |      225 |      259 |       34 |        0 |       11 |       -1 | VAL |          "https://github.com/zserge/jsmn.git"
          |           |          |          |          |          |          |          |     |      },
       13 |    String |      269 |      279 |       10 |        1 |        0 |       15 | KEY |    "frameworks" :
       14 |    String |      283 |      284 |        1 |        0 |       13 |       -1 | VAL |      "*",
       15 |    String |      290 |      299 |        9 |        1 |        0 |       17 | KEY |    "platforms" :
       16 |    String |      303 |      304 |        1 |        0 |       15 |       -1 | VAL |      "*",
       17 |    String |      310 |      318 |        8 |        1 |        0 |       20 | KEY |    "examples" :
       18 |     Array |      321 |      344 |       23 |        1 |       17 |       -1 | VAL |      [
       19 |    String |      328 |      339 |       11 |        0 |       18 |       -1 | VAL |        "example/*.c"
          |           |          |          |          |          |          |          |     |      ],
       20 |    String |      349 |      356 |        7 |        1 |        0 |       -1 | KEY |    "exclude" :
       21 |    String |      360 |      364 |        4 |        0 |       20 |       -1 | VAL |      "test"
          |           |          |          |          |          |          |          |     |  }

This would be part of the better documentation that is sorely needed.

dominickpastore commented 4 years ago

Multiple parsing. This is the current behavior. If multiple objects are present in the input string, all will be parsed. If the last object is incomplete, JSMN_ERROR_PART is returned for the whole thing.

I think this point is what I disliked about multiple parsing by default; if a later json object fails then everything fails.

I agree, I found this to be difficult to work with, even if there are multiple objects to be parsed.

Liberal single parsing. The parser consumes input, returning successfully once it consumes a complete JSON object. Everything left in the string is ignored. If we are changing the default, this is the default I would advocate for: it allows for parsing either single objects or a stream of objects. If you check if there is any unconsumed input after the parse, you can ensure the input was precisely one object. Otherwise, you can get stream semantics: if jsmn_parse() returns successful, simply call it again to get the next object. (This is very close to the way I implemented single parsing in PR #196, although as writing this, I realize I should make a few tweaks to make it the most flexible.)

But then you add option 3...

Figuring out if there is more to the buffer that was passed in is simple (comparing the jsmn_parse len argument against tokens[0].end).

It's almost that simple...if there is trailing whitespace, this won't work. I was thinking about this while I wrote that last night and I thought of two ways to implement it:

The parser consumes all trailing whitespace. The check then becomes parser->pos == len || js[parser->pos] == '\0'. But to prevent applications from having to go into the jsmn_parser struct, it would be better to have a function for this check, maybe call it jsmn_eof() or jsmn_end().
The parser ignores trailing whitespace, but the jsmn_eof() function would consume it before doing the check from (1). I like this way because I think it will lead toward a cleaner implementation. With no need to consume trailing whitespace, jsmn_parse() can return as soon as it closes an array or object with no parent or parses a string or primitive with no parent.

I think my only aversion to this is the idea that in a stream two primitives next to each other require whitespace between each other where nothing else has that requirement. This isn't a coding issue, just a personal ocd type issue. That personal issue notwithstanding, option 3 would certainly be my preference over options 1 and 2.

It's worth mentioning, this isn't the only sticking point for unenclosed primitives. They are also troublesome for incremental parsing. Suppose we have the following two JSON entities:

"abcdefgh"
1.23456789

If they happen to get split when an application is doing incremental parsing:

"abcde
1.2345

the first results in JSMN_ERROR_PART, but the second returns successfully with a truncated primitive. Not ideal, but I don't think there's a great way around the issue. The best I can come up with is simply making sure that the next call to `jsmn_parse() recognizes that it's a split primitive and fixes the bad token. (That being said, I think this is not likely to be a problem in practice. While primitives can be unlimited length, they typically are fairly short, so they should generally fit into any input buffer entirely. But this should be documented.)

/* Assume token is a pointer to the current jsmntok_t */
size_t remaining;
for (remaining = 1; remaining > 0; remaining--) {
    remaining += token->size;
    token++;
}
/* token now points to the next sibling */
My goodness, that is much more elegant solution than I came up with. (Heading back to my code to make some changes.)

I would still want to include a next_sibling option (whether default or not though I would likely recommend it),

I think making it the default is reasonable. I think a lot of applications will benefit from the performance boost. And as straightforward as the above is, it's still not as easy as token = token->next_sibling.

Also worth mentioning: traversing backward is not so easy. It's not as common a need, but should there also be token->prev_sibling?

but beyond that I think we should, in a separate optional file, include functions to help use the tokens that jsmn_parse provides. The examples that are currently given are lacking in that department.

I agree. (This is also one of the things originally proposed at the top of this thread, of course.) That snippet above is straightforward, but it's not immediately obvious (though it might be more obvious with better documentation for size).

I'd like to see the following in a jsmn_utils.c/.h:

Next sibling
Key lookup in object
De-escape string
Determine if primitive token is int, double, true, false, or null
Primitive token to int (What about other sizes of int and unsigned?)
Primitive token to double

    Token |      Type |    Start |      End |   Length | Children |   Parent |  Sibling |     |
----------+-----------+----------+----------+----------+----------+----------+----------+-----+-
        0 |    Object |        0 |      367 |      367 |        8 |       -1 |       -1 | VAL |  {
        1 |    String |        5 |        9 |        4 |        1 |        0 |        3 | KEY |    "name" :
        2 |    String |       13 |       17 |        4 |        0 |        1 |       -1 | VAL |      "jsmn",
        3 |    String |       23 |       31 |        8 |        1 |        0 |        5 | KEY |    "keywords" :
        4 |    String |       35 |       39 |        4 |        0 |        3 |       -1 | VAL |      "json",
        5 |    String |       45 |       56 |       11 |        1 |        0 |        7 | KEY |    "description" :
        6 |    String |       60 |      171 |      111 |        0 |        5 |       -1 | VAL |      "Minimalistic JSON parser/tokenizer in C. It can be easily integrated into resource-limited or embedded projects",
        7 |    String |      177 |      187 |       10 |        1 |        0 |       13 | KEY |    "repository" :
        8 |    Object |      192 |      264 |       72 |        2 |        7 |       -1 | VAL |      {
        9 |    String |      199 |      203 |        4 |        1 |        8 |       11 | KEY |        "type" :
       10 |    String |      207 |      210 |        3 |        0 |        9 |       -1 | VAL |          "git",
       11 |    String |      218 |      221 |        3 |        1 |        8 |       -1 | KEY |        "url" :
       12 |    String |      225 |      259 |       34 |        0 |       11 |       -1 | VAL |          "https://github.com/zserge/jsmn.git"
          |           |          |          |          |          |          |          |     |      },
       13 |    String |      269 |      279 |       10 |        1 |        0 |       15 | KEY |    "frameworks" :
       14 |    String |      283 |      284 |        1 |        0 |       13 |       -1 | VAL |      "*",
       15 |    String |      290 |      299 |        9 |        1 |        0 |       17 | KEY |    "platforms" :
       16 |    String |      303 |      304 |        1 |        0 |       15 |       -1 | VAL |      "*",
       17 |    String |      310 |      318 |        8 |        1 |        0 |       20 | KEY |    "examples" :
       18 |     Array |      321 |      344 |       23 |        1 |       17 |       -1 | VAL |      [
       19 |    String |      328 |      339 |       11 |        0 |       18 |       -1 | VAL |        "example/*.c"
          |           |          |          |          |          |          |          |     |      ],
       20 |    String |      349 |      356 |        7 |        1 |        0 |       -1 | KEY |    "exclude" :
       21 |    String |      360 |      364 |        4 |        0 |       20 |       -1 | VAL |      "test"
          |           |          |          |          |          |          |          |     |  }

This would be part of the better documentation that is sorely needed.

That is incredibly useful. That should absolutely be part of the documentation.

themobiusproject commented 4 years ago

It's almost that simple...if there is trailing whitespace, this won't work. I was thinking about this while I wrote that last night and I thought of two ways to implement it:

Leading whitespace shouldn't be an issue either. I was planning on implementing it so when count > 0 and it reached the root level again jsmn_parse returns. If there is whitespace afterwards then it just becomes leading whitespace for the next object which passes.

Also worth mentioning: traversing backward is not so easy. It's not as common a need, but should there also be token->prev_sibling?

For the previous sibling I go to the toksuper's first child and follow token->next_sibling until the next sibling is the token I started at. This is essentially what I do when I am figuring out next sibling anyways.

I'd like to see the following in a jsmn_utils.c/.h:

Next sibling

Key lookup in object

De-escape string

Determine if primitive token is int, double, true, false, or null

Primitive token to int (What about other sizes of int and unsigned?)

Primitive token to double

I have a good key lookup function that I would be happy to share and compare with what you have. I am getting utf8 implemented to also go from the escaped \u format to utf{8,16,32} and back again.

My parser flags tokens as JSMN_PRI_LITERAL for true, false, and null, JSMN_PRI_MINUS if it is a negative number, JSMN_PRI_DECIMAL if it has a decimal, and JSMN_PRI_EXPONENT if it has an exponent. This is beyond its default JSMN_PRIMITIVE flag. This is also why I love jsmntype_t being bit flags.

That is incredibly useful. That should absolutely be part of the documentation.

I'm glad you like it. It has also really helped me debug jsmn as I have been modifying it.

themobiusproject commented 4 years ago

It's almost that simple...if there is trailing whitespace, this won't work. I was thinking about this while I wrote that last night and I thought of two ways to implement it:

Leading whitespace shouldn't be an issue either. I was planning on implementing it so when count > 0 and it reached the root level again jsmn_parse returns. If there is whitespace afterwards then it just becomes leading whitespace for the next object which passes.

Thinking about it more I see your point; if you reach the end of valid json object but there is only whitespace left in the buffer, then you could misinform an application saying that there is more data to be parsed.

To fix this issue I put a loop after a successful json object has been found right before the return count:

  while (parser->pos < len && isWhitespace(js[parser->pos])) {
    parser->pos++;
  }

rustyrussell commented 3 years ago

Ack on some single object mode! We're abusing it currently by setting toks[0].type = JSMN_UNDEFINED; then checking toks[0].type != JSMN_UNDEFINED && toks[0].end != -1) if we get JSMN_ERROR_PART.

This "works" at cost of wasted cycles if there is more than one object, since we discard the buffer covered by the first token and try again (this is a JSON command stream).