Open pt300 opened 5 years ago
OK, at this point I believe it is not only beneficial but actually necessary in order to deal properly with fixing strict mode. Because the code for both modes is shared, it is easy to break one mode when trying to improve the other.
Actually, I now think it wouldn't be such a bad idea to split jsmn into strict and non-strict versions, still selectable with a define. This should make fixing bugs for each mode a lot easier.
I would hesitate to split the code for strict and non-strict mode. At that point, you start to violate the DRY (i.e. don't repeat yourself) principle. After the split, they would certainly have a certain amount of duplicated code, and then, that means making changes in two places whenever updating the common code. Suppose you accidentally make a mistake in one and not the other--now you have an undocumented difference between them that might go unnoticed.
Instead, I would lean toward making non-strict into a few simple relaxations of requirements that might be useful, such as allowing flexible primitives and allowing primitives to be keys. (Similar to what @zserge suggested in another comment). After all, this is primarily a JSON parsing library. Parsing things that are not JSON is icing on the cake, but it isn't the main goal here. Non-strict mode should not make maintaining the code too complicated.
In fact, those two relaxations are all I would provide in non-strict mode. To be more specific:
- Allow primitives to be made of any character that doesn't have special meaning in JSON (whitespace and any of `{}[],:"`)
- Allow primitives to be object keys, not just strings
Allowing objects and arrays as keys starts adding extra complication (see issue #193 which I just created a little while ago), so I lean toward not even allowing that. It's useful in some cases, sure, but not valid JSON, so no reason to go out of the way to support it.
Non-strict mode also currently allows strings and primitives as the root token, but since that is allowed under RFC-8259, I would say that should simply become part of strict mode.
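The first relaxation above (flexible primitives) comes down to a character-class test. A minimal sketch, assuming hypothetical helper names `is_json_special` and `relaxed_primitive_len` that are not part of jsmn's API:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of jsmn: true for characters that have
 * special meaning in JSON (whitespace and any of {}[],:"). */
bool is_json_special(char c) {
    switch (c) {
    case ' ': case '\t': case '\r': case '\n': /* JSON whitespace */
    case '{': case '}': case '[': case ']':
    case ',': case ':': case '"':
        return true;
    default:
        return false;
    }
}

/* Length of the relaxed primitive starting at s: every character up to
 * the first special character or the end of the string. */
size_t relaxed_primitive_len(const char *s) {
    size_t n = 0;
    while (s[n] != '\0' && !is_json_special(s[n])) {
        n++;
    }
    return n;
}
```

Under this rule a primitive simply ends at the first special character, so the scanner needs no knowledge of numbers, `true`, `false`, or `null`.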
In fact, those two relaxations are all I would provide in non-strict mode. To be more specific:
- Allow primitives to be made of any character that doesn't have special meaning in JSON (whitespace and any of `{}[],:"`)
- Allow primitives to be object keys, not just strings
I certainly agree that objects and arrays should not be allowed as keys and your issue #193 perfectly states why; it leads to too much ambiguity with the tokens we have.
Beyond these two relaxations we also already have the ability to parse multiple json objects in a single string.
The last rule that I can think of now is being able to have key/value pairs and arrays at the root level. If you have a colon after a token, then the previous token becomes a key and the next token becomes a value. You can make sure you don't have two keys following each other: `a : b : c` would throw an error.
Commas would become ambiguous at the root level, as `a, b, c, d` and `a b c d` would produce the same tokens (start and end notwithstanding). Commas would only throw errors if you have two commas in a row.
I think that the target I am shooting for is root-level lists like the following, with whatever white space the user desires.
a b c d
a : b c : d
a : b c d
With that and the ability to have primitives as keys and made of any non-special character, I think we would match but document what we currently have.
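The root-level rules described above can be sketched as a small state machine. This is only an illustration under stated assumptions: `check_root` and `enum last_kind` are hypothetical names, the input contains only primitives, colons, and commas, and rejecting a comma or end of input right after a colon is my own assumption rather than something stated in this thread.

```c
#include <ctype.h>
#include <stddef.h>

/* What the previous token was (hypothetical names for this sketch). */
enum last_kind { L_NONE, L_ITEM, L_COLON, L_VALUE, L_COMMA };

/* Returns 0 if the root-level sequence is accepted, -1 otherwise. */
int check_root(const char *s) {
    enum last_kind last = L_NONE;
    size_t i = 0;
    while (s[i] != '\0') {
        char c = s[i];
        if (isspace((unsigned char)c)) {
            i++;
        } else if (c == ':') {
            /* Only a plain item can become a key. In "a : b : c" the
             * second colon follows a value, so it is rejected. */
            if (last != L_ITEM) return -1;
            last = L_COLON;
            i++;
        } else if (c == ',') {
            if (last == L_COMMA) return -1; /* two commas in a row */
            if (last == L_COLON) return -1; /* assumption: key needs a value */
            last = L_COMMA;
            i++;
        } else {
            /* Consume one primitive: runs until whitespace, ':' or ','. */
            while (s[i] != '\0' && !isspace((unsigned char)s[i]) &&
                   s[i] != ':' && s[i] != ',') {
                i++;
            }
            last = (last == L_COLON) ? L_VALUE : L_ITEM;
        }
    }
    /* Assumption: a trailing colon with no value is an error. */
    return last == L_COLON ? -1 : 0;
}
```

Note that a single comma changes nothing in this sketch, which is the ambiguity mentioned above: `a, b, c, d` and `a b c d` pass through the same states apart from the comma itself.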
Do we want to allow key/value pairs in arrays? This would make objects and arrays the same thing.
I had this typed already, but we had a power outage last night before I could submit it.
Since my previous comment here predates my pull request, allow me to revise my list of deviations. These are the ones I implemented in PR #194:
{}[],:"
)\0
and they may not end with a single backslash)
(3) is the new one. This actually wasn't a feature at all before, but it seemed reasonable since non-strict allows something similar for primitives. This way, applications can implement their own escape sequences or allow tabs, newlines, or other control characters in strings.
Beyond these two relaxations we also already have the ability to parse multiple json objects in a single string.
I discussed this a lot in #159, but I don't see this as a deviation from RFC 8259. Thus, even if we make single-parsing the default, I see multiple-parsing as totally separate from strict/non-strict mode.
The last rule that I can think of now is being able to have key/value pairs and arrays at the root level. If you have a colon after a token, then the previous token becomes a key and the next token becomes a value. You can make sure you don't have two keys following each other: `a : b : c` would throw an error.
Commas would become ambiguous at the root level, as `a, b, c, d` and `a b c d` would produce the same tokens (start and end notwithstanding). Commas would only throw errors if you have two commas in a row.
I think that the target I am shooting for is root-level lists like the following, with whatever white space the user desires.
a b c d
a : b c : d
a : b c d
The current version of jsmn allows these sorts of things at the root level, but personally, I think it shouldn't. I don't think there are many use cases for this. If you want key/value pairs, just use an object, and if you want a list of things, just use an array. (Note that I see multiple objects in sequence at the root level differently from arrays: they are separate, disconnected objects. E.g. you have a microcontroller that receives commands formatted as JSON objects. Multiple objects back to back aren't a list of related things, they are simply separate commands.)
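To illustrate the "separate, disconnected objects" view, here is a rough sketch of how an application might frame back-to-back commands before handing each one to a parser. It is independent of jsmn; `first_object_end` is a hypothetical helper that only tracks brace depth and string boundaries and does no validation.

```c
#include <stddef.h>

/* Hypothetical helper, not part of jsmn. Scans buf[0..len) and returns
 * the number of bytes up to and including the '}' that closes the first
 * top-level object, or 0 if no complete object is present yet. */
size_t first_object_end(const char *buf, size_t len) {
    int depth = 0;
    int in_string = 0;
    for (size_t i = 0; i < len; i++) {
        char c = buf[i];
        if (in_string) {
            if (c == '\\') {
                i++; /* skip the escaped character */
            } else if (c == '"') {
                in_string = 0;
            }
        } else if (c == '"') {
            in_string = 1;
        } else if (c == '{') {
            depth++;
        } else if (c == '}') {
            /* Braces inside strings never reach this branch. */
            if (depth > 0 && --depth == 0) {
                return i + 1;
            }
        }
    }
    return 0;
}
```

Each call yields one command; the caller advances the buffer by the returned length and repeats, so the objects never need to be wrapped in an array.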
Do we want to allow key/value pairs in arrays? This would make objects and arrays the same thing.
Similar would be allowing unpaired elements in objects. (They wouldn't quite be the same thing unless we modified both.) I think doing that makes a little more sense. I can't quite put my finger on why, but unpaired elements in objects just feels less wrong than key/value pairs in arrays.
I have something like this in the `valueless-keys` branch in my fork. I haven't submitted a pull request yet because I already have a few outstanding. It allows objects to contain keys with no value. Like this:
{
"a": 1,
"b",
"c": 2
}
Arguably, keyless values could be more useful than valueless keys, but valueless keys are much simpler to implement (e.g. if the parser sees an array opening after an object opening, with valueless keys, it can reject right away. With keyless values, it can't know whether an array is allowed until it sees whether a colon or comma/object closing follows). So, in the interest of minimizing maintenance headache, I stuck with valueless keys.
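The single-token decision that makes valueless keys simpler can be sketched as follows. This is not the code from the `valueless-keys` branch; `enum kind` and `count_valueless_keys` are hypothetical, and the input is assumed to be the already-tokenized members of one object, with each nested container collapsed into a single entry.

```c
#include <stddef.h>

/* Hypothetical token kinds for this sketch; these are not jsmn's types. */
enum kind { K_STRING, K_PRIMITIVE, K_ARRAY, K_OBJECT, K_COLON, K_COMMA };

/* Counts keys that have no value among the tokens between '{' and '}'.
 * Returns -1 if the member list is malformed. */
int count_valueless_keys(const enum kind *t, size_t n) {
    int valueless = 0;
    size_t i = 0;
    while (i < n) {
        /* A member must start with a string key, so an array or object
         * in this position is rejected at once, with no lookahead. */
        if (t[i] != K_STRING) return -1;
        i++;
        if (i < n && t[i] == K_COLON) {
            i++;
            if (i >= n || t[i] == K_COLON || t[i] == K_COMMA) return -1;
            i++; /* consume the value */
        } else {
            valueless++; /* key followed by ',' or the end of the object */
        }
        if (i < n) {
            if (t[i] != K_COMMA) return -1;
            i++;
            if (i == n) return -1; /* trailing comma */
        }
    }
    return valueless;
}
```

With keyless values the first check could not be made so early: an array after the object opening might be a legal unpaired value, and that is only known once the following colon, comma, or closing brace has been seen.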
Since each non-strict feature adds to the maintenance burden, we should also consider whether this is really a useful feature. A similar result could be accomplished with `{"a": 1, "b": null, "c": 2}`, but that's ugly. Where I see this being useful is something like this:
{
"from": "alice@example.com",
"to": "bob@example.com",
"subject": "An important message",
"urgent",
"body": "Lorem ipsum dolor sit amet..."
}
Is this useful enough to be worth adding as a non-strict feature?
I believe it would be really beneficial to create a proper definition of how non-strict mode works. I have looked at the explanation at https://zserge.com/jsmn.html, but even the example given there doesn't really fit the rules stated above it. The idea is in there somewhere, but it's not as obvious as it should be, and that can and does cause problems when looking for bugs in non-strict mode.