zserge / jsmn

Jsmn is a world fastest JSON parser/tokenizer. This is the official repo replacing the old one at Bitbucket
MIT License
3.68k stars 783 forks source link

type int problematic for size, start and end indicators #44

Open zserge opened 9 years ago

zserge commented 9 years ago

Large strings produce weird results when the range of int is exceeded due to undefined behaviour regarding integer overflow.

Could you either change the type to unsigned int or make the type configurable (a la

#!c
#ifndef INDEX_TYPE
#define INDEX_TYPE int
#endif

typedef struct {
    jsmntype_t type;
    INDEX_TYPE start;
    INDEX_TYPE end;
    INDEX_TYPE size;
#ifdef JSMN_PARENT_LINKS
    INDEX_TYPE parent;
#endif
} jsmntok_t;

zserge commented 9 years ago

I don't think unsigned int is an option, because we need a special value for "undefined", which is now -1. I wonder if there is any practical platform where this problem matters? E.g. on most 32-bit systems int is 2^32, which should work fine for JSONs less than 2GB. On smaller systems like AVRs int is 16 bit addressing up to 32KB strings, which is still a lot for a tiny 8-bit CPU with a few kilobytes of RAM.

joeljk13 commented 9 years ago

You could make it an ssize_t. It's signed, so -1 still works, but it has a much larger range on many systems.

I think you could also use pointers, rather than indexes (using NULL when you would've used -1), but this would break backwards compatibility, so I think ssize_t is the best option.

GBuella commented 8 years ago

First of all, this is an awesome little library, thanks Zserge. I also don't care much for non ascii strings, and it is obvious that this library wouldn't help with that, so that's really not a problem. But having a signed overflow problem is does not seem like an obvious thing while looking at the documentation, or the code. I would even say it is misleading. If the code can ever fail, that is a bug, period. According to Murphy's law of software development, if it can ever fail, it will fail. If it is ever going to be a security problem, and people see it was known and ignored - those people are going to be angry. Please, at least state it in documentation. By the way, I don't see a problem with using size_t instead of int. The special value can be SIZE_MAX instead of -1 -- not that SIZE_MAX definitely can not be a valid index in any array. The largest theoretical valid index is SIZE_MAX-1 . The current state of things is pretty ugly: size_t len argument used in json_parse, unsigned int pos used for indexing, signed int used for returning results. This is very incosistent, ugly. Please consider it, it would not take away any of the simplicity.

Also, joeljk13: ssize_t is a type in POSIX C, while size_t and SIZE_MAX are available in freestanding ISO C as well -- thus pretty much everywhere.

GBuella commented 8 years ago

I just tried what happens, if json_pares is given a too-large string, an output from a test program:

token[0] start: 0 end: -2147482627 size: 2 token[1] start: 2 end: 6 size: 1 token[2] start: 9 end: -2147482646 size: 0 token[3] start: -2147482642 end: -2147482637 size: 1 token[4] start: -2147482634 end: -2147482629 size: 0

Now, as a simple hypothetical example, image a software, that just reads data from a network socket, tries to json_parse that data using jsmn. An attacker somehow manages to feed over 2 gigabytes of data, so the attacker can creates a token that starts at index -2 billion. While trying to access this token, the said software segfaults. I know, this is not something that is going to happen a lot, it's just the first thing I can come up with in 10 minutes.

schneidersoft commented 6 years ago

Simple solution: Don't allow ridiculously large input?

If you are reading data from a socket and the client is giving you 2+Gb of data... that connection needs to close();

The maximum size of any input is limited by the maximum index we can hold in token.start/end +1 for adding this to the documentation. +1 for using unsigned and say (unsigned)-1 for undefined (solves the problem of negative index)

Having the size configurable is a plus, but even using size_t does not solve the problem; theoretically even size_t will eventually wrap.

On the other hand, since the input buffer is held in memory in it's entirety the system will run out of memory first.