yaml / libyaml

Canonical source repository for LibYAML
http://pyyaml.org/wiki/LibYAML
MIT License
951 stars 316 forks source link

Fix quadratic behavior in `yaml_parser_fetch_more_tokens` #286

Closed dtolnay closed 4 months ago

dtolnay commented 6 months ago

The following program reproduces scan time that is quadratic in the size of the input document.

#include <ctype.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <yaml.h>

int main(int argc, char **argv) {
  if (argc != 2) {
    fprintf(stderr, "usage: %s 100000\n", argv[0]);
    return 1;
  }

  errno = 0;
  char *rest;
  const long n = strtol(argv[1], &rest, 10);
  if (errno != 0 || rest == argv[1] || isspace(*argv[1]) || *rest != '\0') {
    fprintf(stderr, "invalid argument\n");
    return 1;
  }

  unsigned char *yaml = (unsigned char *)malloc(n);
  if (!yaml) {
    return 1;
  }

  yaml_parser_t parser;
  if (!yaml_parser_initialize(&parser)) {
    free(yaml);
    return 1;
  }

  memset(yaml, '[', n);
  yaml_parser_set_input_string(&parser, yaml, n);

  int finished = 0;
  do {
    yaml_token_t token;
    if (!yaml_parser_scan(&parser, &token)) {
      break;
    }
    finished = token.type == YAML_STREAM_END_TOKEN;
    yaml_token_delete(&token);
  } while (!finished);

  yaml_parser_delete(&parser);
  free(yaml);
  return !finished;
}

Before: With each doubling of input size, the runtime increases by a factor of four.

`time ./repro 10000`   0m0.169s
              20000    0m0.580s
              40000    0m2.244s
              80000    0m8.940s
              160000   0m39.625s

After: Runtime is linear in the input size. Program can handle inputs which are two orders of magnitude larger than before.

`time ./repro 160000`  0m0.490s
              1600000   0m4.710s
              16000000   0m46.292s
perlpunk commented 6 months ago

Thanks! This reduces the runtime. I'm don't know the simplekeys code completely, so I would like to have a closer look. There are still cases where the runtime gets quadratic (when it's not simply {{{), I'm currently running some benchmarks.

I'm wondering how this ABI change can be realeased. I guess we would need at least a new minor version and change the so file version in order to not break existing bindings?

dtolnay commented 4 months ago

To also handle [{"":[[{"":[[{"":[..., a different solution is going to be needed.