Manifest vNEXT (trie) - Githubissues

gsnedders commented 5 years ago

Posting this as an issue, as it's really just somewhere to get a bunch of opinions on vague ideas rather than anything concrete:

We have a number of things in the manifest currently which really exist to deal with older generated manifests (which contain a number of bugs), see especially comments containing MANIFEST6 in manifest.item.

I think it's time we start considering what changes we want to make if we're going to bump the version, given we can certainly go further than just forcing regeneration to get rid of any old bugs.

There are some reasons why moving away from JSON seems tempting, but I don't know how justified they are (on my home computer, parsing/serializing the JSON with ujson takes ~300ms, and ~500ms with json). Older checkouts of wpt would fail horribly if they couldn't read the MANIFEST.json (anything prior to https://github.com/web-platform-tests/wpt/pull/10323 landing on 5 Apr 2018), but presumably we wouldn't use the .json extension if we weren't using JSON. Some people have spoken about protobuf or msgpack, and both are somewhat reasonable suggestions if we want to move away from JSON.

I would be pretty tempted to move tests to being stored in a path trie (so instead of tests["reftest"]["css/CSS2/bidi-005.xht"] we'd have tests["reftest"]["css"]["CSS2"]["bidi-005.xht"]), so that we can entirely skip directories (e.g., for ./wpt run --include html/ firefox we don't have to iterate through the entire dict of tests to find only those included). If we moved away from JSON, we might be able to do some sort of lazy parsing or skip over dicts we don't care about and avoid the not insubstantial cost of allocating Python data structures for them.

It might also be nice to use some format whereby we can simply append to the file for a small number of changes, rather than having to re-serialize the entire file, but that might not be worth it if we can decrease the cost of serialising.

jgraham commented 5 years ago

I believe there are concerns about the format not being text-based; both Chrome and Servo check the manifest in to their repositories, so having something binary is probably bad.

Hexcles commented 5 years ago

If we moved away from JSON, we might be able to do some sort of lazy parsing or skip over dicts we don't care about and avoid the not insubstantial cost of allocating Python data structures for them.

I've seen a few serialization formats that can skip/lazy-parse predefined fields, but not entries in a map. It might be technically possible to do that in length-prefixed formats (like protobuf), but it usually requires a custom loader.

I'd hope to use a widely supported industry format (absolutely do not roll our own). Selfishly speaking, Go binding is critical :)

gsnedders commented 5 years ago

@Hexcles how do you feel about binary formats, given Chromium has it checked in to its repo?

Hexcles commented 5 years ago

@gsnedders it'd be indeed easier to directly check in text files, but Chromium does have infra in place to deal with binary dependencies, so I can be persuaded to write some more code to support binary metadata if the benefit is overwhelming.

I'm not yet promoting protobuf at this stage, but just want to point out that protobuf has a nice text format :)

jgraham commented 5 years ago

Protobuf would be the first binary dependency required by all users, I think. That isn't necessarily fatal but does have some non-trivial cost associated with it.

gsnedders commented 5 years ago

FWIW: I don't think the 300ms/500ms is the end of the world if we can make everything else quicker. It's not ideal, but it's not a huge cost.

jgraham commented 5 years ago

Note there's also a jsshell use case where cutting into that 300ms might be siginificant; see https://bugzilla.mozilla.org/show_bug.cgi?id=1515039 (it seems like we'd really like to make the time to update and extract all jsshell tests less than 1000ms, and even that may cause complaints)

gsnedders commented 5 years ago

I mean I expect if we RIIR we could manage to get quick enough for that… But again that's back to having a binary dependency, and I don't really want to maintain two implementations to avoid that.

jgraham commented 5 years ago

Right, it's hard to see how to meet that target, but if we do want to then part of it will likely be making the manifest faster to parse. But from my point of view, there isn't much difference between requiring protobuf and requiring a Rust rewrite; both would be a binary dep for ./wpt and require some special handling for gecko.

gsnedders commented 5 years ago

Oh, one problem we currently have is how we deal with character encoding for file paths. Being on Python 2.7 as long as we're encoding them as JSON we need them to be a sequence of Unicode scalar values (i.e., have no surrogate pairs), though all the Python 2 APIs expect byte sequences (and remember paths are arbitrary 8-bit words on POSIX systems and 16-bit words on Win32). We currently waste quite a lot of time converting from UTF-8 encoded JSON to 16/32-bit code units for the unicode type to UTF-8 bytes.

jgraham commented 5 years ago

We 100% can insist that all paths can be represented as plain UTF8 i.e. without lone surrogates.

gsnedders commented 5 years ago

I think that's not unreasonable, though I do worry a bit about the cost of UTF-8 validating everything (but I think we have to, otherwise json.dump will throw).

Anyhow, as far as the manifest format goes, it sounds like we may as well stay with JSON, given the format isn't the fundamental performance problem, and if we're adding Python extensions that are required we may as well rewrite the manifest in C/Rust/Cython/whatever and get a similar perf gain.

foolip commented 5 years ago

@gsnedders do you want to turn this into a PR, or is the proposal still not well formed enough?

gsnedders commented 5 years ago

We've still been iterating a fair bit on the manifest-path-trie branch, and exploring different options; I don't think it really makes sense to write a fully-formed RFC at this point.

gsnedders commented 4 years ago

Another thing that has come up repeatedly is the use of test type at the top level of the manifest; we store the test type in the path_hash to speed up looking up a given path, but this essentially means we end up duplicating all the paths another time.

It's not clear what advantage we get by having this indirection, and it has clear costs (it bloats the size of the JSON significantly and it means we have to keep the two data structures in sync).

The only real advantage is when we want to filter by test-type, but this seems like a relatively rare use-case. If we're filtering only by test-type, we're going to be iterating through thousands of tests regardless of anything else and it's not clear the cost of iterating through all tests would be significant. If we're filtering by test-type and test-path, the path is a much more unique string, so practically we're going to search by path first then filter for type.

IIRC, @jgraham had opinions here.

jgraham commented 4 years ago

I don't recall what my opinion was, but it's not a rare thing to filter by test type; it's the way that our CI works and gecko's CI works. And I think there's a bunch of code that assumes that's a fast thing to do.

That said it's definitely possible that it's the wrong data model and that the extra bloat is a higher cost than iterating all the things.

gsnedders commented 4 years ago

I mean when we're filtering by test type it's normally when we're running all tests, and therefore the iteration cost is less significant?

As for the jsshell use-case: I think really for that we need a separate section listing jsshell tests? We still need to check everything compared with the mtime cache in case tests have become new jsshell tests, so we still have a relatively high iteration cost, especially on Windows.

stephenmcgruer commented 4 years ago

The related RFC (and the implementation) landed, so I think can be closed.

web-platform-tests / rfcs

Manifest vNEXT (trie) #20