RFC 49: Python 3 migration for wptserve #49

Closed Hexcles closed 4 years ago

Tagging people who have been involved with Python 3 migration: @annevk @jgraham @stephenmcgruer @svillar @ziransun

I strongly suspect you're going to run into a lot of pernicious, annoying issues with this. You probably want to instead be stricter about always using bytes; any time I, and many other people with similar problems, have tried to be lazy about bytes vs strings, it's bitten us hard and been extremely annoying and frustrating.

I don't think there's a way to make Python3 emit bytes by default for untagged strings, but you can force Python2 to allow the b modifier, same as Py3. Unfortunately, that change causes untagged strings to switch to unicode, same as Py3, so it would imply some migration costs.

Thanks for the feedback, Tab. Yes I'm aware of the potential issues (see the risks) and am trying to make conscious and careful trade-offs here. That said, I could well be missing some important points, which is exactly why I think the RFC process would be very valuable here!

For the core use case (reading and serving bytes as is on the wire), my investigation suggests that most custom handlers would work without modifications. wptserve is different from bikeshed as it doesn't interact with "text" much, so we could get away with treating arbitrary data as latin-1-encoded strings, even though the semantics are wrong (printing, counting the length, etc. are broken). (Of course, we can't be sure until we try.)

I was in favour of using bytes everywhere as I believed (and still believe) that'd be the "correct" way, e.g. in https://github.com/web-platform-tests/wpt/pull/13246 , but I've become convinced that it'd require a herculean effort to migrate the hundreds of custom handlers and prevent future regression. As you said, there's no way to make unprefixed strings bytes in Py3 and the only way out would be to use the b prefix (which also works / is a no-op in Python 2) when needed (you can't just add b to every string literal). I don't think it's possible to check the usage programatically using lint or more advanced static analysis, which means that humans need to examine and modify the existing code (whereas in the recommended approach, we could fix just the handlers that throw and the tests that produce different results). If we didn't need to consider Python 2 compatibility, I'd probably still prefer this approach.

On top of the migration/maintenance cost, I also considered how the standard library and a bunch of popular HTTP-related libraries handle headers. All of them chose to use str instead of trying to ensure bytes. I feel like I was trying to fight the standard library, wrapping the API calls or even as @jgraham joked forking it.

I need to look more at this on Monday, but my initial reaction is that I'm very concerned with the proposal. I think using the str type is going to cause a bunch of hard-to-understand issues.

My proposal to "fork the stdlib" wasn't a joke, but also wasn't as far-reaching as that implied; it would specifically mean not using the stdlib parts we currently use for reading and writing HTTP requests specifically.

If we allow strings these "technically byte" APIs better throw for code points higher than U+00FF. I would favor using bytes though so it's clearer what's going on. (There are a lot of tests where the specifics around this matter.)

If we allow strings these "technically byte" APIs better throw for code points higher than U+00FF.

That's a good point. Eventually this would throw when calling encode('latin-1'): UnicodeEncodeError: 'latin-1' codec can't encode character '\u01ff' in position 0: ordinal not in range(256) But we could add assertions to make this easier to understand.

There are a lot of tests where the specifics around this matter.

If you look at these examples, would those suffice the testing purpose? My cursory scan of the code base suggests that the second form is the most common. Or does it confuse you cognitively that this is a text string in Python 3?

I'm probably not the best person to ask that question as I'm intimately familiar with HTTP and text encodings. ("latin-1" might be confusing to some as on the web it means windows-1252.)

OK, I've read this in more detail now, so I think I understand the tradeoffs being made, although I wouldn't say I yet have a good feeling for whether they're the right ones to make. The suggested approach is the exact opposite to the one we've been taking with all the internal code (where we have tried to be clear about what's Text and what's Bytes and use those types consistently between 2 and 3 where possible). But it's of course possible that the different constraints here wind up giving us a different optimal solution. Having said that I also think that our requirements are very different to normal HTTP libraries, so the prior art there isn't as relevant as it might seem.

I wonder if we can enumerate all the API surface we have that deal with strings and are interfacing with the stdlib. I don't think there are so many places where we directly expose the stdlib and in those places it may be possible to wrap-or-replace. So it at least seems conceviable that the concern that we might be able to mediate access and provide a better API.

I also don't think that linting for bytes-vs-text is the only possible way to enforce that we get the correct types; runtime asserts for example would cause tests to error and we could ensure that we get a useful output.

I wonder if we can enumerate all the API surface we have that deal with strings and are interfacing with the stdlib.

With the help of typeshed, we should be able to get that list relatively easily.

I don't think there are so many places where we directly expose the stdlib and in those places it may be possible to wrap-or-replace. So it at least seems conceviable that the concern that we might be able to mediate access and provide a better API.

I agree that cleaning up our API surface wouldn't be hard, but my main concern is still around the custom handlers.

runtime asserts for example would cause tests to error and we could ensure that we get a useful output.

Is there a way to implement such asserts in Python 2? AFAIK, you can't actually tell the difference between b"..." and "..." at run time, which means we kinda have to bring up the whole Python 3 CI before being able to enforce such rules.

AFAIK, you can't actually tell the difference between b"..." and "..." at run time,

Correct; b or u just opt your string literal into one of the modes, and without them a plain literal chooses one according to python version and from __future__ imports; there's no way to telling whether a string was produced from a tagged literal or an untagged one.

I was in favour of using bytes everywhere as I believed (and still believe) that'd be the "correct" way, e.g. in web-platform-tests/wpt#13246 , but I've become convinced that it'd require a herculean effort to migrate the hundreds of custom handlers and prevent future regression. As you said, there's no way to make unprefixed strings bytes in Py3 and the only way out would be to use the b prefix (which also works / is a no-op in Python 2) when needed (you can't just add b to every string literal). I don't think it's possible to check the usage programatically using lint or more advanced static analysis, which means that humans need to examine and modify the existing code (whereas in the recommended approach, we could fix just the handlers that throw and the tests that produce different results). If we didn't need to consider Python 2 compatibility, I'd probably still prefer this approach.

From my personal experience migrating the code in tools/, static analysis won't help in this case as you mentioned. Python is a dynamic language after all.

After reading your proposal I'm wondering whether the "herculean" effort you mention is about complexity or quantity. If we're talking about the former then I'd agree that making some trade-offs and KISS might be the right approach. But if the concern is the amount of required changes, then I think we should seriously consider the alternative of dealing always with bytes. These kind of "mechanical" changes allow great levels of parallelization (many people working on them at the same time) and after some time it'd be relatively easy to foresee the amount of time we'd need for the transition.

Is there a way to implement such asserts in Python 2? AFAIK, you can't actually tell the difference between b"..." and "..." at run time, which means we kinda have to bring up the whole Python 3 CI before being able to enforce such rules.

We can fully control the parsing of the code here. So we can, if we want, do things like fail the parse for unprefixed literals, or treat all literals as bytestrings unless otherwise specified (c.f. https://www.mercurial-scm.org/repo/hg/rev/1c22400db72d)

I'm probably not the best person to ask that question as I'm intimately familiar with HTTP and text encodings. ("latin-1" might be confusing to some as on the web it means windows-1252.)

@annevk Yeah we can use the more technically correct term iso-8859-1.

In addition to the scale, I'm concerned by the upkeeping after the massive manual conversion (which is challenging, but as @svillar pointed out, can be parallelized). I think if we were to do this, we would need to either hook into the loading mechanism to check for regression (see @jgraham 's comment above, but probably not overwrite) or set up a full Python 3 CI to compare (a subset of) results (which requires us to get almost everything else working in Python 3 first). @svillar @ziransun IIRC, we can already run wpt run and produce the JSON in Python 3 now? I'd very much prefer running Python 3 as opposed to hooking into the loading mechanism.

produce the JSON in Python 3 now

What do you mean by 'produce the JSON'?

Ahh sorry I meant wptreport.json, which would allow us to compare results against Python 2.

Note that we don't have to do the hg thing of actually altering the source, we can do something like tokenize the source using the built-in tokenize module to validate that all strings are specified with a prefix. We can also make that opt-in during the transition with a magic comment that we remove once we've converted all the existing callers. That means we can make things fail before tests even reach CI which reduces the number of cases where we have to explain how to interpret CI results.

I think we (including me) don't have enough concrete evidence (e.g. regarding how much work the bytes conversion in custom handlers would require), and it's clear from the discussions so far that most people prefer to have clear and correct semantics, so I'm proposing that we try the "alternative approach", concretely:

I will double check the wptserve APIs to make sure they always return bytes (note that this is basically the status quo, so I may fix the API which doesn't currently always return bytes without landing a RFC).
I will propose a conversion guideline for custom handlers.
We then try converting a bunch of custom handlers. I'm thinking maybe html/, in other words try https://github.com/web-platform-tests/wpt/pull/22402 again with the new guideline, if @annevk doesn't mind reviewing, or some other people volunteer.

At that point, we can consider how we'd like to proceed. We'd need both consensus around the approach and commitment to work (writing and reviewing the PRs).

I consider this a relatively low-cost experiment and we can always come back to the other approach. WDYT?

I'm happy with that plan. I think we do need a way to enforce that we get correct semantics going forward; presumably this comes under "conversion guidelines" (I also agree it may be premature to actually implement the enforcement mechansim until such a time as there's a clear decision on which path to take).

This plan SGTM, with one caveat/nit-pick:

We then try converting a bunch of custom handlers. I'm thinking maybe html/, in other words try web-platform-tests/wpt#22402 again with the new guideline, if @annevk doesn't mind reviewing, or some other people volunteer.

I am concerned that html/ is too large for a trial run, since the 'only accept bytes' approach requires us to examine and convert every single file. By my rough count, that's 50 files in html/:

$ find html -name "*.py" -not -path "*/.virtualenv/*" | wc -l
50

To me a trial would be more like: pick 5-10 Python handler files at random, and try those.

I clearly did not count the files :) I thought the PR included everything in html. I was also thinking that trial should be about the same size as that PR (<10 files).

@Hexcles iso-8859-1 means windows-1252 on the web too. See https://encoding.spec.whatwg.org/. In standards we use https://infra.spec.whatwg.org/#isomorphic-decode and https://infra.spec.whatwg.org/#isomorphic-encode.

Happy to help review.

Thanks, @annevk . I'll use "isomorphic decode/encode"; we don't really care about the character sets.

Sorry for the delay. I've done a cursory check of wptserve and found few issues (I'll send some PRs later). Keys and values in headers are always byte sequences (though the getter accepts both text and byte strings, so headers.get("cookies") works in both Python versions).

I think we can start experimenting converting a few custom handlers (I suggest we try those in https://github.com/web-platform-tests/wpt/pull/22402 first), @ziransun @stephenmcgruer . I'm proposing the following guidelines:

Headers: always use binary strings for both keys are values; prefer adding b"" prefixes to encoding/decoding the other operands. This also applies to basic auth (which is constructed from headers): request.headers.auth.{username,password} are byte strings.
- Cookies: both names and values of cookies are binary strings (blocked by https://github.com/web-platform-tests/wpt/pull/23327).
Request params: both keys and values for GET and POST params should be ~~text strings~~ binary strings (percent-decoded but not string decoded). Note that they are currently native strings and work is needed to make them binary (blocked by https://github.com/web-platform-tests/wpt/pull/23284).
Response body: ~~always use text strings.~~ The writer takes both text and binary strings, so native strings would technically work most of the time but we want to prefix everything to get consistent semantics. The rules are:
- If the string contains only ASCII characters, either type will work. Choose the one that requires less encoding/decoding (e.g. if the string is constructed from headers, then use bytes; if the string is constructed from query params, then use text).
- If the string contains escaped code points (e.g. “\x0F”, \012”), use bytes.
- If the string contains non-ASCII characters, use text. Note that the Response class will use UTF-8 to encode the output by default. If a handler needs other encoding, it probably already has explicitly set response.encoding.

Bottom line: make sure all strings are either always text or always bytes; all string literals in handlers should be prefixed with b"" or u"".

Thanks @Hexcles ! How about you and I work today to put together a short conversion guideline?

I suggest we try those in web-platform-tests/wpt#22402 first

To be clear, my reading of this is that you suggest we try the exact 8 files touched by that PR (as opposed to all the handlers in html/).

Response body: always use text strings. The Response class defaults to utf-8. If the handler needs other encoding, it probably already has explicitly specified the encoding.

Unless I'm missing something, that's not quite true; if we're given binary data in the Response class we pass it through. That seems like reasonable and sensible behaviour we should keep, but it does mean things could be different between py2 and py3 unless we force people to annotate strings with either u or b.

@jgraham I stand corrected. I've updated the guidelines above. Any other concerns?

On a second thought, I'm not sure whether request params should be text. These percent-encoded strings "should not cause UTF-8 decode without BOM or fail to return failure" and "using anything but UTF-8 decode without BOM when the input contains bytes that are not ASCII bytes might be insecure and is not recommended." (https://url.spec.whatwg.org/#percent-encoded-bytes) But I suppose tests might want to verify exactly that, so it makes sense to return binary bytes, too.

Actually, if we make all request params binary, we might as well make cookies binary, too, since they share the same underlying data structure (MultiDict) and the same rationale also applies.

Note that this will break a lot of things in Python 3, pushing us backward in terms of test pass rates in Python 3.

On request parameters; making them bytes-only is going to hurt in terms of work-needing-done, since it will break every use of stash (will need to either make stash take bytes, or decode everything being put in stash), it will break templating (I guess we'll fix that in the template handlers), and there's just a lot of places in file handlers that take parameters and pass them to other code (e.g. string operations).

That said, I think we do need to do it for consistent semantics. Would be interested if the file-handlers experts (well, basically @annevk) agree.

For representing the value of the name query parameter in the URL /?name=%E2%82%AC either b"\xE2\x82\xAC" or u"\u00E2\u0082\u00AC" are reasonable, but u"\u20AC" (result from UTF-8 decode) would not be. b"\x80" would be a bit clearer as in the u case the upper bound for a code point would have to be \u00FF. Hope that helps.

@Hexcles there's a dedicated cookie API for Set-Cookie? If so, it definitely would be most useful if that can set any kind of header value as otherwise a lot of cookie-related tests would simply have to use the headers API.

@annevk Thanks! I don't like u"\u00E2\u0082\u00AC" so I'll go with b"\xE2\x82\xAC".

There's https://web-platform-tests.org/tools/wptserve/docs/request.html#wptserve.request.Request.cookies for dealing with cookies.

I've merged https://github.com/web-platform-tests/wpt/pull/23284 and https://github.com/web-platform-tests/wpt/pull/23327 . Now almost everything in wptserve is binary string, with the notable exception of URLs/paths.

@ziransun 's https://github.com/web-platform-tests/wpt/pull/23363 (not ready for review yet) will give us an idea of how the migration will look like.

As I understand it, https://github.com/web-platform-tests/wpt/pull/23363 is now ready for review and should be able to give an idea of what the outcome of this RFC would look like. Please take a look, all.

what the outcome of this RFC would look like

Minor correction: what the outcome of the "alternative" approach in the RFC as of now would look like (I haven't updated the RFC yet)

Hi everyone on this thread, I've updated the RFC to reflect our latest consensus, switching the previously recommended and alternative approaches. I also incorporated the guidelines from the earlier comment, which has been tested in this trail PR, into the RFC.

Please take a look again. Thanks!

@Hexcles we discussed a method for comparing the the responses of custom handlers between python 2 and 3. Is a script for that something you expect to have in place when we begin making/reviewing changes?

@foolip I'll look into that in parallel, but don't want to wait for / block on that idea.

As far as I know, we are now at 10 days without any objections; per the RFC process I am going to merge this later this afternoon.

web-platform-tests / rfcs

RFC 49: Python 3 migration for wptserve #49