weppos / publicsuffix-go

Domain name parser for Go based on the Public Suffix List.
MIT License
190 stars 33 forks source link

Accept A-labels #31

Closed jsha closed 8 years ago

jsha commented 8 years ago

In https://github.com/letsencrypt/boulder/pull/2278 we have an issue where TLDs that are IDNs are not recognized by Boulder as ending in a public suffix. Boulder currently calls publicsuffix.DefaultList.Find with the A-label form of the domain.

We have a few options:

  1. Boulder could convert to U-label for querying publicsuffix
  2. publicsuffix-go could switch to looking up by A-label by default
  3. publicsuffix-go could offer a different set of methods for looking up by A-label.

I think (2) is my ideal solution. What do you think?

pfigel commented 8 years ago

FWIW, Go's x/net/publicsuffix accepts both U- and A-labels, but does not work accurately for suffixes with multiple U-labels:

EffectiveTLDPlusOne("foo.example.xn--o1ach.xn--90a3ac") == "example.xn--o1ach.xn--90a3ac"
EffectiveTLDPlusOne("foo.example.упр.срб") == "упр.срб"

I don't think that's quite intentional, probably just a side-effect of how it's been implemented (and should probably be made more consistent, or at least documented).

Anyway, if compatibility with x/net/publicsuffix is a goal, the easiest way to replicate this behaviour would probably be to run idna.ToASCII when generating the rules (which is pretty much what x/net/publicsuffix does).

weppos commented 8 years ago

Thanks for the ticket @jsha @patf

Historically, the PSL never included Punycode entries. Instead, only Unicode values are stored (althought I frankly don't recall by heard the reason of the decision, assuming there was one).

So far, I've refrained from adding U/A conversion in most of my libraries, primarily because of the extra-overhead (and lack of support for good IDNA libraries).

However, Go provides a pretty good IDNA library and I may reconsider providing native support for both A/U directly in the library.

If I understood correctly, Let's Encrypt is currently storing and handling names as A labels. Is it correct?

@jsha I'm not thrilled by 2), mostly because it's exactly the opposite behavior of what the PSL is doing. 3) is a potentially solution, although I see a value in (potentially) not having to determine whether the input is A or U (and leave it up to the library). Of course, assuming the overhead is reasonable.

On other words, I'm inclined to consider having:

What do you think?

weppos commented 8 years ago

Anyway, if compatibility with x/net/publicsuffix is a goal, the easiest way to replicate this behaviour would probably be to run idna.ToASCII when generating the rules (which is pretty much what x/net/publicsuffix does).

It is, as long as it's not a bug. That seems more a bug than a feature to me. I have to investigate it, but if that's the case I will likely not support the buggy version.

In fact, the x/net/publicsuffix comes essentially as an handy transition mechanism (and I bet most of the tools transitioning are not fully supporting IDN).

jsha commented 8 years ago

Historically, the PSL never included Punycode entries. Instead, only Unicode values are stored (althought I frankly don't recall by heard the reason of the decision, assuming there was one).

I was aware of this, but thinking about it some more it's surprising to me. The IDN RFCs indicate that U-labels are intended for presentation, while A-labels should be used for under-the-hood processing. To me, this feels like a historic accident. But perhaps it was for a good reason!

If I understood correctly, Let's Encrypt is currently storing and handling names as A labels. Is it correct?

Correct.

I'm inclined to consider having: a default version that works with U as today a potential helper methods that performs the conversion and/or a single convenient method (in addition to Find) that will automagically determine if A/U and deal with that

I'm in favor of (1) and (2). (3) seems unnecessarily risky. Callers should know whether they are handling A or U, and choose the appropriate function.

weppos commented 8 years ago

I was aware of this, but thinking about it some more it's surprising to me. The IDN RFCs indicate that U-labels are intended for presentation, while A-labels should be used for under-the-hood processing. To me, this feels like a historic accident. But perhaps it was for a good reason!

That's a reasonable feedback, and in fact I must say that we do the same at DNSimple and in pretty much most of the environments where I worked with names.

In the light of that, switching to A name by default may be logical, although the list is Unicode. It's also true that the internal list representation is irrelevant, as I pre-process it in any case.

I need to take a couple of days to experiment a little bit on which internal representation may be more appropriate. I'd also have a quick check with @sleevi and @gerv (which have an extensive experience on how Chromium/Firefox browsers handles the A/U transformation) to see if they have some specific hint (in addition to a possible explanation of why we store Unicode values).

@jsha I see the issue is currently causing you a bug in the rate-limiting. Is it if I take a few days to research the issue, or do you need a super quick fix? In case, you can merge letsencrypt/boulder#2278 and I'll happily provide a patch once publicsuffix-go is updated. Does it work for you?

jsha commented 8 years ago

I'm willing to a wait a few days while you research the issue. The rate limiting issue makes things too strict rather than too lax, and I haven't heard from people who are having problems with it. Thanks!

gerv commented 8 years ago

Why we chose U-label as the default (the A-label is in a comment) is lost in the mists of history, but perhaps it was because we thought that was the "proper" version of the name, and the A-label is an implementation detail. Anyway, as there are lossless conversions between the two, it doesn't seem like a big deal.

The two namespaces also don't overlap - it's not permitted to have a U-label of any sort, even an ASCII-only one, with hyphens in the third and fourth position. So it would be entirely reasonable (I'd say) for a smart API to take either, work out which, do the lookup, and return the results in the form submitted.

But if you don't want to do that, any of these other fixes are fine. But I don't think we'll be changing the PSL.

Gerv

sleevi commented 8 years ago

On Wednesday, October 26, 2016, Gervase Markham notifications@github.com wrote:

Why we chose U-label as the default (the A-label is in a comment) is lost in the mists of history, but perhaps it was because we thought that was the "proper" version of the name, and the A-label is an implementation detail. Anyway, as there are lossless conversions between the two, it doesn't seem like a big deal.

IDNA2003 vs IDNA2008 was the concern at the time, and the fact that a given name can, in theory, have two separate representations. At the time, Mozilla was trying to keep its flexibility - at least based on past code I can tell from around that time.

I haven't heard any substantial movement to IDNA2008 for quite some time. Perhaps it's not happening? Or did it already happen and I missed it?

The two namespaces also don't overlap - it's not permitted to have a U-label of any sort, even an ASCII-only one, with hyphens in the third and fourth position. So it would be entirely reasonable (I'd say) for a smart API to take either, work out which, do the lookup, and return the results in the form submitted.

But if you don't want to do that, any of these other fixes are fine. But I don't think we'll be changing the PSL.

Gerv

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/weppos/publicsuffix-go/issues/31#issuecomment-256369143, or mute the thread https://github.com/notifications/unsubscribe-auth/ABayJ6W9p1v4YdBIJYmYUJ2pd-u4DIzBks5q32cfgaJpZM4KfIXf .

gerv commented 8 years ago

We switched 8 months ago, in Firefox 45: https://bugzilla.mozilla.org/show_bug.cgi?id=1218179

Feel free to join us :-) https://bugs.chromium.org/p/chromium/issues/detail?id=505262

jsha commented 8 years ago

FWIW, the current ACME spec requires IDNA2008: https://ietf-wg-acme.github.io/acme/#rfc.section.6.1.4.

sleevi commented 8 years ago

@gerv I'll poke jshin on that, but I think it would at least highlight a 'possibility' where the U-label form is valuable - to allow consistent decomposition to the A-label in both IDNA2003 and IDNA2008 apps. Of course, I wish we'd just go with IDNA2008.

Perhaps an interim solution is convert to A-label across the board (which will certainly break Chrome's PSL implementation, and presumably Firefox, but we can fix both of those easily) in IDNA2008, and then perhaps add a consistency check as part of the PSL's tests to make sure the IDNA2003 and 2008 representations are consistent, until the PSL full throws IDNA2003 under the bus?

gerv commented 8 years ago

Converting to A-label would presumably break everyone's implementation? What would be the advantages of doing that?

sleevi commented 8 years ago

@gerv - As jsha@ mentioned, it would allow applications to deal only with A-Labels when dealing with the PSL, without having to worry about the IDNA transformation. I certainly know that some of our mobile applications that don't support U-Labels don't want to carry the weight of an IDNA translation, to keep download size small.

weppos commented 8 years ago

@sleevi @gerv I agree that changing the list now may potentially break implementation. In fact, we can even consider providing 2 alternate sources if we decide to go with the transition, and automatically convert one into another.

It's also worth to mention that consumers can perform the translation on their own, pre-processing the file.

In fact, my initial research goal is to understand which representation is better for the specific case of this library. If it happens to be A-labels, I'll be happy to add an extra step in the pre-processing that converts it into A-label strings. https://github.com/weppos/publicsuffix-go/blob/master/cmd/gen/gen.go#L60

I can check directly, but you may save me some time. Given that Chrome/Firefox are pre-processing the list, do you know if the data is stored in A-labels or U-labels?

gerv commented 8 years ago

weppos: if we are going to change the canonical representation, we should roll this change in with all the other format changes we want to make. Perhaps we need a wiki page to track them all? :-)

Firefox stores them as A-labels: http://hg.mozilla.org/mozilla-central/file/default/netwerk/dns/prepare_tlds.py

sleevi commented 8 years ago

Chrome preprocesses the list and converts all U-Labels to A-Labels.

weppos commented 8 years ago

Perhaps we need a wiki page to track them all? :-)

That sounds like a plan @gerv. I'll start it, and post on the new PSL ml for discussion.

So it seems to me that A-labels seems really to be the best representation here. So I'm actually more convinced about pre-process the list into A-names, and switch it to use A-names by default as originally suggested by @jsha

Once done, I may provide U-label helpers.

Thanks for joining the discussion @sleevi and @gerv

kachkaev commented 8 years ago

Any updates or a roadmap on this by a chance? Just wondering as it seems that this issue makes it impossible to obtain let's encrypt certificates for certain IDN domains. Seems like thousands of users may be affected: https://github.com/letsencrypt/boulder/issues/2277

If anyone knows a workaround, could you please share?

weppos commented 8 years ago

@kachkaev I've spent this week finally releasing a major version of my whois library that has been under development for over 1 year. I honestly didn't have a lot of time to work on a patch.

I'll try to work on it this week. Unless any big objection, it's likely I'll switch to A-labels by default. However, it's not a 10-minutes change as I have to put in place a procedure to pre-process the definition, as I definitely don't want to do it at runtime.

I have a script that pulls the PSL every day and auto-submit PR for changes, and that needs to be changed as well.

I'll post updates here, and I'll make sure to share the code in a branch as soon as ready, so that it's possible to start testing it against letsencrypt/boulder#2277 and letsencrypt/boulder#2278

pzb commented 8 years ago

I agree that A-labels are the way to go. Everytime I use one of @weppos' great libraries, I end up adding a check for the xn-- substring in the calling function and doing conversion there and then converting back. This is because the PSL is inherently tied to DNS, which uses A-labels. There is no need to guess how to encode 手表; if it s a TLD is it always xn--kpu716f.

As for IDNA2003 vs IDNA2008, isn't this solved by using A-labels? From what I can tell, the A-label to U-label conversion is the same for both, but the reverse may vary. If correct, then the only question is whether the IDNA2003 and IDNA2008 algorithms have different output for entries on the list, and the only answer the matters is what is in DNS.

slavonnet commented 8 years ago

example "рф". National russian domain! https://xn----itbzecmx.xn--p1ai/

weppos commented 8 years ago

I want to personally thank everyone who provided feedback in this discussion, especially the folks at Let's Encrypt for reporting the issue, and @gerv @sleevi @pzb for providing their personal experience.

The issue is fixed by #40. The library now defaults to A-labels, and pre-process the PSL to make sure it is consistent with the expected internal representation.

New autopull requests will automatically provide the A-label form of any upstream PSL update.

weppos commented 8 years ago

FYI @pzb I'll see if I can replicate the same work on the publicsuffix-ruby lib.