ypcrts / fqdn

RFC-compliant FQDN validation and manipulation for Python.
http://fqdn.readthedocs.io/
Mozilla Public License 2.0

RFC1035 and RFC1912 are incompatible #22

Closed wakemaster39 closed 4 years ago

wakemaster39 commented 4 years ago

As with the original task for the underscore change, the world is not compliant: removing support for digits at the beginning of hostnames has broken a lot of my URLs.

Need the capability to allow hostnames to begin with digits.

wakemaster39 commented 4 years ago

https://tools.ietf.org/html/rfc1912

   Allowable characters in a label for a host name are only ASCII
   letters, digits, and the `-' character.  Labels may not be all
   numbers, but may have a leading digit  (e.g., 3com.com).  Labels must
   end and begin only with a letter or digit.  See [RFC 1035] and [RFC
   1123].  (Labels were initially restricted in [RFC 1035] to start with
   a letter, and some older hosts still reportedly have problems with
   the relaxation in [RFC 1123].)  Note there are some Internet
   hostnames which violate this rule (411.org, 1776.com).  The presence
   of underscores in a label is allowed in [RFC 1033], except [RFC 1033]
   is informational only and was not defining a standard.  There is at
   least one popular TCP/IP implementation which currently refuses to
   talk to hosts named with underscores in them.  It must be noted that
   the language in [1035] is such that these rules are voluntary -- they
   are there for those who wish to minimize problems.  Note that the
   rules for Internet host names also apply to hosts and addresses used
   in SMTP (See RFC 821).

Good news with this: we found a source where underscores are allowed in hostnames, even if it is voluntary only.

wakemaster39 commented 4 years ago

I think rfc1035 might need to become a flag that modifies the regex. I think the default should be the latest RFC spec.

This then raises the underscore question: should voluntary advice be the default, or a flag like we have it now?
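A rough sketch of how such a flag could switch the label regex. The helper name `label_pattern` and both parameter names are hypothetical, made up for illustration; this is not the library's code or API.

```python
import re


def label_pattern(strict_rfc1035=False, allow_underscores=False):
    """Build a regex for one hostname label (hypothetical helper).

    strict_rfc1035=True  -> first character must be a letter (RFC 1035).
    strict_rfc1035=False -> first character may also be a digit (RFC 1123).
    """
    extra = "_" if allow_underscores else ""
    first = ("a-zA-Z" if strict_rfc1035 else "a-zA-Z0-9") + extra
    last = "a-zA-Z0-9" + extra
    middle = "a-zA-Z0-9" + extra + "-"  # trailing hyphen is a literal in a class
    # One edge character, then optionally up to 61 middle characters and a
    # closing edge character: 1 to 63 characters in total.
    return re.compile(
        "[{0}](?:[{1}]{{0,61}}[{2}])?".format(first, middle, last)
    )


# Use fullmatch so the whole label has to fit the pattern.
relaxed = label_pattern()
strict = label_pattern(strict_rfc1035=True)
print(bool(relaxed.fullmatch("3com")), bool(strict.fullmatch("3com")))
```

With the default being the relaxed form, callers who need the older behaviour would opt in through the flag.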

ypcrts commented 4 years ago

hmm. and i even own Internet host names with labels that are entirely digits. oops.

however the original use case for this module was to validate what could get a TLS cert signed by a CA, and they don’t accept underscores.

I suppose we could shift away from regex entirely and reimplement rules by using sets and lengths.

here’s the rulesets i see right now:

  1. RFC 1035 preferred name syntax for labels: no initial digits, no underscores, no initial or terminal hyphens - though i’m not sure what the use case for this would be, since it seems outdated. it’s already implemented anyway.

  2. RFC1123: allows labels to start with digits.

  3. RFC1033: not authoritative, but allows underscores anywhere in Internet hostnames

  4. a restriction requiring a minimum number of labels, defaulting to 2, because that’s how this module originally shipped, and i don’t want to break anyone.

  5. the domain names that CAs will sign TLS certs for: no underscore + 2 label minimum
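A minimal sketch of the "sets and lengths" idea applied to the rulesets above. The function name `is_valid_hostname` and its parameters are hypothetical and only illustrate how the rules could be toggled; the 253-character limit on the text form is the commonly cited figure and is treated here as an assumption.

```python
import string

LETTERS = frozenset(string.ascii_letters)
DIGITS = frozenset(string.digits)


def is_valid_hostname(name, allow_leading_digit=True,
                      allow_underscores=False, min_labels=2):
    """Check a hostname with character sets and lengths (hypothetical sketch)."""
    # Ignore one trailing dot (absolute form) before splitting into labels.
    if name.endswith("."):
        name = name[:-1]
    labels = name.split(".")
    # Assumed limit of 253 characters for the text form without trailing dot.
    if len(name) > 253 or len(labels) < min_labels:
        return False

    # Characters allowed at the edges of a label, and anywhere inside it.
    edge = LETTERS | DIGITS | ({"_"} if allow_underscores else set())
    inner = edge | {"-"}
    # Ruleset 1 (RFC 1035) forbids a leading digit; ruleset 2 (RFC 1123) allows it.
    first = edge if allow_leading_digit else edge - DIGITS

    for label in labels:
        if not 1 <= len(label) <= 63:
            return False
        if label[0] not in first or label[-1] not in edge:
            return False
        if not set(label) <= inner:
            return False
    return True


print(is_valid_hostname("3com.com"))                             # RFC 1123 style
print(is_valid_hostname("3com.com", allow_leading_digit=False))  # RFC 1035 style
```

Under these assumptions the CA-oriented ruleset is simply the defaults (no underscores, two-label minimum), and the other rulesets are reached by flipping one argument each.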

the question of what should be the default is painful because of the 2 label minimum and the original focus on internet hostnames for CAs. if we could actually nail this down properly, i’d be happy to implement one final minor version supporting python 2.7 that focuses on backwards compatibility (internet hostnames for CAs), and then do a major version bump with a better default (closer to Chromium or RFC1123) and drop python2 support. Maybe the major version bump should also break the api in true python3 fashion so as to ensure anyone who upgrades blindly ends up needing to adjust their code or downgrade.

I’m really against breaking changes but I think my original read of the requirements was quite bad.

ypcrts commented 4 years ago

I see I also put in another restriction that suits Internet hostnames: that the root label (TLD) cannot be only digits. That one came from RFC 3696 s2 which isn’t authoritative, but it seems appropriate for this use case.

RFC 2181 is the latest and is authoritative. It also describes another use case which is purely for DNS, where the only restrictions are label length and total length, and the label chars can be any binary data, unrestricted. Perhaps the new api should also implement this.
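A small sketch of that DNS-only check, assuming the name is already split into raw byte labels so that label contents need no escaping. The function name `is_valid_dns_labels` is hypothetical; the limits used are the 1 to 63 octets per label and 255 octets per full name (counted in wire form) given in RFC 2181.

```python
def is_valid_dns_labels(labels):
    """Length-only check on a sequence of bytes labels (hypothetical sketch).

    Label contents are not inspected at all: any octet values are accepted.
    """
    # Each label must be between 1 and 63 octets long.
    if any(not 1 <= len(label) <= 63 for label in labels):
        return False
    # Wire form: one length octet per label plus its data, plus the
    # terminating zero-length root label.
    wire_length = sum(len(label) + 1 for label in labels) + 1
    return wire_length <= 255


# Arbitrary binary data and underscores pass; only lengths matter.
print(is_valid_dns_labels([b"\x00\xff_", b"example", b"com"]))
print(is_valid_dns_labels([b"a" * 64, b"com"]))
```

Taking labels as a list rather than a dotted string is a design choice for the sketch: it sidesteps the question of how a literal dot inside a binary label would be escaped.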

wakemaster39 commented 4 years ago

I think if we kept the default compatible it should be fine. I also don't know about catering to older specs. I mean the library could, but with dropping 2.7 and going to something like 3.6 or 3.7+ (drop all the EOL'd pythons) I think catering to the latest spec is best.

I think you could then add a couple of new properties, is_1033_compliant or something, if we really needed to show which historical RFCs a name breaks, but I don't have a use case for it. Meeting the latest spec and allowing the flexibility of underscores covers my current use case.

ypcrts commented 4 years ago

yeah, there doesn’t seem to be a real world use case where rfc1033 (without rfc1123’s relaxation) is practical.

wakemaster39 commented 4 years ago

#24 I think resolves this; it just goes back to the original behavior of accepting digits.