Validating internationalized mail addresses in <input type="email">

whatwg / html

HTML Standard

https://html.spec.whatwg.org/multipage/

Validating internationalized mail addresses in <input type="email"> #4562

Open jrlevine opened 5 years ago

jrlevine commented 5 years ago

This is more or less the same issue as https://www.w3.org/Bugs/Public/show_bug.cgi?id=15489 but I think it's worth another look since a lot of things have changed.

The issue is that the e-mail address validation pattern in sec 4.10.5.1.5 only accepts ASCII addresses, not EAI addresses. Since last time, large hosted mail systems including Gmail, Hotmail/Outlook, Yahoo/AOL (soon if not yet), and Coremail handle EAI mail. On smaller systems Postfix and Exim have EAI support enabled by a configuration flag.

On the other side, writing a JavaScript pattern to validate EAI addresses has gotten a lot easier, since JS now has Unicode character class patterns like /(\p{L}|\p{N})+/u, which matches a string of letters and digits as Unicode defines them.
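
For comparison outside the browser, here is a minimal Python sketch of the same idea. Python's standard `re` module does not support `\p{...}` classes, so this version checks Unicode general categories through `unicodedata` instead. The function name is made up for illustration, and unlike the unanchored JS pattern it tests the whole string.

```python
import unicodedata


def is_letters_and_digits(text: str) -> bool:
    # Rough equivalent of /^(\p{L}|\p{N})+$/u: every code point must have a
    # general category starting with "L" (letter) or "N" (number).
    return bool(text) and all(
        unicodedata.category(ch)[0] in ("L", "N") for ch in text
    )
```

This only shows that Unicode-aware character classes are easy to express; it is not an address validator.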

Last time around the consensus seemed to be that EAI input fields should be marked as unicode or eai or the like, since it'll be a while before all mail systems handle EAI.

For the avoidance of doubt, when I say EAI, I mean both Unicode local parts and Unicode domain names, since that's what EAI mail systems handle. There is no benefit to translating IDNs to A-labels (the ones with punycode) since that's all handled deep inside the mail system.

klensin commented 4 years ago

--On Monday, August 3, 2020 17:06 -0700 John L notifications@github.com wrote:

I'd just leave the labels as 1*63( utext ), since utext is a superset of atext. This pattern can only be an approximation of what's legal so I wouldn't try too hard to be clever. I think it matches actual addresses pretty well but there are valid addresses it'll reject like "...."@example.com and invalid ones it'll accept like ....@example.com or anything with a non-existent domain name.

This seems reasonable to me... and, if we change the quoting rules, both will become invalid and hence properly rejected.

Again, the principle should be that addresses that the protocols allow and that real people, in the real world, are likely to actually use do not get rejected by an HTML-based mechanism. That principle will allow a certain amount of nonsense, but the precise rules are too hard and, will, as you more or less point out, miss the many cases in which an address has valid syntax but just doesn't exist no matter what is done.

best, john

nicowilliams commented 4 years ago

On Mon, Aug 03, 2020 at 05:06:58PM -0700, John L wrote:

I'd just leave the labels as 1*63( utext ), since utext is a superset of atext. This pattern can only be an approximation of what's legal so I wouldn't try too hard to be clever. I think it matches actual addresses pretty well but there are valid addresses it'll reject like "...."@example.com and invalid ones it'll accept like ....@example.com or anything with a non-existent domain name.

Client-side validation is helpful to detect errors, but not to protect servers. So how about: if an address does not validate, allow it anyway but display it in some way as to indicate the likely error?

jrlevine commented 4 years ago

Client-side validation is helpful to detect errors, but not to protect servers. So how about: if an address does not validate, allow it anyway but display it in some way as to indicate the likely error?

It is my impression that the main point of the validation RE is to reject nonsense addresses like nobody@here. We nerds know all the corner cases and how to persuade our mail servers to handle wacky addresses, but for normal people the reasonable approach is to insist that the user provide an address that validates against the RE before proceeding.

annevk commented 4 years ago

I don't really understand some of the above remarks. We're not confined to regular expressions or JavaScript. At the same time I don't think browser email validation should be stricter than URL validation when it comes to domain names (no host length checking when parsing URLs).

@aphillips how do atext and utext mix? UTF8-non-ascii are byte sequences, not code points.

aphillips commented 4 years ago

@annevk That's the reason I prefer the last ABNF, where I used code points and not byte sequences:

utext         = atext / %x80-D7FF / %xE000-10FFFF ; unreserved printable ASCII characters or any non-ASCII Unicode code points

... and then just used utext instead of atext. As noted above, I need to provide a fix for label, since atext isn't appropriate for the right hand side, so let's do that here:

email               = 1*( utext / "." ) "@" label *( "." label )
atext               = < as defined in RFC 5322 section 3.2.3 >
utext               = atext / %x80-D7FF / %xE000-10FFFF
label               = label-start [ *61label-part label-start ]
label-start         = ALPHA / DIGIT / %x80-D7FF / %xE000-10FFFF
label-part          = label-start / "-"

I could also do away with atext by relisting the code points (since importing the definition in RFC6532 gets us bytes):

utext           = ALPHA / DIGIT / "!" /                    ; unreserved printable ASCII
                      "#" / "$" / "%" / "&" / "'" / "*" /  ; as defined in RFC5322 section 3.2.3
                      "+" / "-" / "/" / "=" / "?" / "^" / 
                      "_" / "`" / "{" / "|" / "}" / "~" /
                      %x80-D7FF / %xE000-10FFFF            ; or any non-ASCII Unicode

If you think we shouldn't impose host length checking, we can remove the 61 from the label production. As noted in preceding comments, the length of a label might have a shorter limit if any non-ASCII are used (as few as 14 or so if supplementary characters are used). By keeping a length of 63 we're being more-or-less compatible with any existing length checks.
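
To make the length point concrete, here is a small Python sketch that measures a label in its A-label form rather than in code points. It uses the standard library's `punycode` codec and skips all IDNA mapping and validity checks; the function names and the constant are illustrative assumptions, not anything a browser is known to use.

```python
MAX_LABEL_OCTETS = 63  # DNS limit on a single label


def to_a_label(label: str) -> str:
    """Return the ASCII form of one label (no IDNA mapping or validation)."""
    if label.isascii():
        return label
    # Non-ASCII labels are Punycode-encoded and given the ACE prefix "xn--".
    return "xn--" + label.encode("punycode").decode("ascii")


def label_fits(label: str) -> bool:
    """Apply the 63-octet limit to the A-label form, not the code point count."""
    return 1 <= len(to_a_label(label)) <= MAX_LABEL_OCTETS
```

A label of 63 ASCII letters fits, while a label of 63 non-ASCII letters does not, because the prefix and the encoded digits both count against the limit.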

I also suspect I should exclude the C1 controls by changing %x80 to %xA0.

We about ready for text? 😉

annevk commented 4 years ago

I still don't understand. You're defining utext in terms of atext yet one is code points and the other is bytes. (The other thing to look into regarding length might be to check what browsers do today for type=email. I suspect they don't check it.)

aphillips commented 4 years ago

The text above the ABNF says that the character set is Unicode and the treatment amounts to code points rather than bytes. If I just adopt the utext production from my comment above (and get rid of atext altogether), that would make the definition clear, no?

From a quick check, FF and Chrome both length check the label (right hand side). Neither appears to check the left hand side (which is consistent with the ABNF's 1*). In fact, both length check the A-label length of non-ASCII domain names, so it's more complex already.

annevk commented 4 years ago

I see, that would work. And yeah, looking at https://searchfox.org/mozilla-central/source/dom/html/input/SingleLineTextInputTypes.cpp#181-243 I guess the email validation method invokes IDN differently from the URL parser. Good times. In general I hope that if we tighten this up reuse of https://url.spec.whatwg.org/#host-parsing or https://url.spec.whatwg.org/#concept-domain-to-ascii is feasible. It doesn't seem good to add more primitives just for email validation (and potentially some normalization, as Chrome appears to be doing).

klensin commented 4 years ago

--On Tuesday, August 4, 2020 09:53 -0700 Anne van Kesteren notifications@github.com wrote:

I see, that would work. And yeah, looking at https://searchfox.org/mozilla-central/source/dom/html/input/SingleLineTextInputTypes.cpp#181-243 I guess the email validation method invokes IDN differently from the URL parser. Good times. In general I hope that if we tighten this up reuse of https://url.spec.whatwg.org/#host-parsing or https://url.spec.whatwg.org/#concept-domain-to-ascii is feasible. It doesn't seem good to add more primitives just for email validation (and potentially some normalization, as Chrome appears to be doing).

Yes.

My other suggestion, which may have little to do with the immediate problem, is at least to use caution in applying https://url.spec.whatwg.org/#concept-domain-to-ascii because the decoding it specifies is based on [Unicode toASCII] which, in turn, depends on RFC 3490, which was a March 2003 document that became obsolete a decade ago. I see at least two issues:

(1) The ToASCII operation in UTR #46 and the one in the referenced WHATWG specification use different sets of flag settings. Even if one wants to rely on Unicode specifications (particularly UTR#46) rather than IETF ones, this is an invitation to "works some places and not others" confusion.

(2) While I assume browsers are operating consistent with the WHATWG spec (and hence UTR#46), many, probably most, email systems that allow non-ASCII addresses are conformant to the IDNA2008 specs instead. Since deliverability --an important component of actual validity-- of email depends on the latter, the difference may result in false negatives and unnecessary and inappropriate rejection of names in various edge cases. I hope we are agreed that is a bad idea.

best, john

annevk commented 4 years ago

John, UTR 46 can be used in IDNA2008-compatible mode (even if not immediately apparent) and, apart from Chrome, browsers use it in that way. (Edit: to be clear, the URL Standard also uses it that way.)

aphillips commented 4 years ago

This seems like a viable approach and would describe what browsers actually do (since it seems that we're catching the spec up to actual practice, not spurring browser vendors into action). The bottom part of the input type=email section needs more work than I originally thought, since items like the perl/JS regex example would need to be removed.

For the ABNF, perhaps:

email      = localpart "@" domain
localpart  = 1*( utext / "." )
utext      = ALPHA / DIGIT / "!" /                    ; unreserved printable ASCII
                 "#" / "$" / "%" / "&" / "'" / "*" /  ; as defined in RFC5322 section 3.2.3
                 "+" / "-" / "/" / "=" / "?" / "^" / 
                 "_" / "`" / "{" / "|" / "}" / "~" /
                 %x80-D7FF / %xE000-10FFFF            ; or any non-ASCII Unicode
domain     = < a "valid host string", see URL section 3.4 >
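
The localpart and utext productions above can be sketched in Python as a code point check. This is an illustration only: the helper names are made up, and the domain production is deliberately left out because it defers to the URL Standard.

```python
# The ASCII specials listed in the utext production (atext minus ALPHA / DIGIT).
ATEXT_SPECIALS = set("!#$%&'*+-/=?^_`{|}~")


def is_utext(ch: str) -> bool:
    """True if a single code point matches the utext production."""
    cp = ord(ch)
    if cp < 0x80:
        return ch.isalnum() or ch in ATEXT_SPECIALS
    # Any non-ASCII scalar value; the gap D800-DFFF excludes surrogates.
    return 0x80 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def is_valid_localpart(localpart: str) -> bool:
    """localpart = 1*( utext / "." )"""
    return bool(localpart) and all(ch == "." or is_utext(ch) for ch in localpart)
```

Working on code points sidesteps the bytes-versus-code-points question raised earlier, since Python strings are already sequences of code points.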

The text can reference URL 3.5 (#host-parsing, as suggested). I don't have time today, but I'll work on a pull request later in the week to see what this looks like as a draft.

annevk commented 1 year ago

I've been thinking about trying to move this issue forward a bit again. There are two aspects I'd like us to consider. One is validation, where I think @aphillips is on the right track, though it needs some more tweaks. The other is "normalization", which the specification suggests, but nobody implements.

Validation: I think we essentially want a non-ABNF version of what @aphillips wrote above, perhaps banning non-domain hosts as per feedback in #5799. Roughly:

1. Split input on @.
2. If there are not exactly two resulting strings whose size is greater than 1, then return failure.
3. If the first resulting string has a code point outside of "utext", then return failure.
4. If the second resulting string contains a %, then return failure. (I'm assuming we don't want to allow percent-encoded hosts. They are not allowed today.)
5. If running the host parser on the second resulting string does not result in a domain, then return failure. (I'm assuming we want to reject IP addresses.)
6. Return "it's valid".
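
A rough Python sketch of the steps above, under several loudly stated assumptions: the size condition is read as "both strings are non-empty", "." is allowed in the first string as in the earlier localpart production, and the URL Standard's host parser is replaced by a crude stand-in built from `ipaddress` and Python's IDNA2003-based `idna` codec, since the standard library has no WHATWG host parser. All names are hypothetical.

```python
import ipaddress

ATEXT_SPECIALS = set("!#$%&'*+-/=?^_`{|}~")


def is_utext_or_dot(ch: str) -> bool:
    cp = ord(ch)
    if cp < 0x80:
        return ch.isalnum() or ch == "." or ch in ATEXT_SPECIALS
    return cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def looks_like_domain(host: str) -> bool:
    # Stand-in for "host parser returns a domain"; NOT the real algorithm.
    if host.startswith("["):
        return False  # bracketed IPv6 literal
    try:
        ipaddress.ip_address(host)
        return False  # bare IP address
    except ValueError:
        pass
    try:
        host.encode("idna")  # IDNA2003 codec, assumed good enough for a sketch
    except UnicodeError:
        return False
    return True


def is_valid_email(value: str) -> bool:
    parts = value.split("@")
    if len(parts) != 2 or not all(parts):
        return False
    local, host = parts
    if not all(is_utext_or_dot(ch) for ch in local):
        return False
    if "%" in host:
        return False
    return looks_like_domain(host)
```

The stand-in is much looser than a real host parser (it does not catch forbidden host code points or unusual IPv4 spellings), so treat it as a reading aid for the step order rather than a reference.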

Normalization: because EAI is not widely deployed it seems useful if when an email value is submitted the part after the @ is in Punycode. The specification already attempts to cover this with

User agents may transform the value for display and editing; in particular, user agents should convert punycode in the domain labels of the value to IDN in the display and vice versa.

but this is not implemented. Making it a normative requirement that the value exposed to the API and server is normalized accordingly would help here I think. This would be a bit novel, but the existing infrastructure does allow for it.

Compatibility: WebKit currently doesn't allow non-ASCII so this should all be okay. Chromium and Gecko allow some non-ASCII and thus might see some impact. I hope it's still limited due to these kinds of email addresses generally not being widespread.

vdukhovni commented 1 year ago

Validation: I think we essentially want a non-ABNF version of what @aphillips wrote above, perhaps banning non-domain hosts as per feedback in #5799. Roughly:

Split input on @.

If there are not exactly two resulting strings whose size is greater than 1, then return failure.

More correctly, parse the address. The localpart can legitimately contain an "@" when quoted:

"some@where"@example.net

Should these always be rejected???

If the second resulting string contains a %, then return failure. (I'm assuming we don't want to allow percent-encoded hosts. They are not allowed today.)

No. There is no such thing as "percent-encoded" hosts. There are only "percent-encoded" URLs, which may include path components and/or query parameters. By the time something is validating an address, these have been decoded, and there is nothing special about '%hh' with hexadecimal digits 'h'.

The '%' character is no more special in the domainpart of an email address than any of, e.g., "#", "!" or "?".

The domain part needs to be a valid hostname (or when that's also acceptable an address-literal).

If running the host parser on the second resulting string does not result in a domain, then return failure. (I'm assuming we want to reject IP addresses.)

You should be clear what you mean by "IP addresses".

- "user@192.0.2.1" is a syntactically valid email address with
  a domain part that does not exist (there is no ".1" TLD).
  It is not clear that rejecting these belongs in a syntax
  check.

- "user@[192.0.2.1]" is also a syntactically valid email address,
  but the domainpart is an address literal, which should only
  be accepted in limited (generally site-internal) contexts.

Normalization: because EAI is not widely deployed it seems useful if when an email value is submitted the part after the @ is in Punycode. The specification already attempts to cover this with

Since there is no equivalent mapping for the localpart, it is unclear what benefit this has. If the address is not an all-ASCII RFC5322 form, why not use UTF8 also for the domain part?

User agents may transform the value for display and editing; in particular, user agents should convert punycode in the domain labels of the value to IDN in the display and vice versa.

but this is not implemented. Making it a normative requirement that the value exposed to the API and server is normalized accordingly would help here I think. This would be a bit novel, but the existing infrastructure does allow for it.

Because that language is deliberately about display. "Normalising" the domainpart to A-labels seems odd. The specification for EAI addresses in X.509 certificates has the domain part in U-label form when the localpart contains non-ASCII codepoints:

https://datatracker.ietf.org/doc/html/rfc8398#section-3

This document further refines internationalized Mailbox ABNF rules as described in [RFC6531] and calls this SmtpUTF8Mailbox. In SmtpUTF8Mailbox, labels that include non-ASCII characters MUST be stored in U-label (rather than A-label) form [RFC5890]. This restriction removes the need to determine which label encoding, A- or U-label, is present in the domain. As per Section 2.3.2.1 of [RFC5890], U-labels are encoded as UTF-8 [RFC3629] in Normalization Form C and other properties specified there. In SmtpUTF8Mailbox, domain labels that solely use ASCII characters (meaning neither A- nor U-labels) SHALL use NR-LDH restrictions as specified by Section 2.3.1 of [RFC5890] and SHALL be restricted to lowercase letters.

annevk commented 1 year ago

@vdukhovni HTML subsets the number of email addresses deliberately. If you want that to change I suggest filing a new issue. That's not the topic of this thread.

vdukhovni commented 1 year ago

@vdukhovni HTML subsets the number of email addresses deliberately. If you want that to change I suggest filing a new issue. That's not the topic of this thread.

No worries, that aspect wasn't the main point of my response. Sadly all the example address forms I entered were masked by github. Hard to talk about email address syntax without examples, I'll see whether some hand-editing could fix this...

[ Hand editing helped. Lesson learned: replying to github comments by email sadly rather limits the available markup and also modifies the content. :-( ]

jrlevine commented 1 year ago

Normalization: because EAI is not widely deployed it seems useful if when an email value is submitted the part after the @ is in Punycode.

No, this misunderstands how EAI works. An EAI address is unicode@unicode, an ASCII address is ascii@ascii. There's no such thing as unicode@ascii other than by default since ASCII is a subset of Unicode. This will not help mail delivery and will confuse and annoy Asian users who see their Hindi or Chinese address turned into ASCII glop.

I agree with you about % because I've never seen an address with a % other than as a test. A long time ago there was an informal convention to use % for source routing, so a whole lot of mail systems reject % to prevent attempts to do that.

vdukhovni commented 1 year ago

Normalization: because EAI is not widely deployed it seems useful if when an email value is submitted the part after the @ is in Punycode.

No, this misunderstands how EAI works. An EAI address is unicode@unicode, an ASCII address is ascii@ascii. There's no such thing as unicode@ascii other than by default since ASCII is a subset of Unicode. This will not help mail delivery and will confuse and annoy Asian users who see their Hindi or Chinese address turned into ASCII glop.

I agree with you about % because I've never seen an address with a % other than as a test. A long time ago there was an informal convention to use % for source routing, so a whole lot of mail systems reject % to prevent attempts to do that.

Is this a reply to me or Anne? Sadly my email response also lost the "> " quotes until I hand-edited the response to double them up...

On the point of %, Anne's comment was about URL-encoding %<hexdigit><hexdigit>, which isn't a thing at the layer in question. It wasn't about legacy source route syntax.

jrlevine commented 1 year ago

To try and be clearer, rejecting % in addresses is the right thing to do, even though it has nothing to do with URL encoding.

josepharhar commented 1 year ago

If running the host parser on the second resulting string does not result in a domain, then return failure. (I'm assuming we want to reject IP addresses.)

I think that chromium currently allows ip addresses, so I'm not sure if I can make this change. I am generally supportive to try something new but if the breakage is too big then I'll have to roll it back.

annevk commented 1 year ago

@jrlevine in particular I was hoping it would help with ascii@unicode addresses. I strongly suspect those would be more portable when submitted as ascii@ascii. What the end user sees in the end should not be impacted here one way or another. That's up to email clients. (And for the control we'd just show the user what they typed, so they wouldn't be confused by it either.)

(Forbidding % in hosts has to do with URL percent-decoding in the URL Standard's host parser.)

jrlevine commented 1 year ago

in particular I was hoping it would help with ascii@unicode addresses

No, really, you do not want to do that. The experimental versions of EAI had a bunch of clever tricks that were supposed to help with ASCII backward compatibility. They all turned out to be confusing and not useful and led to even more problems like EAI responses to non-EAI mail systems, and were dropped from the final version.

Consider this made up but I think plausible example, with two addresses:

  renée@épost.quebec
  marie@épost.quebec

You would turn them into

  renée@épost.quebec
  marie@xn--post-9oa.quebec

So maybe one will be accepted, one won't, and the user will be baffled. This is not a bug you can patch around, so please do not try. When someone gives you an address, either accept it or reject it but don't try to rewrite it.
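
The asymmetry in that example can be reproduced with a short Python sketch of the rewrite being argued against. It assumes the rule "convert the domain to A-labels only when the local part is ASCII", and it uses Python's built-in `idna` codec, which implements IDNA2003 rather than IDNA2008 or UTS46. The function name is invented for illustration.

```python
def rewrite_domain_to_a_labels(address: str) -> str:
    """Rewrite ascii@unicode to ascii@ascii; leave everything else alone."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local.isascii():
        # No "@", or a non-ASCII local part that has no ASCII equivalent.
        return address
    return local + "@" + domain.encode("idna").decode("ascii")
```

Two mailboxes at the same domain come out with different domain spellings, which is the source of the confusion described above.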

annevk commented 1 year ago

Given that EAI is still far from widespread support after this many years, that honestly seems preferable. EAI supporting systems already have to support both IDN and Punycode so I'm not sure why we'd be worried about those.

(This is also what Chromium is already doing for user-supplied values and as they noted early on in the thread they'd be rather worried about breaking compatibility around that.)

annevk commented 1 year ago

@vdukhovni I think what you're saying about IP addresses is what was also pointed out in https://github.com/whatwg/html/pull/5799#discussion_r473929302. Let's try in question form:

Given an email user ipv4 on IPv4 address 123.123.123.123, what's their email address?
Given an email user ipv6 on IPv6 address [::1], what's their email address?

aphillips commented 1 year ago

I18N discussed this today (2023-10-05). This is a summary of our basic discussion.

The purpose of input type=email is not to actually send email, but to allow input of email addresses on the Web in forms. These may ultimately be composed into an email or they might be used for other purposes (user logins are one common use).

EAI does not impose very many restrictions on left-hand-side values (local part) and, further, allows SMTP receiving servers to perform various normalizations or transformations when matching the local part to "a mailbox". Thus HTML should not impose unnecessary restrictions or transformations on the left-hand-side. The validation should take into account any quoting defined by the various email standards.

The right-hand-side generally contains a domain name, which is something browsers already know about and thus can apply additional validation to. EAI prefers u-labels to a-labels because that is user friendly, that is, transforming marie@épost.quebec to marie@xn--post-9oa.quebec is not user-friendly (and doubly so if the consumer of the HTML form would like to match the display-friendly name the user might have entered elsewhere). However, RHS validation could extend to checking for sequences that don't make sense in IDNA.

Overall, I think I18N find ourselves mostly in agreement with @annevk's comment. However, I don't think that browsers should transform u-labels to a-labels on the right hand side.

Note about the relative deployment of EAI: there are reasonable reports that at least one rather large country has widespread (mostly internal) use of EAI addresses. Non-interoperability of EAI across the ecosystem has been one of the barriers to more widespread adoption.

Finally, the tests in 7f4e5db look like what I'd expect to see.

vdukhovni commented 1 year ago

@vdukhovni I think what you're saying about IP addresses is what was also pointed out in #5799 (comment). Let's try in question form:

Given an email user ipv4 on IPv4 address 123.123.123.123, what's their email address?

Given an email user ipv6 on IPv6 address [::1], what's their email address?

This is covered in Section 3.4.1 of RFC5322, which delegates the specific syntax of domain-literals to Section 4.1.3 of RFC5321. When domain-literal addresses are supported (which is properly a question of semantics, rather than syntax, but perhaps in some applications it could be acceptable to reject domain-literals as excluded syntax), their proper format is:

user@[192.0.2.1], user@[127.0.0.1]
user@[IPv6:2001:db8::feed:cafe], user@[IPv6:::1]
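
As an illustration of those two forms, here is a minimal Python sketch that recognizes a bracketed address literal in the domain part using the standard `ipaddress` module. It assumes the "IPv6:" tag is matched case-insensitively and ignores the general address-literal form that RFC 5321 also defines; the function name is hypothetical.

```python
import ipaddress
from typing import Optional, Union

IPAddress = Union[ipaddress.IPv4Address, ipaddress.IPv6Address]


def parse_address_literal(domain: str) -> Optional[IPAddress]:
    """Return the IP address inside "[...]", or None if it is not a literal."""
    if not (domain.startswith("[") and domain.endswith("]")):
        return None
    inner = domain[1:-1]
    try:
        if inner[:5].lower() == "ipv6:":
            return ipaddress.IPv6Address(inner[5:])
        # Without the tag, only a dotted-quad IPv4 address is accepted.
        return ipaddress.IPv4Address(inner)
    except ValueError:
        return None
```

Note that an untagged "[::1]" is not accepted by this sketch, and an unbracketed "192.0.2.1" is not treated as a literal at all.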

[ FWIW, meanwhile, my "user experience" with EAI in Thunderbird is rather subpar, I can't set a UTF8 address as either primary or secondary "Identity" for a mail account. I can only manually set such a "From:" address on a per-message basis, and then still run into some issues, the message-id header created by Thunderbird ended up empty, which was rejected by Gmail... Also the envelope sender ended up coincidentally an address-literal: anonymous@[IPV6:...]

Also OpenDKIM fails to match EAI domains against my signer table, which has both A-label and U-label forms listed. ]

jrlevine commented 1 year ago

These may ultimately be composed into an email or they might be used for other purposes (user logins are one common use).

Good point, thanks. This matches the experience of EAI designers that trying to do partial backward compatibility just makes things messier and does not interoperate well.

Note about the relative deployment of EAI: there are reasonable reports that at least one rather large country has widespread (mostly internal) use of EAI addresses. Non-interoperability of EAI across the ecosystem has been one of the barriers to more widespread adoption.

I'm aware of a community of EAI mail users in Thailand, and one of the states in India is reportedly assigning EAI email addresses to its citizens along with their Aadhar IDs. (Indian states have more people than many European countries.) None of this mail is in English or even Latin script so it's not surprising we wouldn't have encountered it.

klensin commented 1 year ago

--On Thursday, October 5, 2023 07:59 -0700 Anne van Kesteren @.***> wrote:

@vdukhovni I think what you're saying about IP addresses is what was also pointed out in https://github.com/whatwg/html/pull/5799#discussion_r473929302 . Let's try in question form:

Given an email user ipv4 on IPv4 address 123.123.123.123, what's their email address?

Given an email user ipv6 on IPv6 address [::1], what's their email address?

My apologies in advance for digging a bit further into the actual email specifications, especially SMTP as well as the i18n extensions developed by the EAI WG, but, to the extent these are address validation questions, it is important to understand the actual rules.

Before it is possible to answer either question above, you need to define what you think "email user" means and, in particular, how one defines the relationship between "email user" --presumably a name-- and the associated email address. Remember that, if my name were Donald Duck, that does not imply that my email user name (presumably something I use to communicate with/ log into my email system or client) is necessarily either "Donald" or "Duck". Even if it were, nothing (other than fear of trademark lawyers or possibly restrictive registry or email provider policies) would prevent me from having an email address of mıckу@魔法.königreich.example

While that is an extreme case, many people have chosen, for some purposes, to use email addresses that are deliberately hard to type or confusing to the user as protection against spammers, phishers, and other unsavory forces.

One cannot interpret or validate an email address without a deep knowledge of, or a conversation with, the delivery SMTP server for that address. That has been the case since even before

The issues include whether "%" (or, for that matter, "!" or "@") in the local-part are given special interpretations or not -- only the delivery server can decide. In that regard, syntactically, @.**@. and "abc\ @. are perfectly valid email addresses and that only the relevant delivery server can actually determine whether @. and @.*** represent the same address.

There are also no protocols that guarantee being able to reach that delivery server to ask: the closest thing that exists, the SMTP VRFY command, reaches the next-hop server and not necessarily the delivery one. It has fallen into disfavor anyway and many servers are configured to not accept it. SMTPUTF8 (incorrectly known as "EAI") does not change any of that at all; it just allows non-ASCII characters in the address.

It is probably worth noting that web-based systems have, over the years, caused considerable problems with perfectly valid email addresses by claiming they are not valid or by assuming they could be mapped into other strings. Trying to do validity checking on non-ASCII addresses would, most likely, just create another round of those problems.

Bottom line: whether all-ASCII or containing some non-ASCII characters, if a user tells an HTML system that something is an email address, or if an email address reaches it from some other system, it is best to believe them, treat it as an opaque string, and leave questions of validity and deliverability to actual email systems. Interfaces between HTML and email systems should probably be more concerned about how non-delivery notifications and other error reports from the email environment will get back to the HTML user than about how to guess about address validity.

thanks, john

annevk commented 1 year ago

@klensin it sounds like you are suggesting something like:

1. If input does not contain a @, invalid.
2. If input, with leading and trailing whitespace removed, split on the first @ yields at least one empty string, invalid.
3. Valid.
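
For contrast with the stricter proposals, that minimal check fits in a few lines of Python. The function name is made up, and "whitespace" is taken to mean whatever `str.strip()` removes, which is an assumption.

```python
def is_plausible_address(value: str) -> bool:
    """Accept anything with non-empty text on both sides of the first "@"."""
    local, sep, rest = value.strip().partition("@")
    return bool(sep) and bool(local) and bool(rest)
```

Under this rule a quoted local part such as "some@where"@example.net passes, but so does almost any other string containing an "@".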

That's a massive change from the status quo and I'd worry that instead of leading to change it would instead delay progress on this issue for another x years. E.g., whereas previously you might have some expectation that if <input type=email> considered the email address valid it could not cause XSS, that would no longer be guaranteed.

I was instead attempting to describe a set of minimal changes to what user agents do today to accommodate more end users while not breaking the existing expectations around what <input type=email> offers too much, similar to @aphillips' earlier ABNF post. I also attempted to align host handling with URL host handling to some extent, but it seems people are very comfortable with those wildly diverging?

aphillips commented 1 year ago

@annevk I think what you're trying to do is the right thing in general. I don't think that what @klensin is saying would prohibit checking the right hand side to see if it is potentially a valid domain name. This would be what gives value to HTML's users when they use input type=email. The nature of the beast is that, once we allow non-ASCII in, this is not radically different from "3. Valid"

vdukhovni commented 1 year ago

@annevk I think what you're trying to do is the right thing in general. I don't think that what @klensin is saying would prohibit checking the right hand side to see if it is potentially a valid domain name. This would be what gives value to HTML's users when they use input type=email. The nature of the beast is that, once we allow non-ASCII in, this is not radically different from "3. Valid"

Well, there are certainly many exceptions to "3. Valid" in IDNA domain names. Each label of a domain name is either:

An NR-LDH label.
A valid IDNA ACE-prefixed A-label
A valid U-label

Validating that the input labels are in one of these forms is not unreasonable, and there are libraries that can assist in this task (libicu implements UTS46 rather than IDNA2008, but it does accept everything that IDNA2008 accepts).

For example: духовный.org is accepted by libicu, but духов+ный.org is rejected, as are many invalid ASCII forms.
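
The three-way split can be approximated with the Python standard library alone. This is a coarse sketch and not an IDNA2008 or UTS46 implementation: the U-label test only checks that every code point is a letter, digit, combining mark, or hyphen, and the A-label test only checks that the Punycode decodes to something non-ASCII. The function name and return strings are illustrative.

```python
import re
import unicodedata

# Letters, digits, hyphen; no leading or trailing hyphen; at most 63 characters.
LDH = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def classify_label(label: str) -> str:
    if label.isascii():
        if not LDH.fullmatch(label):
            return "invalid"
        if label[2:4] != "--":
            return "NR-LDH"
        if label[:4].lower() != "xn--":
            return "invalid"  # some other reserved "??--" label
        try:
            decoded = label[4:].encode("ascii").decode("punycode")
        except UnicodeError:
            return "invalid"
        return "A-label" if decoded and not decoded.isascii() else "invalid"
    # Non-ASCII: a very rough approximation of what a U-label may contain.
    for ch in label:
        if ch != "-" and unicodedata.category(ch)[0] not in ("L", "N", "M"):
            return "invalid"
    return "U-label"
```

With this sketch the label духовный is classified as a U-label while духов+ный is rejected, matching the libicu behaviour described above for that pair only; a real implementation would defer to a library such as ICU.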

As for local parts, indeed valid "dot-atom" or "quoted-string" forms should be accepted with few limitations.

jrlevine commented 1 year ago

If you're worried about XSS I wouldn't be bothered by rules that rejected strings that look like URLs, since valid mail addresses with strings like :// are vanishingly rare, but I concur with the advice not to try to guess what addresses mail systems might accept, and definitely do not rewrite the address.

klensin commented 1 year ago

--On Friday, October 6, 2023 08:53 -0700 Viktor Dukhovni @.***> wrote:

@annevk I think what you're trying to do is the right thing in general. I don't think that what @klensin is saying would prohibit checking the right hand side to see if it is potentially a valid domain name. This would be what gives value to HTML's users when they use input type=email. The nature of the beast is that, once we allow non-ASCII in, this is not radically different from "3. Valid"

Exactly right. If you want to check the domain-part, by all means do that, but, if you do, please be sure that your checks are consistent with IDNA2008.

I write "domain part" rather than "right hand side" because, in a presentation environment with right to left scripts, especially for labels consisting of mixes between characters from right to left scripts with Indo-Arabic digits, "right hand" may involve additional ambiguity.

Well, there are certainly many exceptions to "3. Valid" in IDNA domain names. Each label of a domain name is either:

An NR-LDH label.

A valid IDNA ACE-prefixed A-label

A valid U-label

or invalid. And there are many ways to be invalid.

Validating that the input labels are in one of these forms is not unreasonable, and there are libraries that can assist in this task (libicu implements UTS46 rather than IDNA2008, but it does accept everything that IDNA2008 accepts).

Yes, as long as everyone involved is aware that there are many labels that IDNA2008 (and many non-web implementations) will reject that UTS46 will not. As long as there are no false negatives (rejecting a valid name), the only major difficulty with false positives (identifying an invalid name as valid) involves users seeing different error messages for what appears to them to be the same error but that, for us, are different problems caught at different points in the system.

For example: духовный.org is accepted by libicu, but духов+ный.org is rejected, as are many invalid ASCII forms.

But that example is where the problems start because some libraries and web forms trying to support entry of email addresses will also reject @.***' which is, syntactically, a valid email address. Such rejections cause real damage, bug reports, etc. As long as the validation testing is confined to the domain part, no problem as long as people do not expect that all invalid domain name strings will be rejected at, e.g., form input time.

As for local parts, indeed valid "dot-atom" or "quoted-string" forms should be accepted with few limitations.

And that is, IMO, the important part, again focusing on avoiding rejection of valid addresses. However, coming back to @annevk's question, he wrote:

@klensin it sounds like you are suggesting something like:

If input does not contain a @, invalid.

If input, with leading and trailing whitespace removed, split on the first @ yields at least one empty string, invalid.

Valid.

That list, especially the second entry, needs to be interpreted with great care or not used at all. For example @.**@. is a valid email address. If some sort of split occurs on the "first '@'", it will almost certainly produce bad effects. Those effects will get even worse because "\ @.**@. is also a valid email address as is "\ @. Worse, there is a long history of various systems stripping quotation marks and not quite getting it right and, in the interest of robustness, of receiver-SMTP systems allowing, e.g., @example.com and treating it as equivalent to "\ @. If someone asked my advice as to whether an address with nothing other than whitespace in the local-part would be wise, my response would be that, in the general case, it would be really stupid (just like treating upper and lower case ASCII characters as different). But it should not be up to HTML and 'type=email' to try to make rules about it. Actually, let me say that a bit more strongly: if the present code prevents any of the examples above (other than bare @.***"), it is a serious bug that should be fixed, not something that should be preserved for compatibility reasons.

best, john
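
The hazard of splitting on the first "@" can be shown with a small Python sketch that splits on the last one instead. It assumes that a domain part never contains "@", so every earlier "@" belongs to the local part; `split_address` is a hypothetical helper, and it deliberately does no whitespace stripping and no rewriting of either half.

```python
from typing import Optional, Tuple


def split_address(value: str) -> Optional[Tuple[str, str]]:
    """Split on the LAST "@"; return None when either side is empty."""
    local, sep, domain = value.rpartition("@")
    if not sep or not local or not domain:
        return None
    return local, domain
```

With this approach a quoted local part that itself contains "@" stays intact, whereas a first-"@" split would cut it in two.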

collinanderson commented 9 months ago

FYI: Regarding the left-hand-side (LHS), Unicode provides guidance in Unicode Technical Standard #39: Unicode Security Mechanisms (UTS39) about what to allow for the "local-part" and atext/utext:

https://www.unicode.org/reports/tr39/#Email_Security_Profiles

The local-part of an email address must satisfy all the following conditions:

It must be in NFKC format

It must have level = <restriction level> or less, from Restriction_Level_Detection

It must not have mixed number systems according to Mixed_Number_Detection

It must satisfy dot-atom-text from RFC 5322 §3.2.3, where atext is extended as follows:

Where C ≤ U+007F, C is defined as in §3.2.3. (That is, C ∈ [!#-'*+-/-9=?A-Z\^-~]. This list copies what is already in §3.2.3, and follows HTML5 for ASCII.)

Where C > U+007F, both of the following conditions are true:

C has Identifier_Status=Allowed from General Security Profile

If C is the first character, it must be XID_Start from Default Identifier_Syntax in [UAX31]

Note that in RFC 5322 §3.2.3:
dot-atom-text   =   1*atext *("." 1*atext)
That is, dots can also occur in the local-part, but not leading, trailing, or two in a row. In more conventional regex syntax, this would be:
 dot-atom-text   =   atext+ ("." atext+)*
Note that bidirectional controls and other format characters are specifically disallowed in the local-part, according to the above.

Unicode doesn't recommend which <restriction level> to use regarding multiple scripts, though Google's email sending guidelines seem to say to use the "Highly Restrictive" one: https://support.google.com/a/answer/81126?hl=en#message-format. I haven't found any guidance from Microsoft or Yahoo regarding local-part restrictions. Microsoft at least has a page on EAI but not much guidance: https://learn.microsoft.com/en-us/globalization/reference/eai
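
As a rough illustration of the structural part of that profile, here is a Python sketch of the NFKC and dot-atom-text conditions. It is not a UTS39 implementation: Identifier_Status, XID_Start, restriction levels and mixed-number detection all need Unicode data that the standard library does not ship, so a general-category filter stands in for them here. The names `is_atext` and `looks_like_profile_local_part` are hypothetical.

```python
import string
import unicodedata

ASCII_ATEXT = set(string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~")

# Stand-in only: the profile asks for Identifier_Status=Allowed. Rejecting
# control, format, surrogate, private-use, unassigned and separator
# characters is a much cruder filter used purely for illustration.
REJECTED_CATEGORIES = {"Cc", "Cf", "Cs", "Co", "Cn", "Zs", "Zl", "Zp"}


def is_atext(ch: str) -> bool:
    if ord(ch) <= 0x7F:
        return ch in ASCII_ATEXT
    return unicodedata.category(ch) not in REJECTED_CATEGORIES


def looks_like_profile_local_part(local: str) -> bool:
    # Condition 1: the string must already be in NFKC.
    if unicodedata.normalize("NFKC", local) != local:
        return False
    # dot-atom-text: non-empty atoms separated by single dots.
    atoms = local.split(".")
    return all(atom and all(is_atext(ch) for ch in atom) for atom in atoms)
```

Because bidirectional controls have general category Cf, the stand-in filter also excludes them, in line with the note quoted above.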

collinanderson commented 9 months ago

Also, FYI, there's an ICANN-sponsored organization called Universal Acceptance Steering Group (https://uasg.tech/) for "Working to make all valid domain names and email addresses work in all Internet-enabled applications, devices, and systems".

They have $6,000 USD budgeted for Fiscal-Year 2024 specifically for "M5: Analyze impact of the use of HTML5 email field: Collect data and identify how to address HTML5 email field for accepting globally inclusive email addresses as identifiers".

They clearly want html to allow unicode email addresses, but in my opinion don't provide clear enough recommendations for validation.

jrlevine commented 9 months ago

Yes, I know, I was in the UASG calls that put the proposal together. I told them that it's a hard problem and that the approaches so far, WHATWG's current pattern and the W3C's attempted fix that allows any UTF-8 are both clearly wrong.

A big problem is that there are so few mail systems assigning UTF-8 addresses that there's no common practice to work from.

klensin commented 9 months ago

Collin, a few observations. First, the standard for "internationalized" (i.e. non-ASCII) email addresses is RFCs 6530-6533. If Unicode's recommendations are different from that, one is looking for interoperability problems in which perfectly valid addresses don't work.

That said, it is easy to feel that what the standard allows is more permissive than usually makes good sense. That was deliberate: the WG felt that, for email local parts, there should be maximum flexibility to use addresses that are well adapted to local circumstances. That flexibility is particularly important when, for example, email local parts reflect people's names but some names are not written in the same way (sometimes not even in the same characters) as is normal for words or other strings in the local language.

The general principle for email since the 1970s -- long before non-ASCII local parts became the subject of standardization -- is that operators (or their management or decision makers) of systems that receive mail, store it, and make it available to users make the decisions about what their systems allow (at least within the boundaries of what is permitted by the standards), and that message senders, and systems closer to those that support the human message originator, don't try to guess what those destination systems will accept, because any two destination systems might have quite different internal rules.

So, if someone operating a mail system that receives mail and supports mailboxes came to me and asked whether the Unicode rules you cite would be reasonable guidance for giving out mailbox names, my answer might well be "yes", at least as long as those rules do not conflict with established local usage.

But, at least for the overwhelming number of cases, that has nothing to do with what HTML should consider a valid email address. If, for example, there is a form that requires a user to enter an email address or a page that specifies an address as part of contact information, those addresses are whatever they are and HTML should not try to guess what they should (or might) have been in some alternative universe, much less decide that something allowed by the standard is invalid because someone finds it distasteful.

Fwiw, I agree with John Levine about few systems and an absence of clear common practice. At the same time, I know mail providers who claim, in the aggregate, millions of non-ASCII addresses allocated, in use, and working well within their countries. And, beyond a certain point and for the reasons above, it does not make much difference. If a provider with even only a few thousand customers allocates addresses that conform to the standard, and HTML effectively blocks the use of those addresses, it would be really bad for all concerned.

nicowilliams commented 9 months ago

On Thu, Feb 15, 2024 at 09:48:41AM -0800, Collin Anderson wrote:

FYI: Regarding the left-hand-side (LHS), Unicode provides guidance in Unicode Technical Standard #39: Unicode Security Mechanisms (UTS39) about what to allow for the "local-part" and atext/utext:

https://www.unicode.org/reports/tr39/#Email_Security_Profiles

The local-part of an email address must satisfy all the following conditions:

There's more unquoted context that this is to apply only at local-part mailbox registration time, in browsers when "linkifying" a mailto:, and in MUAs when displaying addresses (e.g., sender, cc). TR39 does not forbid local-parts that do not meet this profile, it just makes those harder to use.

Nothing in TR39 keeps local-parts that don't meet this profile from working, it's just that browsers and MUAs (and terminal emulators, and anything that can linkify e-mail addresses) may not linkify them.

It must be in NFKC format

Oh, I suppose this makes sense, because some software might not perform any normalization when comparing local parts (e.g., to link an email address with a "contact" record in an application), but I suspect it's really not necessary.

Unicode doesn't recommend which <restriction level> to use regarding multiple scripts, though Google's email sending guidelines seem to say to use the "Highly Restrictive" one: https://support.google.com/a/answer/81126?hl=en. I haven't found any guidance from Microsoft or Yahoo regarding local-part restrictions. Microsoft at least has a page on EAI but not much guidance: https://learn.microsoft.com/en-us/globalization/reference/eai

Receiving systems shouldn't reject e-mail with sender (or cc'ed) addresses whose local-parts don't meet TR39.

IMO,

Nico --

gene-hightower commented 8 months ago

I strongly believe that the standards for email address syntax that should be used are RFC-5321 (+RFC-6531 to add Unicode) -- NOT RFC-5322 (+RFC-6532) as referred to in the HTML spec. See the "willful violation of RFC 5322" note in section 4.10.5.1.5 of the whatwg HTML spec.

The production we should be looking at is originally defined in Section 4.1.2 of RFC-5321, and later extended in section 3.3 of RFC-6531 to include UTF-8 characters.

This is the syntax widely used by Message Transfer Agents to move mail throughout the modern Internet.

I believe this is what most people think of as an "email address."

The grammars given in the standards RFC-5322 and RFC-6532 are about what can appear INSIDE the message contents. These grammars seem to cause confusion about what is and is not a valid email address.

The grammar in RFC-5321 is a regular language (type-3 in the Chomsky hierarchy) so, in principle, can be parsed by a regular expression (i.e. finite-state machine). The grammar in RFC-5322 3.4. "Address Specification" is recursive (type-0) causing implementers much grief if they don't appreciate that fact.

As to how address validation should be done, I agree with most of the email experts (vdukhovni, klensin) who have commented: it should be done exactly or not (very much) at all. I find it especially egregious to reject valid addresses based on "common practice" or the use of "simplified regexes" that implementers find easier to code up.
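
To make the "regular language" point concrete, here is a Python sketch of a regular expression for the Local-part production (Dot-string or Quoted-string) of RFC 5321 Section 4.1.2, with atext and qtextSMTP widened to non-ASCII characters in the spirit of RFC 6531 Section 3.3. The constant and function names are hypothetical, the 64-octet length limit is not checked, and the pattern is transcribed from the grammar by hand, so treat it as an illustration rather than a reference implementation.

```python
import re

# Non-ASCII Unicode scalar values (surrogates excluded).
NON_ASCII = r"\u0080-\uD7FF\uE000-\U0010FFFF"

# atext: letters, digits and the printable specials allowed in an Atom.
ATEXT = r"[A-Za-z0-9!#$%&'*+\-/=?^_`{|}~" + NON_ASCII + "]"
# qtextSMTP: printable ASCII except the double quote and the backslash.
QTEXT = r"[\x20\x21\x23-\x5B\x5D-\x7E" + NON_ASCII + "]"
# quoted-pairSMTP: a backslash followed by any printable ASCII character.
QUOTED_PAIR = r"\\[\x20-\x7E]"

DOT_STRING = ATEXT + r"+(?:\." + ATEXT + r"+)*"
QUOTED_STRING = '"(?:' + QTEXT + "|" + QUOTED_PAIR + ')*"'
LOCAL_PART = re.compile("(?:" + DOT_STRING + "|" + QUOTED_STRING + ")")


def is_smtp_local_part(local: str) -> bool:
    """True if the string matches Dot-string or Quoted-string as a whole."""
    return LOCAL_PART.fullmatch(local) is not None
```

Note how consecutive dots are rejected in the unquoted form but accepted inside a quoted string, which is exactly the distinction a "simplified regex" tends to lose.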

klensin commented 8 months ago

Gene,

--On Sunday, March 10, 2024 11:27 -0700 Gene Hightower @.***> wrote:

I strongly believe that the standards for email address syntax that should be used are RFC-5321 (+RFC-6531 to add Unicode) -- NOT RFC-5322 (+RFC-6532) as referred to in the HTML spec. See the "willful violation of RFC 5322" note in section 4.10.5.1.5 of the whatwg HTML spec.

I have felt strongly that way for a few decades now (with RFC 2821 before 5321 came along) and not just because of my connection to those documents. I've been reluctant to take as strong a position as you just did above because, when I do so in the IETF, I am routinely either ignored or overwhelmed with noise. We have tried very hard to harmonize the two specs where they intersect -- rfc5321bis and rfc5322bis, now nearing completion, are closer together than 5321 and 5322 -- but two key sources of difference remain. The obvious one is that the header specs (822/2822/5322/6532/5322bis) support the "name phrase" (sometimes called other things) which has nothing to do with the mailbox address and the transport/envelope specs (821/2821/5321/6531/5321bis) know nothing about it. The other is that the header specs are more permissive in some ways because they assume mail might be handled by transports or other processing other than SMTP.

One caution about 6531/6532 is that both depend heavily on 6530 for context, definitions, etc. Trying to read either without some familiarity with 6530 is an invitation to trouble.

The production we should be looking at is originally defined in Section 4.1.2 of RFC-5321, and later extended in section 3.3 of RFC-6531 to include UTF-8 characters.

This is the syntax widely used by Message Transfer Agents to move mail throughout the modern Internet.

Yes.

I believe this is what most people think of as an "email address."

Also yes although, again, that can result in considerable dancing around in the IETF community, IMO largely due to some politics and personality issues that go back to the early 1980s.

The grammars given in the standards RFC-5322 and RFC-6532 are about what can appear INSIDE the message contents. These grammars seem to cause confusion about what is and is not a valid email address.

Exactly. See above.

The grammar in RFC-5321 is a regular language (type-3 in the Chomsky hierarchy) so, in principle, can be parsed by a regular expression (i.e. finite-state machine). The grammar in RFC-5322 3.4. "Address Specification" is recursive (type-0) causing implementers much grief if they don't appreciate that fact.

As to how address validation should be done, I agree with most of the email experts (vdukhovni, klensin) who have commented: it should be done exactly or not (very much) at all. I find it especially egregious to reject valid addresses based on "common practice" or the use of "simplified regexes" that implementers find easier to code up.

Agreed (perhaps obviously). The latter are notorious for rejecting widely used addresses in HTML forms, apparently because certain characters ("+" is notorious) have specific meanings in other HTML contexts.

Thanks very much for your perspective and analysis.

--john


hsivonen commented 7 months ago

The production we should be looking at is originally defined in Section 4.1.2 of RFC-5321, and later extended in section 3.3 of RFC-6531 to include UTF-8 characters.

This is the syntax widely used by Message Transfer Agents to move mail throughout the modern Internet.

I believe this is what most people think of as an "email address."

The comment from @annevk upthread indicates that the exclusion of Quoted-string is deliberate, and changing that would be a distinct issue from this one.

I suggest that for the local part in this issue we focus on what the HTML spec should say about Dot-string.

Currently, for the local part, the HTML spec allows any non-empty sequence of atext characters or the dot, but RFC 5321 prohibits leading and trailing dots as well as consecutive dots.

I suspect that not rejecting leading, trailing, and consecutive dots is currently intentional. Is it?

RFC 6531 extends atext with all non-ASCII Unicode scalar values (i.e. excluding surrogates). RFC 6532 says that NFC SHOULD be used.

AFAICT, for the local part, we need to answer these questions:

Is it the job of input type=email to reject leading, trailing, and consecutive dots in the local part?
Is it the job of input type=email to restrict the characters along the lines of UTS 39? (Probably not, since it would mean that a user whose provider has failed to enforce UTS 39 at registration time could not enter their address into forms. Personally, I think not allowing characters that are part of living scripts but categorized as Limited_Use is the most problematic aspect of UTS 39, but I haven't discussed the issue with the authors of UTS 39.)
Is it the job of input type=email to normalize the local part to NFC for submission or require it to be in NFC for constraint validation? (The Web Platform in general does not normalize, but input type=email already involves normalization for the part after the @ in Firefox and Chrome (in Firefox only for constraint validation but in Chrome also for submission; the current spec text seems to support what Chrome does). NFC as constraint validation seems rather unhelpful compared to normalizing to NFC for submission.)

Because, from observation, some (perhaps many or most) browsers look to UTS #46 for authority in interpreting domain names in, e.g., URLs while most or all SMTPUTF8 implementations (incorrectly, but commonly, known as "EAI") are strictly conformant to IDNA2008, the differences between the two introduce additional complications.

All three major Web engines use UTS 46 with _TransitionalProcessing=false for handling domains in URLs. Chrome uses _TransitionalProcessing=true in input type=email, but surely that’s an oversight instead of being intentional.

What additional complications are you referring to? As I understand it, with _TransitionalProcessing=false UTS 46 has the following characteristics:

All domain names that IDNA 2008 permits the user to enter are valid and resolve the same way as in IDNA 2008.
Some inputs that IDNA 2008 prohibits are transformed into an IDNA 2008-permitted form according to IDNA 2003 (and pre-IDNA) principles. Notably, upper-case input is permitted and becomes lower-case.
Some inputs result in IDNA 2008-prohibited outputs but do so according to IDNA 2003 principles. Notably, various symbols are allowed by the algorithm though may be prohibited by registry policy from being registered.

Am I missing something?

I think it doesn’t make sense for Web engines to use something other than UTS 46 with _TransitionalProcessing=false for input type=email. It would be excessive to require Web engines to carry extra data for just one form field type.

I think having different ASCII constraints for domains in input type=email than for domains in URL would be OK. Notably, both spec-wise and implementation-wise right now, input type=email is stricter than URL. URL seeks to accommodate non-DNS naming systems (notably NetBIOS, though NetBIOS itself has converged towards STD3 ASCII deny list over the years), but I think it’s reasonable to take the position of assuming DNS naming for email.

Specifically, currently the spec says to enforce the DNS label length limit (but not the total length limit!), to enforce STD3 hyphen rules (but not the later reservation of hyphens in the third and fourth position!), to enforce the STD3 ASCII deny list, and to deny empty labels.

I think for the part after the last @, we need to answer these questions:

Shall IPv6 addresses be allowed? (These are currently prohibited.)
Shall IPv4 addresses be allowed? (These are currently syntactically allowed.)
Shall names that raise an error in UTS 46 processing with CheckBidi=true, CheckJoiners=true, _TransitionalProcessing=false, IgnoreInvalidPunycode=false fail constraint validation? (I think it’s easy to say that the answer here should be yes.)
What shall be submitted? (I think the least-risky answer is to say the UTS 46 ToASCII form with the above flags. Chrome already submits the ToASCII form and Safari only allows ASCII to be submitted. Firefox runs the constraint validation on the ToASCII form but submits what the user entered. Thus, it’s possible that there exist sites that work in Firefox only if the user enters the ASCII form.)
What ASCII deny list shall be used? (The spec currently amounts to saying that the STD3 ASCII deny list should be used. Currently all three browsers appear to enforce this, so the easy answer is that the STD3 ASCII deny list should be used.)
Shall empty labels be disallowed? (From the spec and the three major engines, the answer is clearly yes.)
Shall the DNS maximum label length limit be enforced (as computed from the ASCII form)? (From the spec and the three major engines, the answer is clearly yes.)
Should the DNS maximum total name length limit be enforced? (The spec says no, but that’s weird in the light of the previous point.)
In what positions shall the hyphen be rejected? (The spec says in the positions prohibited by STD3, i.e. first and last position in a label. This is the middle ground between UTS 46 CheckHyphens=true and CheckHyphens=false.)
What shall .value return when constraint validation fails? (Current engine practice says to return the user-entered value.)
What shall .value return when constraint validation succeeds? (Seems reasonable to return the same form that would be submitted.)
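
The two length questions in the list above can be illustrated with a Python sketch that measures lengths on an ASCII form. Everything here is an assumption for illustration: `rough_to_ascii` is a hypothetical stand-in that only applies NFC, lowercasing and Punycode, and it is not UTS 46 ToASCII (no mapping table, no CheckBidi or CheckJoiners, no validity checks). The limits used are the usual DNS figures of 63 characters per label and 253 for the whole name in dotted form.

```python
import unicodedata

MAX_LABEL = 63
MAX_NAME = 253


def rough_to_ascii(label: str) -> str:
    """Stand-in for a real ToASCII step: NFC, lowercase, Punycode only."""
    label = unicodedata.normalize("NFC", label).lower()
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def check_dns_lengths(domain: str, enforce_total: bool = True) -> bool:
    """Check label lengths, and optionally the total length, on the ASCII form."""
    labels = [rough_to_ascii(part) for part in domain.split(".")]
    if any(not label or len(label) > MAX_LABEL for label in labels):
        return False  # empty label or over-long label
    if enforce_total and len(".".join(labels)) > MAX_NAME:
        return False
    return True
```

Passing `enforce_total=False` corresponds to what the spec is described as saying today: per-label limit only, which lets through names that no DNS lookup could resolve.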

(I observe that currently the ToUnicode operation isn't exposed to the Web anywhere in the Web Platform. Additionally, I observe that it's probably always a bug to invoke the unmodified ToUnicode operation in a browser implementation. Instead, non-Web-exposed parts of a browser should invoke a variant that on a per label basis decides between the Unicode and Punycode form based on security policy.)

hsivonen commented 7 months ago

One more thing: Since RFC 6531 excludes surrogates but does not exclude the REPLACEMENT CHARACTER, it would be prudent for the HTML spec to specifically call this out. At least in Gecko, it would be easy to end up with a bug around this detail.
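
A Python sketch of the kind of explicit check this suggests is below. Whether U+FFFD should actually fail validation is a policy decision the thread has not settled, so the helper (a hypothetical name) only reports the code points and leaves the decision to the caller.

```python
def suspicious_code_points(local_part: str) -> list:
    """List surrogates and U+FFFD found in a local part, in order."""
    found = []
    for ch in local_part:
        cp = ord(ch)
        # Surrogates are not Unicode scalar values; U+FFFD usually signals
        # an earlier decoding failure rather than intended content.
        if 0xD800 <= cp <= 0xDFFF or cp == 0xFFFD:
            found.append(f"U+{cp:04X}")
    return found
```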

gene-hightower commented 7 months ago

I suspect that not rejecting leading, trailing, and consecutive dots is currently intentional. Is it?

As with the exclusion of Quoted-string, it was deliberate but misguided.

The inclusion of the notice of “willful violation of RFC 5322” without any mention of RFC 5321 suggests to me that the email address syntax defined by the SMTP standard (the ‘Mailbox’ grammar rule from Section 4.1.2 of RFC-5321) was not considered and rejected, but simply overlooked.

Of course, only the authors of the whatwg.org spec could confirm that suspicion.

jrlevine commented 7 months ago

The basic problem here is that no real mail system allows the full range of addresses that the RFCs allow, and the people who wrote this part of the HTML spec believe that they know better than the RFC authors what addresses are likely to be valid. I'm reasonably sure they're wrong, but that's neither here nor there. There are still not many systems that issue UTF-8 addresses and neither I nor anyone else can tell you what they're likely to allow in local parts, nor even if they normalize or expect the client to do it. I think we can make a few basic assumptions like no sane system uses REPLACEMENT, but all the code points in whatever scripts the mail systems handle are fair game. (I also have never seen a real mail system with addresses that have two dots in a row. Who needs the grief?)

hsivonen commented 6 months ago

I suspect that not rejecting leading, trailing, and consecutive dots is currently intentional. Is it?

As with the exclusion of Quoted-string, it was deliberate but misguided.

Whether making input type=email not enforce dot placement is misguided depends on the degree to which other systems actually enforce dot placement and the degree to which people out there have addresses with leading, trailing, or consecutive dots.

Hixie introduced the current formulation on 2009-08-31 without the commit message explaining why. I don't see a clue about why in the IRC logs from that day, the day before, or the day after. I don't see an explanation in the public-html archive, either.

The inclusion of the notice of “willful violation of RFC 5322” without any mention of RFC 5321 suggests to me that the email address syntax defined by the SMTP standard (the ‘Mailbox’ grammar rule from Section 4.1.2 of RFC-5321) was not considered and rejected, but simply overlooked.

Looking at https://searchfox.org/whatwg-html/rev/e4de6f65b52c7198a4e08aa8aecd1110faa03093/source#27034 , the exclusion of CFWS and FWS was already in place earlier.

I don't know how the spec ended up referring to RFC 2822 and excluding CFWS and FWS instead of referring to RFC 2821 to begin with. The first two digits of the RFC reference were updated in response to https://www.w3.org/Bugs/Public/show_bug.cgi?id=6300 , and even the more active IETF participant who commented there didn't point out the 22 vs. 21 issue.

nor even if they normalize or expect the client to do it.

It's pretty clear that if normalization isn't taking place on some layer, some users trying to use Vietnamese in email local parts are going to have a bad time.

Email addresses are supposed to have the property that if you see one (in your language/script), you can reproduce the address using text input (as opposed to copying and pasting an already-digital form). As far as I am aware (and I've made an effort to become aware), text input methods in actual use produce NFC with one exception: The de jure Vietnamese keyboard layout.

Most people writing Vietnamese use a telex IME and those produce NFC. The de jure keyboard layout produces unnormalized text (neither NFC nor NFD).

So if no layer normalizes and you are in the minority using the de jure keyboard layout, you won't be able to send email to NFC addresses. If you are minting a new address with the de jure keyboard layout and the email provider doesn't normalize, most people won't be able to send email to you.
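
A short Python sketch shows how a string can be neither NFC nor NFD. The input sequence is an assumed illustration (a precomposed vowel followed by a separate combining tone mark) and is not taken from any keyboard specification; `normalization_forms` is a hypothetical helper.

```python
import unicodedata


def code_points(text: str) -> str:
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def normalization_forms(text: str) -> dict:
    """Return the code points of the text as typed, in NFC, and in NFD."""
    return {
        "typed": code_points(text),
        "NFC": code_points(unicodedata.normalize("NFC", text)),
        "NFD": code_points(unicodedata.normalize("NFD", text)),
    }


# Assumed input: E WITH CIRCUMFLEX followed by COMBINING DOT BELOW.
forms = normalization_forms("\u00ea\u0323")
```

All three entries differ, so a byte-for-byte comparison against a mailbox stored in NFC fails unless some layer normalizes first.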

To the extent people use email providers from outside their sphere of locale expertise, it's not enough to say that Vietnamese providers will know about this issue.

Not sure what conclusion should be drawn for input type=email, but it seems pretty bad that NFC isn't on the MUST level in the RFC.

I think we can make a few basic assumptions like no sane system uses REPLACEMENT

The RFC doesn't make this basic assumption, even though the REPLACEMENT CHARACTER generally is a signal of a conversion failure somewhere.

(I also have never seen a real mail system with addresses that have two dots in a row. Who needs the grief?)

I don't recall seeing Quoted-string addresses in the wild. In general, existence proof anecdata is stronger than absence anecdata.

jrlevine commented 6 months ago

Email addresses are supposed to have the property that if you see one (in your language/script), you can reproduce the address using text input (as opposed to copying and pasting an already-digital form). As far as I am aware (and I've made an effort to become aware), text input methods in actual use produce NFC with one exception: The de jure Vietnamese keyboard layout.

While that is a reasonable thing to ask, that's not what the RFCs say, and there just aren't enough systems with EAI mail addresses that you can expect them to do that. This is different from the question about dots where they really aren't likely to work. I agree that quoted strings are rare enough that it would be OK to exclude them, so long as there's a comment somewhere making it clear that's because they're rare, not because we think the spec is wrong.

klensin commented 6 months ago

Let me reluctantly step in on this from the perspective of an active user of email since before the ARPANET version was running over FTP, the editor of one of the two core email standards, and former chair of the EAI WG that produced the SMTPUTF8 specs...

TL;DR summary: A mail-sending system, or one storing email addresses for future use or equivalent, cannot guess at decisions made about mailbox naming in recipient systems. In the mail specs themselves, SHOULD is a strong recommendation, not a requirement, and, where mailbox naming is concerned, is a recommendation for the systems hosting the mailboxes about how they should and should not be named. For mail-sending or address-storing systems, the Do No Harm principle should apply and identifying addresses of legitimate and in-use mailboxes as invalid is definitely harmful.

         ===========

Email addresses have been, and are, used for many purposes around the world, some of which create funny-looking results. Some applications incorporate strings drawn from hash functions into local-parts, so joe+Eo6Sr8skf0hGuHPW@example.com would not be particularly surprising even though it is ugly (and mixed case -- see below). To the best of our ability, the standards have been written to be quite careful to avoid prohibiting strings that might make sense in special applications or contexts we can't see. That is different from giving advice to those who operate email servers and assign (or allow) particular mailbox names. The rule about NFC is an example of this: the text in RFC 6532 says "normalization form NFC [UNF] SHOULD be used...". That rule, and the much stronger one for IDNs in Section 4.1 of RFC 5891, exist because we have had considerable experience with strings that are not NFC-compliant being the source of problems with, e.g., users looking at the printed forms of two strings and deciding they are the same. And that is in spite of the context in which The Unicode Standard defines NFC (and the other normalization forms), which is that they are intended to be used in comparison operations, not as references for acceptable string formats.

That brings us back to the Vietnamese example. First, it is not the only writing system, even when one looks only at Latin script, that leads to odd cases when NFC is used as the measure of string correctness and/or differs from typical local input methods: the notorious dotless "i" and some issues with the various forms that can lead to a circular glyph (e.g., "o" but also "0" for selected type styles) with a diagonal bar across it (e.g., "/" and "ø") are other examples. So, the advice to someone constructing an email server and deciding what addresses to assign or allow is that, if circumstances and local conditions make one or more of those "SHOULD" conditions inappropriate or especially important -- with cases like the Vietnamese language one just one example of many-- they should do what seems best given those conditions while remaining aware that doing so might cause issues with some system somewhere in the world.

There is a useful (and very common) analogy to the Vietnamese situation as described in the posting above. RFCs 5321 and 5322 are very explicit that local parts of email addresses are case sensitive. If a system sending mail or one storing an address for local use decides, for example, that mixed-case strings are ugly and therefore transforms "JohnDoe@example.com" as provided by a user, into "johndoe@example.com", they risk making a message sent to the latter address instead of the specified one undeliverable or, worse, routing it to the wrong mailbox. At the same time, I hope the specs are clear by now that a receiving system that takes advantage of the case sensitivity rule so that, e.g., johndoe@example.com, JohnDoe@example.com, and JOHNDOE@example.com go to different mailboxes should either have special reasons for doing that or is being dumb. If there are no special reasons, the few lines of code necessary to map all of the case variations on the local part are worth it and, unsurprisingly, those few lines of code are part of most modern email delivery systems and applied by default. For the Vietnamese case, I'd expect NFC to be applied by the delivery system associated with the mailbox and either to have the actual mailbox name be in NFC form or to apply NFC to both the mailbox name and the local-part of the incoming message to see if they match. On the other hand, if some mail administrator with mailboxes in Vietnamese wants to make it difficult for someone who does not have a locally-normal Vietnamese input device to send mail to a particular mailbox and decides to only allow the non-NFC form, the standards deliberately do not tell them they cannot do that and we shouldn't either.
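
The delivery-side matching described here might look like the following Python sketch. It belongs on the receiving system only, since that is the one party that knows its own mailbox policy; a sender or an HTML form applying the same folding would be guessing. `mailbox_key` and `find_mailbox` are hypothetical names, and case folding plus NFC is just one policy a receiver could choose.

```python
import unicodedata
from typing import Iterable, Optional


def mailbox_key(local_part: str) -> str:
    """Comparison key: case-folded, then normalized to NFC."""
    return unicodedata.normalize("NFC", local_part.casefold())


def find_mailbox(local_part: str, mailboxes: Iterable[str]) -> Optional[str]:
    """Return the stored mailbox name matching the incoming local part."""
    wanted = mailbox_key(local_part)
    for name in mailboxes:
        # Normalize both sides so stored names need not be in NFC themselves.
        if mailbox_key(name) == wanted:
            return name
    return None
```

The key is used only for comparison; the mailbox name itself is returned exactly as the operator stored it.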

Specifically, for something like HTML (or, for that matter, a non-web MUA that is being used to create a message to be sent), trying to impose restrictions on syntax beyond rejecting clear, MUST-level violations of the standard amounts to deciding not only that we know better than the people operating the receiving mail systems what might be appropriate there, but that we can read their minds about the addresses they might allocate or allow (and perhaps why). That amounts to telling a user who "owns" a perfectly good email address, in many cases one that originates and receives mail messages every day, that they are not allowed to use it because someone is going through, or storing information via, a web interface that uses HTML and type=email won't allow it. That just does not seem right to me, and would probably seem even less right to such email users and their correspondents.

To be clear, this is not just about the NFC "restriction" or about case sensitivity. The same reasoning applies to quoted strings and dot-string. Whatever is done with HTML type=email, it should not get in the way of using legitimate email addresses even if we, or even the relevant RFCs, find those addresses or associated practices distasteful.

Finally, let me draw on personal experience going back years but showing up most recently this week, to illustrate the problem past practices in this area have gotten us into. There are informal conventions about the use of "subaddresses" in email. Subaddresses are email addresses with perfectly valid local parts that, in the delivery system, are parsed and treated in some special way, e.g., to filter or organize messages. The most common syntax for them, predating the web by many years, uses a "+" sign as a delimiter, so addresses with syntax like tom+friend@example.com and tom+business@example.com are common. While one might guess, one cannot tell from the outside whether those are subaddress constructions or what might get done to them, nor anything else about how the delivery system might handle them. However, we have systems all over the Internet, most of them HTML-based, that act as sending systems or that store email addresses in databases, and that either treat those constructions as invalid (because they don't like "+" or give it a special interpretation) or decide the "+" can be dropped so that tom+friend@example.com is mapped to tomfriend@example.com. The mail server at example.com might be programmed to assume those constructions are equivalent, but that would be fairly rare. For everyone else, such transformations on the sender side get perfectly valid addresses treated as invalid or messed up to the point of making them undeliverable (or causing delivery to the wrong mailbox).
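
To make the division of labor concrete, here is a small Python sketch of what a delivery system that has chosen the "+" convention might do. The function name `split_subaddress` and its `delimiter` parameter are hypothetical, and whether a given server uses this convention at all is an assumption. Nothing like this belongs in a sending system or a web form, which should keep the address string unchanged.

```python
def split_subaddress(local_part: str, delimiter: str = "+"):
    """Delivery-side only: split a local part at the first delimiter.

    Returns (base, detail); detail is None when no delimiter is present.
    """
    base, sep, detail = local_part.partition(delimiter)
    return (base, detail if sep else None)
```

Under that assumed convention, `split_subaddress("tom+friend")` yields `("tom", "friend")`, which the server could use to file the message, while a sender that rewrites the address to tomfriend has simply produced a different local part.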

My specific example this week involved registering a warranty and setting up an account with a fairly major hardware vendor, using one of those addresses, one that looked like me+vendorName@example.com. All appeared to work fine until I had occasion to log into that account, at which point I discovered that there was no such account. It turned out the account lookup system couldn't handle the "+" in spite of the fact that the account creation system could. Why? Because one was handling the email address as a plain text string and the other was using type=email. We really need to stop doing things like that to ourselves and to users.

gene-hightower commented 6 months ago

The basic problem here is that no real mail system allows the full range of addresses that the RFCs allow

While it may be true that “no real mail system” will issue mailbox addresses spanning the full range of allowed syntax, most mail systems support sending to, and receiving from, any email address conforming to the syntax defined by the ‘Mailbox’ grammar rule from Section 4.1.2 of RFC-5321. (And increasingly the extensions in Section 3.3 of RFC-6531.)
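
For reference, a rough Python transcription of that grammar might look like the sketch below. It is an approximation under stated assumptions, not a definitive validator: it covers only the ASCII `Dot-string` and `Quoted-string` local parts and the `Domain` form, and it leaves out address literals, the length limits, and the RFC 6531 extensions for non-ASCII characters. The name `is_rfc5321_mailbox` is made up for this example.

```python
import re

# atext (RFC 5322): letters, digits, and the listed printable specials.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
# Dot-string = Atom *("." Atom)
DOT_STRING = rf"{ATEXT}+(?:\.{ATEXT}+)*"
# Quoted-string = DQUOTE *(qtextSMTP / quoted-pairSMTP) DQUOTE
QUOTED_STRING = r'"(?:[\x20-\x21\x23-\x5b\x5d-\x7e]|\\[\x20-\x7e])*"'
# sub-domain = Let-dig [Ldh-str]
SUB_DOMAIN = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
# Domain = sub-domain *("." sub-domain)
DOMAIN = rf"{SUB_DOMAIN}(?:\.{SUB_DOMAIN})*"

MAILBOX = re.compile(rf"(?:{DOT_STRING}|{QUOTED_STRING})@{DOMAIN}")


def is_rfc5321_mailbox(address: str) -> bool:
    """True if address matches the ASCII subset of the Mailbox rule sketched above."""
    return MAILBOX.fullmatch(address) is not None
```

This accepts me+vendorName@example.com and "...."@example.com and rejects ....@example.com. It says nothing about whether the mailbox actually exists.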

We have a perfectly reasonable standard here, can we please just use it?

klensin commented 6 months ago

Gene, see my over-long note above but, basically, yes. With the same conclusion and, IMO, even more important: if an actual email address (aka "mailbox") exists and the syntax of its name conforms to the requirements of the standards, systems sending to, receiving from, or trying to save or otherwise process that address should not be getting in the way by declaring it invalid because of some made-up notion of what constitutes a valid and/or good address.

I think we are in violent agreement.

gene-hightower commented 6 months ago

I don't know how the spec ended up referring to RFC 2822 and excluding CFWS and FWS instead of referring to RFC 2821 to begin with.

One potential culprit is https://en.wikipedia.org/wiki/Email_address which has referred to RFC-2822/5322 as the relevant standard defining the syntax of "email addresses" ever since its initial version in 2003. I have been trying to get the article changed; please comment in the talk section to help.

Of course, we all know that Wikipedia should not be taken as authoritative, but it has outsized influence.

aphillips commented 6 months ago

I18N discussed this earlier today (2024-05-02) in our teleconference (note that @klensin participated in that call). Our group position (and mine personally) is in rough agreement with John's comment. HTML is not a mail agent and the input type=email is not always used for something email specific. Browsers aren't really in a position to do that much validation of the local part (they can do more with the domain name, obviously). With that in mind, not long ago in a WHATNOT call I took the action to update #5799 (which is meant to fix this issue). Is it possible that we could focus on what changes are needful there?
