whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/

Validating internationalized mail addresses in <input type="email"> #4562

Open jrlevine opened 5 years ago

jrlevine commented 5 years ago

This is more or less the same issue as https://www.w3.org/Bugs/Public/show_bug.cgi?id=15489 but I think it's worth another look since a lot of things have changed.

The issue is that the e-mail address validation pattern in sec 4.10.5.1.5 only accepts ASCII addresses, not EAI addresses. Since last time, large hosted mail systems including Gmail, Hotmail/Outlook, Yahoo/AOL (soon if not yet), and Coremail handle EAI mail. On smaller systems Postfix and Exim have EAI support enabled by a configuration flag.

On the other side, writing a JavaScript pattern to validate EAI addresses has gotten a lot easier since JS now has Unicode character class patterns like /(\p{L}|\p{N})+/u, which matches a string of characters that Unicode classifies as letters or digits.
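
A minimal sketch of how such a pattern behaves. The anchors (`^`, `$`) and the variable name are additions for illustration, not part of any proposed validation rule:

```javascript
// Unicode property escapes such as \p{L} only work with the `u` flag.
const lettersAndDigits = /^(\p{L}|\p{N})+$/u;

console.log(lettersAndDigits.test("пример1"));   // Cyrillic letters plus a digit
console.log(lettersAndDigits.test("user name")); // contains a space, so no match
```

Scripts that rely on combining marks would also need `\p{M}` in the class, so this is a starting point rather than a complete local-part pattern.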

Last time around the consensus seemed to be that EAI input fields should be marked as unicode or eai or the like, since it'll be a while before all mail systems handle EAI.

For the avoidance of doubt, when I say EAI, I mean both Unicode local parts and Unicode domain names, since that's what EAI mail systems handle. There is no benefit to translating IDNs to A-labels (the ones with punycode) since that's all handled deep inside the mail system.

josepharhar commented 3 years ago

Hey, coming here from this chrome bug. If I understand correctly, this means that we would send the email user@ß.com to the server as user@ß.com instead of the punycoded version user@ss.com like we do today, and we would also allow ß@ß.com to pass validation and send it as ß@ß.com. After reading the concern in this comment, I have a hard time believing that we wouldn't break some servers somewhere. Just because mail servers tend to accept more unicode doesn't mean that every mail server everywhere does now, right?

rutek commented 3 years ago

@josepharhar I agree that some servers can break (old ones, but e.g. in Poland the most popular e-mail providers are ... not working as they should), but please remember that we are still talking about client-side e-mail field validation.

RFC 6532 was not supported for a long time in many software apps (e.g. Thunderbird does really strange things when it receives non-encoded UTF-8 mail compliant with RFC 6532; it's still open in Bugzilla), but up-to-date mail servers allow creating such accounts and sending such mail (Postfix has had support for it since ~2015). It's a complex problem, as e.g. delivery of UTF-8 mail to an old mailbox can lead to some problems, but what else can we do other than progressively upgrade the technologies in use to support it? :)

Anyway, I don't think that it's the browser's responsibility to "protect backend from problematic e-mail addresses", so if the RFC allows it and up-to-date software supports it, we should allow it.

jrlevine commented 3 years ago

It's more complex than that, and it's not about ß which is an odd special case.

EAI (internationalized) mail can handle addresses like пример@Бориса.РФ. While the domain part can turn into ASCII A-labels xn--80abvxkh.xn--p1ai (sometimes called punycode), the mailbox cannot, and only an EAI mail system can handle that address. Common MTAs like postfix and exim have EAI support but it's not turned on by default, and there is no way a browser can tell what kind of MTA a remote server has or how it is configured. That's why we need a new input type="eaimail" that accepts EAI addresses, which web sites can use if their MTA handles EAI.

The treatment of ß has nothing to do with this. The obsolete IDN2003 and current IDN2008 internationalized domain names are almost the same but one of the few differences is that 2003 normalizes (not punycodes) ß to ss while 2008 makes it a valid character. An address with an ASCII mailbox like user@ß.com could turn into user@xn--zca.com but ß@ß.com is EAI only. This turns out to matter because there are German domain names with ß in them that your browser cannot reach if it uses the obsolete rules. See my page https://fuß.standcore.com to see what your browser does.
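
To make the A-label point concrete, here is a minimal sketch of the Punycode encoding step from RFC 3492. The function names are hypothetical, overflow checks are omitted, and it covers only the encoding: it does no IDNA mapping or validation, so it assumes each label is already a lowercased, valid U-label.

```javascript
// Punycode parameters from RFC 3492, section 5.
const BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700;
const INITIAL_BIAS = 72, INITIAL_N = 128;

// Map a digit value 0..35 to "a".."z", "0".."9".
function digitToChar(d) {
  return String.fromCharCode(d < 26 ? 97 + d : 22 + d);
}

// Bias adaptation, RFC 3492 section 6.1.
function adapt(delta, numPoints, firstTime) {
  delta = firstTime ? Math.floor(delta / DAMP) : Math.floor(delta / 2);
  delta += Math.floor(delta / numPoints);
  let k = 0;
  while (delta > Math.floor(((BASE - TMIN) * TMAX) / 2)) {
    delta = Math.floor(delta / (BASE - TMIN));
    k += BASE;
  }
  return k + Math.floor(((BASE - TMIN + 1) * delta) / (delta + SKEW));
}

// Encode a single label (RFC 3492 section 6.3, without overflow handling).
function punycodeEncode(label) {
  const input = Array.from(label, (ch) => ch.codePointAt(0));
  const basic = input.filter((cp) => cp < 128);
  let output = basic.map((cp) => String.fromCodePoint(cp)).join("");
  let handled = basic.length;
  if (basic.length > 0) output += "-";

  let n = INITIAL_N, delta = 0, bias = INITIAL_BIAS;
  while (handled < input.length) {
    // Smallest code point that has not been handled yet.
    const m = Math.min(...input.filter((cp) => cp >= n));
    delta += (m - n) * (handled + 1);
    n = m;
    for (const cp of input) {
      if (cp < n) delta++;
      if (cp === n) {
        // Emit delta as a variable-length integer.
        let q = delta;
        for (let k = BASE; ; k += BASE) {
          const t = k <= bias ? TMIN : k >= bias + TMAX ? TMAX : k - bias;
          if (q < t) break;
          output += digitToChar(t + ((q - t) % (BASE - t)));
          q = Math.floor((q - t) / (BASE - t));
        }
        output += digitToChar(q);
        bias = adapt(delta, handled + 1, handled === basic.length);
        delta = 0;
        handled++;
      }
    }
    delta++;
    n++;
  }
  return output;
}

// ASCII labels pass through; anything else gets the "xn--" prefix.
function toALabel(uLabel) {
  return /^[\x00-\x7F]*$/.test(uLabel) ? uLabel : "xn--" + punycodeEncode(uLabel);
}

console.log("ß.com".split(".").map(toALabel).join(".")); // xn--zca.com
```

Under the IDNA2003 rules, ß would have been mapped to ss before this step was ever reached, which is why the two rule sets end up at different names.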

klensin commented 3 years ago

A few tiny additions and clarifications to John Levine's note (we do not disagree about the situation in any important way; the issues are just a bit more complex, with potentially broader implications, than one might infer from his message and they may call part of his suggestion into question). In particular, "eaimail" or something like it may be the wrong solution to the problem and may dig us in even deeper. For those who lack the time or inclination to read a fairly long analysis and explanation, skip to the last paragraph.

First, while his explanation of the difficulty with ß is correct, it is perhaps useful to also note that the ß -> ss transformation is often brought about by the improper or premature application of NFKC, which may have been the source of the recent dust-up about phishing attacks using Mathematical special characters. In the latter case, IDNA2008 imposes a requirement on "lookup applications" (including browsers) to check for and reject such things but they obviously cannot do so if the characters the IDNA interface sees are already transformed to something valid. The current version of Charmod-norm discusses, and recommends against, general application of compatibility mappings. It is perhaps also worth noting that UTS #46 is still recommending the use of NFKC (as part of NFKC_Casefold and its associated tables (see Section 5 of that document)) but also calls out the problem of reaching some IDNA2008-conformant domain names if the IDNA2003 rules are followed. Because, from observation, some (perhaps many or most) browsers look to UTS #46 for authority in interpreting domain names in, e.g., URLs while most or all SMTPUTF8 implementations (incorrectly, but commonly, known as "EAI") are strictly conformant to IDNA2008, the differences between the two introduce additional complications.

John mentions that a browser cannot tell what MTA and configuration a remote server might have, but it is even worse than that. In general, the browser is unlikely to know very much about the precise capabilities of the local MTA or Message Submission Server (MSA) unless those functions are actually built into the browser. The web page designer is even less likely to know and is in big trouble if different browsers behave differently. If the browser does not know, or cannot be configured to know, the distinction between an input type="email" and one of "eaimail" (which I hope would be called something else, perhaps "i18nemail") would not be as useful as his message implies.

Thinking about these issues in terms of what mail systems do with the addresses may miss an important issue. In many cases, web pages are trying to accept and validate something that looks like an email address but is not headed immediately into a mail system. Instead, it is destined for insertion into a database or comparison with something already there, validation by some other process entirely, or is actually an email address (or something that looks like one) used as a personal identifier such as a user ID. For the latter case, conversion of the part of the string following the "@" via the Punycode algorithm may not produce a useful result whether IDNA2008, IDNA2003, or UTS #46 rules are used. I would think it would be dumb, but if someone wanted to allow 3!!!\@#$%^&.ØØØ as a user ID and some system wants to allow that, we should probably stay out of their way (perhaps by insisting they use a type that does not imply an email address). However, the other side of that example is probably relevant to the discussion. The operator or administration of a mail server, or the administrator of a system that uses email addresses as IDs, gets to pick the addresses they will allow. Especially in the ID case, if they use a set of rules narrower than what RFC 5321 allows (and that are allowed in addresses on many mail systems), then they open themselves up to many frustrations and complaints from users whose email addresses are valid according to the standards and work perfectly well on most of the Internet but that are rejected by their systems. Internationalized addresses open up a different problem. As an example, I don't know how many mail servers identified by domains subsidiary to the 公益 TLD have allowed registration of local parts in Tamil or Syriac scripts, but I suspect that "zero" wouldn't be a bad guess. Someone designing a web site for users in China might know that and, for the best quality user experience, might want to reject or produce messages about non-Chinese local parts for that domain or perhaps even for any Chinese-script and China-based TLD. Similar rules might be applied in other places to tie the syntax of the local part to the script of the TLD but, for example in countries where multiple scripts are in use and "official", such rules might be a disaster. And, because almost anyone can set up an email server and there are clearly people on the Internet who prioritize being clever or cute or exhibiting a maximum of their freedom of expression over what others might consider sensible or rational, most of us who have been around email for many years have seen some truly bizarre (but valid) local parts of all-ASCII addresses and see no reason to believe we won't see even worse excesses as the Internet becomes increasingly internationalized.

This leads me to a conclusion that is a bit different from when this was discussed at length over a year ago. As we have seen when web sites reject legitimate ASCII local parts because people somehow got it into their heads that most non-alphanumeric characters were forbidden or were stand-ins for something else and, more broadly, because it is generally impossible to know what a remote MTA with email accounts on it will allow in those accounts, trying to validate email addresses by syntax alone is hard and may not be productive. When one starts considering email addresses (or things that look like them) that contain non-ASCII characters, things get much more difficult. IDNA2008, IDNA2003, and UTS #46 (in either profile) each have slightly different ideas about what they consider valid. Whatever any of them allows is going to be a superset of what any sensible domain or mail administrator will allow in practice. In general, a browser does not know what conventions back-end systems or a mail system at the far end of the Internet are following, much less whether they will be doing the same thing next month. So my suggestion would be that input type="email" be interpreted and tested only as "sort of looks like an all-ASCII email address", that a new input type="i18nmail" be introduced as "looks like 'email' but with some non-ASCII characters strewn around", and that the notion of validating beyond those really general rules be left to the back-end systems, the remote "delivery" MTAs, and so on. In addition, to the extent to which one cares about the quality of the user experience, it may be time to start redesigning the APIs associated with various libraries and interfaces so that they can report back real information about why putative email addresses didn't work for them, more precise than "failed" or "invalid address".

good luck to us all, john

nicowilliams commented 3 years ago

FYI, new installs of Postfix get EAI enabled by default.

My take is that a new input type is not required. An attribute by which to reject EAI is fair (e.g., because the site's MTAs don't support EAI on outbound).

jrlevine commented 3 years ago

s/reject/accept/ and I agree

nicowilliams commented 3 years ago

Validation on the front-end creates more ways to lose rather than more ways to win, and doesn't really protect the backend from vulnerabilities.

So I'm just not very keen on the browser doing much validation here. If the site operator has / does not have a limitation as to outbound email, I'm fine with stating it, but I'm also fine with allowing whatever, and making it the backend's job (or any scripts' on the page) to do any validation.

My take is that the default should be permissive. This should be how it is in general. Consider what happens otherwise. You might have a page and site that can handle EAI just fine but a developer forgot to update their email inputs on their pages to say so: now you have a latent bug to be found by the first user who tries to enter an internationalized address. This might mean losing user engagement, and you might never find out because why would the users tell you? But, really, why do we need the input to do so much validation? The input has to be plausibly an email address -- a subset of RFC5322, mailbox-part@domain.part is plenty good enough for 99.999% of users, and there is no good validation to apply to the mailbox part. This is how users get upset that they can't have jane+ietf@janedoe.example. We should stop that kind of foot self-shooting.

vdukhovni commented 3 years ago

The user should be able to enter an email address verbatim, with no second-guessing by input forms. If that address is known to be a-priori unworkable by the server's backend system, it can be rejected with an appropriate error message on the initial POST. Otherwise, if the address vaguely resembles mailbox syntax, it should be accepted and used verbatim. It may not be deliverable, but that's also true of many addresses that are syntactically boring: john.smith@example.com may bounce while виктор1βετα@духовный.org may well be deliverable...

masinter commented 3 years ago

https://html.spec.whatwg.org/multipage/input.html#e-mail-state-(type=email) defines: The value attribute, if specified and not empty, must have a value that is a single valid e-mail address.

The value sanitization algorithm is as follows: Strip newlines from the value, then strip leading and trailing ASCII whitespace from the value. This should be retained, if not expanded (other whitespace?). NFC shouldn't be necessary for user-typed data, but wouldn't hurt.
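
A minimal sketch of the sanitization steps quoted above. The helper name is hypothetical, and "ASCII whitespace" is taken to mean TAB, LF, FF, CR and SPACE:

```javascript
// Hypothetical helper mirroring the two quoted sanitization steps.
function sanitizeEmailValue(value) {
  // Step 1: strip newlines, i.e. remove every LF and CR anywhere in the value.
  const withoutNewlines = value.replace(/[\n\r]/g, "");
  // Step 2: strip leading and trailing ASCII whitespace only.
  return withoutNewlines.replace(/^[\t\n\f\r ]+|[\t\n\f\r ]+$/g, "");
}

console.log(JSON.stringify(sanitizeEmailValue("  user@example.com\r\n")));
```

Non-ASCII whitespace such as NBSP is left untouched by this sketch, which is exactly the "other whitespace?" question raised above.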

jrlevine commented 3 years ago

Keep reading and in another paragraph or two you'll find the Javascript pattern they tell you to use to validate e-mail addresses.

vdukhovni commented 3 years ago

The PCRE pattern behind the link is rather busted. It fails to properly validate dot-atoms, allowing multiple consecutive periods in unquoted local-parts (invalid addresses), while disallowing quoted local-parts (valid addresses). EAI-aside, this sort of fuzzy approximation of the actual requirements is harmful.
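
Both failure modes can be reproduced with a short script. The pattern below is transcribed from memory of the HTML Standard's "valid e-mail address" definition, so treat it as an assumption and check it against the spec text before relying on it:

```javascript
// Pattern as recalled from the HTML Standard (an assumption, not copied from this thread).
const htmlSpecEmailPattern =
  /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;

// Consecutive periods in an unquoted local-part are accepted.
console.log(htmlSpecEmailPattern.test("a..b@example.com"));
// A quoted local-part is rejected.
console.log(htmlSpecEmailPattern.test('"john smith"@example.com'));
// Any non-ASCII character is rejected on either side of the "@".
console.log(htmlSpecEmailPattern.test("ß@ß.com"));
```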

klensin commented 3 years ago

Hi. Maybe it would be helpful to back up a little bit and look at this from the perspective of a fairly common use case. Suppose I have a web site that sets up or uses user accounts and that I've decided to use email addresses as user IDs (there are lots of reasons why that isn't a good idea, but the horse has left the barn and vanished over the horizon). Now, while it would probably not be a good practice, there is no inherent requirement that my system ever send email to that address -- it can be, as far as I'm concerned, just a funny-looking user ID. On the other hand, if I tell a user who has been successfully using a particular email address for a long time that their address is invalid, I am going to have one very annoyed user on my hands. If I am operating in an environment in which "user" is spelled "customer", and I don't have a better reason for rejecting that address than "W3C and WHATWG said it was ok to reject it", I may also be able to have various sales types, managers, and executives in my face.

The fact that the email address is being used as a user ID probably answers another question. Suppose the user registers with an email address using native Unicode characters in both the local part and the domain part. Now suppose they come back a few weeks later and try to sign in using the same local part but a domain part that contains A-labels. Should the two be considered to match? Remembering that this is a user ID that has the syntax of an email address, not something that is going to be used exclusively in an email context, I'd say that is a business decision and not something HTML (or browsers, or similar tools) should get into the middle of. There is one exception. One of the key differences between IDNA2003 and IDNA2008 is that, in the latter, U-labels and A-labels are guaranteed to be duals of each other. If the browser or the back-end database system is stuck in IDNA2003 or most interpretations of UTS #46, then the fact that multiple source labels can map to a single punycode-encoded form opens the door to a variety of attacks, and anyone deciding that the two are interchangeable in that environment had best be quite careful about what user names they allow and how they are treated.

It may also be a reasonable business decision in some cases for a site to say "we don't accept non-ASCII email addresses as user IDs/account identifiers" or even "we accept addresses that use these characters, or characters from a particular set of scripts, and not others". But nothing in the HTML rules about the valid syntax for email addresses should be in the middle of that decision.

Beyond that, as others have suggested, one just can't know whether an email address is valid without somehow asking the server that hosts the relevant mailbox (or its front end). It may not be possible to ask that question in real time and, even if it is, doing so is likely to require significantly more time (user-visible delay) than browser implementers have typically wanted to invest. So let's stick to syntax.

That scenario by itself argues strongly for what I think John, Nico, and others are suggesting: the only validation HTML should be performing on something that is claimed to be an email address is conformity to the syntax restrictions in RFC 6531. Could one be even more liberal than that? Yes, but why bother.

aphillips commented 3 years ago

I was actioned by the W3C I18N WG with replying to this thread with a sense of the group.

Generally, we concur with @klensin's comment just above ⬆️.

We think that type=email should accept non-ASCII addresses, the better to permit adoption of EAI and IDNA. One reason for low adoption of these is barriers to using them across the Web/Internet. Removing these types of artificial barriers will not only encourage adoption, but will support those users who are already using these.

Users of this feature in HTML expect that the input value follow the structural requirements of an email address but don't expect the value to be validated to be an actual valid address. At best this amounts to ensuring that there is an @ sign and maybe some other structure that can be found with a regex. Users who want to impose an ASCII restriction or do additional validation are free to do so and mostly have to do this anyway. In our opinion, HTML would thus be best off to provide minimal validation. User agents can use type=email as a hint for additional features (such as prompting the user with their own email address or providing access to the user's address book), but this is outside the realm of HTML itself.

annevk commented 3 years ago

I played with this a bit and it seems the current state is rather subpar, though that also leaves more room for changes. Example input: x@ñ. Firefox submits as-is (percent-encoded). Chrome submits x@xn--ida. Safari rejects (asks me to enter an email address). If you use ñ before the @ all reject (as expected).

One thing that would help here is a precise definition of the validation browsers would be expected to perform if we changed the current definition, as well as tests for that. I can't really commit for Mozilla, though if we can make this a bit more concrete I'd be happy to advocate for change.

nicowilliams commented 3 years ago

@aphillips @annevk just about the only thing worth validating here is the RHS of the @ -- everything else should be left to either the backend (which does or does not support internationalized mailbox names) or the MXes ultimately identified by the RHS of the @, or any MTAs in the path (which might not support internationalized mailbox names, but damn it, should).

What is the most minimal mailbox validation? Certainly: that it's not empty. Validating that the mailbox is not some garbage like just ASCII periods, and so on, might help, but getting that right is probably difficult.

So that's my advice: validate that the given address is of any RFC 5322 form that is ultimately of the form ${lhs}@${rhs}, that the RHS is a domainname, supporting U-labels because this is a UI element, as well as A-labels, and validate that the LHS is not empty, and keep any further LHS validation to the utter minimum, in particular not rejecting non-ASCII Unicode.
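
A minimal sketch of that advice. The function name and the label pattern are hypothetical; in particular, approximating "valid U-label or A-label" as letters, digits and combining marks with interior hyphens is a loose assumption, far weaker than real IDNA2008 validation:

```javascript
// Loose stand-in for a U-label or A-label: letters, digits and marks,
// with hyphens allowed only in the middle. Not real IDNA2008 validation.
const LABEL = /^[\p{L}\p{N}\p{M}](?:[\p{L}\p{N}\p{M}-]*[\p{L}\p{N}\p{M}])?$/u;

function looksLikeMailbox(address) {
  // Split on the last "@" so that any "@" inside the LHS stays in the LHS.
  const at = address.lastIndexOf("@");
  if (at <= 0) return false; // no "@" at all, or an empty LHS
  const rhs = address.slice(at + 1);
  if (rhs === "") return false;
  // The LHS is deliberately not inspected beyond being non-empty.
  return rhs.split(".").every((label) => LABEL.test(label));
}

console.log(looksLikeMailbox("пример@Бориса.РФ"));
console.log(looksLikeMailbox("jane+ietf@janedoe.example"));
console.log(looksLikeMailbox("@example.com"));
```

Splitting on the last "@" is a design choice of this sketch, made so that the domain check never sees part of an unusual local-part.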

klensin commented 3 years ago

@annevk, I think your examples actually point out the problem. In order: it would be rare, but not impossible (details on request but I want to keep this relatively short) to see on on the RHS of the "@", and % is prohibited by the syntax in RFC 5321, but I'd generally recommend the use of percent-encoding in any part of email addresses. Pushing a domain-part through Punycode is prohibited by IDNA unless the labels it contains are validated to be U-labels. I can't tell from your example but if, e.g., the domain-part of the mailbox was \u1D7AA\u1D7C2 then it should be rejected, not encoded with punycode: doing otherwise invites errors down the line, errors for which the user gets obscure and/or misleading messages.

The problem is that email addresses with non-ASCII characters in the local-part and/or domain part are now valid and increasing numbers of people who can use them for email are expecting to use them through web interfaces.
Keeping in mind that a browser cannot ever fully "validate" an email address (something that would require knowing that the mailbox xyz@example.com exists but abc@example.com does not) I suggest:

(1) If a mailbox consists of a string of between 1 and 64 octets, an "@", and at least 2 and up to 255 more octets, treat it as acceptable and move on, understanding that all sorts of things may apply additional restrictions in actual email handling.

(2) In addition, if you wanted to and the domain-part contained non-ASCII characters, you could verify that any labels were valid IDNA2008 U-labels and reject the name if they were not ("invalid domain name in email address:" would be a much better message than "invalid email address") AND, optionally iff the local-part was entirely ASCII, convert those U-labels to A-labels. The SMTPUTF8 ("EAI") specs strongly recommend against making that conversion if the local-part is all-ASCII. When the local part is all-ASCII, the conversion will allow some valid cases to go through but, over time, it seems likely that those cases will become, percentagewise, less frequent so whether it is worth the effort is somewhat questionable.
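
A minimal sketch of rule (1), counting UTF-8 octets rather than characters. The function name is hypothetical, and splitting on the last "@" is an assumption, since the rule as stated does not say which "@" to use:

```javascript
const utf8 = new TextEncoder(); // always encodes to UTF-8

function withinLengthLimits(address) {
  const at = address.lastIndexOf("@");
  if (at < 0) return false; // no "@" at all
  const localOctets = utf8.encode(address.slice(0, at)).length;
  const domainOctets = utf8.encode(address.slice(at + 1)).length;
  return localOctets >= 1 && localOctets <= 64 &&
         domainOctets >= 2 && domainOctets <= 255;
}

console.log(withinLengthLimits("x@ñ")); // "ñ" is two octets in UTF-8
console.log(withinLengthLimits("x@a")); // one-octet domain part
```

Counting octets means a local-part in a non-Latin script reaches the 64-octet limit well before it reaches 64 characters.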

FWIW, the above was written in parallel with @nicowilliams's comment rather than after studying it, but his recommendation and mine are not significantly different except for that one marginal case of an ASCII local-part and a non-ASCII (but IDNA2008-valid) domain part.

klensin commented 3 years ago

I should have added, as @vdukhovni more or less points out, if one is going to try to validate the syntax of the local-part (even all-ASCII local-parts) it is important to actually get it right. As he shows, getting it right is a moderately complicated process, perhaps best left to email systems that are doing those checks anyway (which is what @nicowilliams and I essentially suggest above). But, if one is going to try to do it, it should be done right because halfway attempts (fuzzy approximations) are harmful, including letting some local-parts with invalid syntax through and prohibiting some valid ones.

annevk commented 3 years ago

@klensin I'm not sure what you're trying to convince me of. I was offering to help. (Percent-encoding is just part of the MIME type form submission uses by default, it's immaterial. Chrome's Punycode handling is what is encouraged by HTML today. That browsers do incompatible things suggests it might be possible to change the current handling.)

aphillips commented 3 years ago

@annevk I drew an action item (during part of I18N's meeting when @klensin was not available) to propose changes and I'd appreciate your thoughts on how to approach this. Looking at the current text, I guess a question is whether we should attempt to preserve the current behavior for ASCII email addresses (or their LHS/RHS parts) while simultaneously allowing labels that use non-ASCII Unicode? I18N WG participants seem to agree that we don't want to get into deep validation of the address's validity and want to limit ourselves to "structurally valid" addresses.

annevk commented 3 years ago

Right, e.g., at a minimum we should probably require that the string contains a @ and no surrogates. But currently we also prohibit various types of ASCII labels, e.g., quoted ones, and allowing those to now go through might not be great either.

jrlevine commented 3 years ago

It certainly has to be valid Unicode (e.g., no unpaired UTF-16 surrogates, no invalid UTF-8 bytes), and follow the rules like no unpaired quotes. Restricting it more than that is not likely to help.
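
A minimal sketch of the "no unpaired UTF-16 surrogates" check on a JavaScript string. The function name is hypothetical; newer engines also offer `String.prototype.isWellFormed()`, which answers the same question:

```javascript
function hasLoneSurrogate(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: it must be followed by a low surrogate.
      const next = s.charCodeAt(i + 1); // NaN at end of string
      if (!(next >= 0xdc00 && next <= 0xdfff)) return true;
      i++; // skip the low half of a valid pair
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      return true; // low surrogate with no preceding high surrogate
    }
  }
  return false;
}

console.log(hasLoneSurrogate("\uD83D\uDE00@example.com")); // valid pair
console.log(hasLoneSurrogate("\uD83D@example.com"));       // unpaired high surrogate
```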

masinter commented 3 years ago

Even if people are just using things that look like email addresses for purposes other than sending email, do you really want to allow unnormalized Unicode or leading or trailing white space in the LHS? For sites that use email addresses as user IDs, changing HTML validation to allow entry of different sequences that are visually identical opens up new security concerns.

nicowilliams commented 3 years ago

@masinter Absolutely this must allow unnormalized Unicode because users cannot be counted on to produce normalized Unicode. Regarding whitespace, trimming it is fine. I don't think there are any security concerns regarding client-side validation -- if there is a site where relaxing client-side validation of email addresses creates a security concern, then the site is already vulnerable.

jrlevine commented 3 years ago

Mailbox names are pretty much arbitrary UTF-8. It doesn't have to be normalized, for that matter, it can be a sequence of ZWJ and Arabic combining marks. While I agree that no sensible mail provider would use names like that, we don't get to tell people to be sensible. White space has to be quoted so unquoted trailing whitespace isn't valid, although unquoted NBSP and NNBSP is.

nicowilliams commented 3 years ago

@aphillips @annevk See above. Do less validation. Validate only:

In all cases allow Unicode throughout.

Trim whitespace, sure.

Anything else?

jrlevine commented 3 years ago

the RHS has to be a hostname, which limits the characters to the ones valid in U-labels

masinter commented 3 years ago

Last time around the consensus seemed to be that EAI input fields should be marked as unicode or eai or the like, since it'll be a while since all mail systems handle EAI.

I think there are likely a large number of sites that use <input type="email"> and aren't prepared to deal with spoofing, normalization or untypable addresses injected. Rather than introduce that kind of vulnerability by changing what type="email" means for them, make adding EAI support an explicit step.

aphillips commented 3 years ago

@masinter Users don't distinguish between entering person@example.com and персон@еџампле.ру when using email. If we create indistinguishable input boxes for this, users and content authors will be confused by the difference. It creates another barrier to more-widespread adoption of IDN and SMTPUTF8. The end-to-end folks have been pestering us (I18N) for years about this. Since browsers are inconsistent anyway and users need to process the values they are sent (which already have spoofing or other garbage injection possibilities), this is an opportunity to be done with the problem.

Would an alternative be to add a "legacy" attribute?

@nicowilliams foo@localhost doesn't have a dot. That's one reason (among several) that the current regex makes * ('.' label) on RHS optional.

nicowilliams commented 3 years ago

@aphillips Really, users input foo@localhost into these elements? Fine.

I agree with you regarding not wanting to type EAI vs. not-EAI. Users don't and shouldn't have to know.

@masinter

I think there are likely a large number of sites that use <input type="email"> and aren't prepared to deal with spoofing, normalization or untypable addresses injected. Rather than introduce that kind of vulnerability by changing what type="email" means for them, make adding EAI support an explicit step.

Again, if relaxing client-side validation "causes" a security problem, then the security problem already exists. Relaxing client-side validation cannot cause a security problem on the server side!

Also, the server-side that gets a form with email address inputs should NOT normalize the mailbox part. Leave that to mail software, specifically the last hop MTA should normalize the mailbox part if at all (it could use form-insensitive matching of mailbox names). The mailbox part is for all intents and purposes opaque to all relays.

masinter commented 3 years ago

I forgot -- form fields (including those with type="email") are encoded using the charset of the form, not UTF-8, so anyone trying to enter an EAI address into an input field in a (non-UTF-8) form will have trouble because there is no way to represent the characters.

nicowilliams commented 3 years ago

Well, certainly you can (and should) set the charset to be UTF-8. If the charset is something other than UTF-8, well, I'm not sure I care what happens then to non-ASCII input that can't be represented as whatever the chosen charset was, but certainly EAI addresses that use only characters that can be represented in whatever that charset is will survive the POSTing of the form, and then the server can convert to UTF-8 or UTF-16 as needed.

The fact that you could set the form's charset to anything other than a Unicode encoding does not mean we can't internationalize form inputs.

jrlevine commented 3 years ago

I find it hard to care about people who expect EAI addresses but use an encoding other than UTF-8 or (for backward compatibility only) UTF-16.

aphillips commented 3 years ago

Browsers encode characters not supported by the charset of the form as decimal NCRs (i.e. &#1234;)--appropriately percent encoded as needs be. Note that the accept charset of the form does not need to match the page's encoding. Actual user interaction with a page is always in Unicode--charset is just a wire encoding phenomenon. I can't quite find the reference in the html spec where form submission does this, but you can test it for yourself easily enough :-).

masinter commented 3 years ago

on unicode normalization. Let's suppose there are two systems (one for a phone and another for a desktop) that handle the encoding differently, one produces unnormalized unicode and the other produces normalized unicode on entry. (mac and windows with Vietnamese?) The two forms are visually completely indistinguishable. The server accepts the form data and displays a confirmation "Is this the email address you meant?" and displays it in a font that distinguishes between I and l and 1 and |. The problem is that even if downstream mail software handles the equivalence, the end user will be unhappy if they subscribe on one device and try to unsubscribe with the other.
In this case there is no particular "security" problem, but it's a usability problem that the form and server software wasn't prepared to deal with back when type="email" implied ASCII. The URL standard starts with normalization, why not for EAI?

masinter commented 3 years ago

In reply to @aphillips "Browsers encode characters not supported by the charset of the form as decimal NCRs (i.e. &#1234;)--appropriately percent encoded as needs be. " not true; browsers may accept NCRs with Unicode code points but they don't generate them when POSTing form data:

https://url.spec.whatwg.org/#application/x-www-form-urlencoded https://github.com/whatwg/url/issues/452#issuecomment-658639752

jrlevine commented 3 years ago

Because there is an EAI mail standard in RFC 6531 and that's not what it says. Surely this is not a surprise.

See Klensin's comment about using addresses as account identifiers.

nicowilliams commented 3 years ago

The problem is that even if downstream mail software handles the equivalence, the end user will be unhappy if they subscribe on one device and try to unsubscribe with the other.

There is an easy answer to this: normalize for comparison (form-insensitive comparison) but store as given if you store at all. I.e., be form-insensitive, but form-preserving. Just as one typically does with case in case-insensitive systems.

Form equivalence issues are very similar to case equivalence in case-insensitive systems!

When you design a case-insensitive system, the simplest thing to do is to "normalize" case (i.e., case-fold) during string comparison and for indexing tables, but otherwise store with the CaSe aS gIvEn.

The problem you mention happens as to case with all-ASCII email addresses today because even though mailbox names are case-sensitive, often they are implemented as case-insensitive, such that foo@gmail.com == Foo@gmail.com == FoO@gmail.com == ... But that problem doesn't have to happen at all as to form because where it matters the comparisons/lookups really have to be form-insensitive, and IMO normalizing at the UI is not a good answer. Though I won't be upset if browsers do normalize mailbox names, I don't think they should have to, and I would much prefer that they not normalize mailbox names at all.
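To make "form-insensitive but form-preserving" concrete, here is a minimal sketch, not anyone's actual implementation. It assumes NFC as the normalization form used for comparison, and the names `sameAddressForm` and `AddressBook` are hypothetical. The address is looked up by its normalized form but stored exactly as the user typed it.

```javascript
// Hypothetical helper: compare two addresses form-insensitively.
// NFC is an assumed choice; neither input is modified.
function sameAddressForm(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

// Hypothetical store: indexed by the normalized form, holding the
// address as given (form-preserving).
class AddressBook {
  constructor() {
    this.byKey = new Map();
  }
  add(address) {
    const key = address.normalize("NFC");
    if (!this.byKey.has(key)) this.byKey.set(key, address);
    return this.byKey.get(key);
  }
  remove(address) {
    return this.byKey.delete(address.normalize("NFC"));
  }
}

const composed = "jos\u00e9@example.com";    // "é" as a single code point
const decomposed = "jose\u0301@example.com"; // "e" followed by a combining acute

const book = new AddressBook();
book.add(decomposed);                // subscribe on one device
console.log(book.remove(composed));  // unsubscribe on the other device
```

With this shape, the subscribe/unsubscribe mismatch described above does not arise, and nothing ever rewrites the mailbox name that is handed to the mail system.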

klensin commented 3 years ago

@masinter: Larry, as John Levine points out, that is just not what the specs say. Could the trade-offs have been evaluated differently circa nine years ago, leading to a set of rules you would like better now? Yes, probably. But they weren't and I note that, IIRC, you did not participate significantly in the EAI WG nor raise these issues on IETF Last Call. There are several things in the specs the EAI WG produced that ended up that way because no one considered alternatives; this is not one of them. If you think we got it wrong, you know how to proceed: create an I-D explaining what was wrong about it, propose a change, and see what traction it gets.

I don't see what arguing for a different treatment here accomplishes. I do think that, for this particular effort, the principal consideration should be that, if users have email addresses that conform to the relevant standards and work well in the Internet mail system, HTML should neither tell those users that those addresses are invalid nor map them into something that the mail system might consider different. If you disagree with that as a principle, let's discuss it, not whether the specs should be different (or counterfactual ideas about how they work).

As Nico points out, if someone wants or needs to do a back-end comparison, that may be entirely reasonable (with normalization before comparison being an obvious possibility) as long as it is remembered (if it might be relevant) that, as far as the mail system is concerned, such comparisons are a bit fuzzy and might produce false matches.

I could write much more about this and some of the details and trade-offs (and actually did but decided to not send it). I hope I don't have to.

JDLH commented 3 years ago

@klensin , I want to be sure I understand your 15. July comment, but there are a couple of phrases I am struggling to parse:

it would be rare… to see on on the RHS of the "@"…

I don't understand the repeated words "on on". To see what on the RHS? Did the system eat a word or a character which you intended to be there?

… but I'd generally recommend the use of percent-encoding in any part of email addresses.

I don't understand if you are recommending for or against percent-encoding. "I'd generally recommend… in any part" seems to mean you are in favour, but the wording "in any part" fits better with a negative, and I read the context to imply you recommend [against] "the use of percent-encoding in any part…".

This is a fascinating discussion, and I am learning a lot. I want to be sure I am understanding. My apologies if I am being dense.

masinter commented 3 years ago

The only thing I can find is this presentation which wasn't helpful. It's fine with me to introduce a new feature that it now accepts IDN and Unicode strings where it didn't before. Usually the browsers like to warn people when they might suddenly get form values they weren't expecting (especially if they used multipart/form-data with text/plain;charset="utf8" ).

Getting the form to accept Unicode in email addresses is just opening the front door to making the rest of the infrastructure actually work. People who maintain those web sites will have to test, and testing isn't easy.

Usually this kind of thing is staged, people are at least warned (like with dropping ftp:). Better would be to define new and deprecate old.

klensin commented 3 years ago

@JDLH: First, you are not being dense. The comment suffers from two problems: (1) I'm tired, short on time, preoccupied with other things, and a tad frustrated with aspects of the conversation including feeling like we've had parts of it more than once before. (2) Sadly, some of the specs involved and their provisions that bear on this subject are more complicated than one might wish and, in a more perfect world, probably one in which everything had been developed at the same time, all the pieces might fit together in a much more simple and elegant way. The combination of the two, and really not wanting to explain the whole history of non-ASCII email addresses and headers and non-ASCII domain names here, results in my writing too rapidly and making silly typographical or pasting errors. To try to answer your questions:

The part of the phrase with the double "on" should have read something more like "... to see one (a "%") on the RHS of the "@", and % is prohibited...". That was intended to be a short way to explain that, while many of us would use a sequence of rude words to describe the wisdom of taking advantage of it, foo%bar.example.com is a perfectly valid name as far as the DNS specifications are concerned. Because RFC 5321 (the SMTP spec, originally RFC 821) will not allow it, the DNS specs consider someone who sets up a name like "foo%bar" imprudent, but, again, it is not invalid.

"but I'd generally recommend the use of percent-encoding in any part of email addresses" should, as you surmised, have been "but I'd generally not recommend..." This is one of those places where the pieces don't quite fit together. In URIs and related contexts, "%", as you know, introduces two hex digits to represent an octet. Historically, many mail systems have interpreted an email address like xyz%example.net@example.com or, more to the point, user%example.earn@mitvma.mit.edu as an indication that the message be delivered, using SMTP, to example.com or mitvma.mit.edu (or wherever their MX records point) with the expectation that they will figure out how to deliver to xyz at example.net or user at example.earn, whatever those "hosts" mean to that system. However, because those delivery hosts (more or less the RHS of the "@") are free to interpret local parts any way they like, xyz%example.com could be treated as an atomic mailbox name on the local host, example.earn could be mapped into something else entirely, and, indeed, abc.def%joe.smith@example.com could be interpreted as Joe Smith at def.abc, whatever that might mean -- entirely up to that delivery system. While none of those examples are particularly problematic relative to the URI use of %, consider xyz%40example.net@example.com or jos%c3%a8.doe@example.com and the number of different ways that can be interpreted in an environment that supports non-ASCII local-parts.

Is that a bit more clear?

john
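A small sketch of why those last two examples are awkward. It only shows what a URI-oriented percent-decoder does with the addresses quoted above; it says nothing about what any mail system does. `tryPercentDecode` is a hypothetical helper built on the standard `decodeURIComponent`.

```javascript
// Hypothetical helper: percent-decode an address the way URI-oriented
// software might. Returns null when "%" is not followed by two hex digits.
function tryPercentDecode(address) {
  try {
    return decodeURIComponent(address);
  } catch (err) {
    if (err instanceof URIError) return null;
    throw err;
  }
}

const examples = [
  "user%example.earn@mitvma.mit.edu", // "%ex" is not a hex escape
  "xyz%40example.net@example.com",    // "%40" decodes to "@"
  "jos%c3%a8.doe@example.com",        // "%c3%a8" decodes to U+00E8
];

for (const address of examples) {
  console.log(address, "->", tryPercentDecode(address));
}
```

The first address is simply not decodable as URI escapes, while the other two silently turn into different strings (one with a second "@", one with a non-ASCII local part). A mail system that treats the local part as opaque would regard each decoded string as a different mailbox from the original.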

JDLH commented 3 years ago

@klensin Thank you for the clarification. I understand the 15. July comment much better now.

And I understand the frustration with the conversation, and the fact that the pieces don't quite fit together. I notice, for example, that Github's comment-formatting software is doing its best to linkify email addresses, but still leaves off the prefix of an email address like "user%example.earn@mitvma.mit.edu" because of the '%' character.

nicowilliams commented 3 years ago

@masinter Do you disagree with the assertion that relaxing client-side validation cannot cause a server-side vulnerability?

nicowilliams commented 3 years ago

Side note: One might think that normalizing to compare strings is expensive, but that is not so. There are a number of optimizations that can make comparison of mostly-ASCII and mostly-already-normalized strings very fast indeed. First off, normalize one character at a time. Second, normalize only when needed -- a character that consists of just an ASCII codepoint requires no normalization, and an ASCII codepoint cannot combine with a preceding one, thus in a sequence like ab, the second codepoint makes it clear that the first requires no normalization. Third, memcmp() equality means no normalization is needed -- normalize only when you stumble onto codepoints that might be parts of characters that require normalization, but only when the other string differs at these codepoints. Fourth, one can implement a glibc-style optimization where, if possible due to alignment, you load and compare 4 octets at a time, only here you can mask with 0x80808080 and if that's == 0 then you can take a fast path for the first three bytes and if it is not then you take a slow path. The downside is that the worst case will be somewhat slower than normalizing the full input strings would have been to begin with. Point is, if you think form-insensitivity must be slow, think again.
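A much-simplified sketch of the fast-path idea, not the character-at-a-time algorithm described above. It takes only two shortcuts: identical strings need no normalization, and two all-ASCII strings that differ cannot become equal under normalization. Everything else falls back to normalizing both strings in full. NFC is an assumed choice, and `allAscii` and `equalFormInsensitive` are hypothetical names.

```javascript
const encoder = new TextEncoder();

// Check four octets at a time: if no octet has its high bit set,
// the whole word is ASCII. Leftover octets are checked one by one.
function allAscii(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  let i = 0;
  for (; i + 4 <= bytes.length; i += 4) {
    if ((view.getUint32(i) & 0x80808080) !== 0) return false;
  }
  for (; i < bytes.length; i++) {
    if ((bytes[i] & 0x80) !== 0) return false;
  }
  return true;
}

function equalFormInsensitive(a, b) {
  // Fast path 1: identical strings (the memcmp() case).
  if (a === b) return true;
  // Fast path 2: both are pure ASCII and they differ, so normalization
  // cannot make them equal. Both must be ASCII: a non-ASCII code point
  // such as U+212A KELVIN SIGN normalizes to an ASCII letter.
  if (allAscii(encoder.encode(a)) && allAscii(encoder.encode(b))) return false;
  // Slow path: normalize both strings in full.
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log(equalFormInsensitive("foo@example.com", "foo@example.com"));
console.log(equalFormInsensitive("jos\u00e9@example.com", "jose\u0301@example.com"));
```

For typical all-ASCII addresses this never calls `normalize()` at all, which is the point being made; a production version would avoid the encoding step and normalize incrementally as described.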

masinter commented 3 years ago

@nicowilliams I disagree with the idea that there are "server side" and "client side" vulnerabilities. Most of the vulnerabilities are due to the human user in the loop, and what a person would expect and enter in the overall situation, as mediated by the user agent. Is phishing client side or server side?

nicowilliams commented 3 years ago

@masinter Earlier you were saying that relaxing client-side validation would expose server-side issues, but now you're just changing the topic. Are you saying there's a phishing issue?

aphillips commented 3 years ago

In working on a PR (so we can discuss text directly), the main change needed is in the ABNF for "valid email address". There are different choices for how to approach this, so I thought I'd seek input (this may help other parts of this discussion as well). Here's the current ABNF (notice that it doesn't allow either non-ASCII domain names or local parts):

email         = 1*( atext / "." ) "@" label *( "." label )
label         = let-dig [ [ ldh-str ] let-dig ]  ; limited to a length of 63 characters by RFC 1034 section 3.5
atext         = < as defined in RFC 5322 section 3.2.3 >
let-dig       = < as defined in RFC 1034 section 3.5 >
ldh-str       = < as defined in RFC 1034 section 3.5 >

One approach would be to fix atext by moving to the definition in RFC6532. This has the advantage of being simple to read and keeping the definitions in the RFCs (rather than extracting them), although it hides the change somewhat. Note that the rest of the text in the section would make plain what happened:

email         = 1*( atext / "." ) "@" label *( "." label )
atext         = < as defined in RFC 6532 section 3.2 >
label         = 1*63( atext )  ; limited to a length of 63 characters by RFC 1034 section 3.5

If we import all of the definitions directly, we get a fair bit of (byte-oriented) gunk:

email           = 1*( atext / "." ) "@" label *( "." label )
atext           = ALPHA / DIGIT /         ; Printable US-ASCII
                       "!" / "#" /        ;  characters not including
                       "$" / "%" /        ;  specials or a valid UTF-8 
                       "&" / "'" /        ;  non-ASCII sequence of
                       "*" / "+" /        ;  2 to 4 bytes
                       "-" / "/" /
                       "=" / "?" /
                       "^" / "_" /
                       "`" / "{" /
                       "|" / "}" /
                       "~" / UTF8-non-ascii
UTF8-non-ascii  =   UTF8-2 / UTF8-3 / UTF8-4
UTF8-2          =   <Defined in Section 4 of RFC3629>
UTF8-3          =   <Defined in Section 4 of RFC3629>
UTF8-4          =   <Defined in Section 4 of RFC3629>
label           =   1*63( atext )  ; limited to a length of 63 characters by RFC 1034 section 3.5

A cleaner solution might be to use Unicode code points like so:

email         = 1*( utext / "." ) "@" label *( "." label )
atext         = < as defined in RFC 5322 section 3.2.3 >
utext         = atext / %x80-D7FF / %xE000-10FFFF ; unreserved printable ASCII characters or any non-ASCII Unicode code points
label         = 1*63( utext )  ; limited to a length of 63 characters by RFC 1034 section 3.5

I think I prefer the last one by a fraction over the first one. What do others think?
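For anyone who wants to try the last variant against real input, here is a rough JavaScript transcription. It is a sketch under assumptions, not proposed spec text: the `atext` punctuation is copied from RFC 5322 section 3.2.3, the `u` flag makes the pattern match code points rather than UTF-16 code units, and the names `UTEXT`, `LABEL`, `EMAIL` and `looksLikeEmail` are made up for this example.

```javascript
// utext = atext / %x80-D7FF / %xE000-10FFFF
const UTEXT =
  "[A-Za-z0-9!#$%&'*+\\-/=?^_`{|}~\\u{80}-\\u{D7FF}\\u{E000}-\\u{10FFFF}]";

// label = 1*63( utext )
const LABEL = UTEXT + "{1,63}";

// email = 1*( utext / "." ) "@" label *( "." label )
const EMAIL = new RegExp(
  "^(?:" + UTEXT + "|\\.)+@" + LABEL + "(?:\\." + LABEL + ")*$",
  "u"
);

function looksLikeEmail(value) {
  return EMAIL.test(value);
}

for (const value of ["user@example.com", "\u00df@\u00df.com", "user@", "a b@example.com"]) {
  console.log(JSON.stringify(value), looksLikeEmail(value));
}
```

Like the ABNF it mirrors, this is only an approximation: it counts code points rather than octets and does not check that labels are valid U-labels.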

jrlevine commented 3 years ago

After a great deal of discussion I think we have agreed that at this point we have no idea what people will actually allow in EAI addresses, and it is not easy to describe likely possibilities as REs. Useful addresses will likely be in a single script or a small set of compatible scripts, but good luck describing that. So I think you're pretty close. The local part is limited to 64 octets and some MTAs enforce that so the first rule should be: email = 1*64( utext / "." ) "@" label *( "." label ) Some character combinations in local parts need to be quoted, such as two dots in a row, but nobody uses addresses like that so don't bother.

The label length limit is actually the limit on the ASCII A-label which can be longer or shorter than the corresponding U-label. For example, this 63 octet A-label: xn--fiqaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa corresponds to this U-label: 中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中 which is 57 utexts or 171 octets. That means 63 is wrong, but it's no wronger than any other number. The set of characters allowed in a U-label is much smaller than your utext but again, it's not practical to describe in an RE. You really need to run it through something like libidn2 to try to normalize and convert it, but I don't think that exists in Javascript.
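The octet arithmetic above can be checked with a short sketch. The octet and code point counts use only `TextEncoder` and string iteration. The A-label conversion is an assumption: it relies on the runtime's `URL` host parser applying domain-to-ASCII processing (UTS #46 as used by the URL Standard), which is not the same thing as full IDNA2008 validation in the libidn2 sense. The helper names and the `.example` suffix are hypothetical.

```javascript
const utf8 = new TextEncoder();

function utf8Octets(s) {
  return utf8.encode(s).length;
}

function codePoints(s) {
  return Array.from(s).length; // string iteration yields code points
}

// The 64-octet limit applies to the UTF-8 encoding of the local part.
function localPartFits(local) {
  return utf8Octets(local) <= 64;
}

// Assumption: the URL host parser converts a U-label to its A-label.
function toALabel(uLabel) {
  return new URL("http://" + uLabel + ".example/").hostname.split(".")[0];
}

const uLabel = "\u4e2d".repeat(57); // the repeated-character label from above
const aLabel = toALabel(uLabel);

console.log(codePoints(uLabel), utf8Octets(uLabel), aLabel.length);
console.log(localPartFits("\u4e2d".repeat(21)), localPartFits("\u4e2d".repeat(22)));
```

The same label is 57 code points but 171 UTF-8 octets, and a local part of only 22 such characters already exceeds the 64-octet limit, which is why a single repetition count in the pattern cannot express either limit exactly.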

aphillips commented 3 years ago

@jrlevine Thanks. I knew about the 64 octet limit, but the existing ABNF didn't implement it (and UTF-8 non-ASCII code points are not octets either). I could test if browsers limit the LHS before implementing limits in the ABNF.

I'm also aware of the encoding efficiency relationship between A-labels and U-labels in terms of the 63 octet limit. As you note, we're not going to describe the actual limit using regex. For those not familiar with how punycode works, 57 code points is the upper limit for a non-ASCII-containing label and occurs when the same non-ASCII character is repeated. If confining ourselves to planes 0 through 0x3, a U-label can reach the 63 octet limit in as few as 14 code points by choosing code points that are evenly spaced apart. This would make the label illegal in other ways [crossing script boundaries mainly], although a Han or Hangul label might get close to this number. We might be able to describe the upper limit since it is structural:

label = 1*63( atext ) / 1*57( utext )

... although it's pretty weird (the 57 isn't a guarantee of anything, while the 63 would be). I note that I flubbed here, since let-dig and ldh-str in the original are more restrictive than the atext production (most of that punctuation). I need to fix that.

jrlevine commented 3 years ago

I'd just leave the labels as 1*63( utext ), since utext is a superset of atext. This pattern can only be an approximation of what's legal so I wouldn't try too hard to be clever. I think it matches actual addresses pretty well but there are valid addresses it'll reject like "...."@example.com and invalid ones it'll accept like ....@example.com or anything with a non-existent domain name.