Using email addresses with German umlauts raises UnicodeEncodeError

jugmac00 commented 4 years ago

Reproduction

Use an email address with a German umlaut, e.g. jürgen.gmach@example.com, and try to send an email via MailHost.send.

This results in an UnicodeEncodeError.

The error is thrown in cpythons formataddr - which is only able to handle ascii safe email addresses. An attempt to implement modern RFCs have failed, see https://bugs.python.org/issue11783

Suggestion

Instead of passing through the UnicodeEncodeError I suggest to catch the error in Mailhost and reraise an MailHostError as it is done with other errors. That way one can catch all mail related errors more easily.

P.S.: Hello from @ctheune

jugmac00 commented 4 years ago

Any opinion on this?

I'd wrap the three occurrences of formataddr with a try except block, catching the UnicodeEncodeError and raise a MailHostError.

Is a pull request of this kind welcome?

d-maurer commented 4 years ago

Jürgen Gmach wrote at 2020-6-21 23:19 -0700:

Any opinion on this?

I'd wrap the three occurrences of formataddr with a try except block, catching the UnicodeEncodeError and raise a MailHostError.

I am in general unhappy with exception conversion (here "UnicodeEncodeError -> MailHostError"): it makes the analysis of problems significantly more complex because there is a risk that information is lost and it is difficult to analyse the state of the original exception in a debugger.

The information loss risk is less prominent with Python 3 because it supports nested exceptions but some components (--> Products.SiteErrorLog) do not yet fully support nested exceptions.

The MIME standards documents describe how to handle special characters in mail headers and I think Python's email package implement it well.

Thus, instead of a "UnicodeEncodeError -> MailHostError" conversion I would prefer to use email functionality to convert the incoming data in a standard conform way and use the result. This should avoid the UnicodeEncodeError in the first place avoiding a need to convert it into MailHostError.

jugmac00 commented 4 years ago

Thanks for your input, @d-maurer

Just to make sure we both speak about the same problem - my problem was one user entering an email address with German Umlauts, e.g. "jürgen.gmach@apis.de" (by mistake).

While handling other parts of the header seems to be no problem in general (name part of an email address or the subject), there seems to be no broad support for email addresses with non ascii characters in general.

Instead of handle special characters and trying to massage them, I'd say we should not allow email addresses with e.g. umlauts at all.

If we transform an email address, what happens with that email at any of the next smtp servers?

I suggested to convert formataddr s UnicodeEncodeError into a MailHostError, as then the developers of Zope applications simply can catch a MailHostError for all email related problems.

There is only one place where an uncaught UnicodeEncodeError can happen in cpython's formataddr, see https://github.com/python/cpython/blob/36ff513f82e372ed3cea0bf7cbdf15a1ef6dab9e/Lib/email/utils.py#L91

more context: https://github.com/python/cpython/blob/36ff513f82e372ed3cea0bf7cbdf15a1ef6dab9e/Lib/email/utils.py#L76-L106 ... at least for Python 3.

I am -1 on transforming the email address, and +-0 on keeping the current state, which is kinda surprising for a Zope developer. Maybe it would be also ok to just amend the docstrings of MailHost so we make clear that only ascii-safe email addresses are supported.

Then every user of MailHost has to do some input validation before calling any of MailHost's public methods.

d-maurer commented 4 years ago

Jürgen Gmach wrote at 2020-6-22 01:51 -0700:

... Just to make sure we both speak about the same problem - my problem was one user entering an email address with German Umlauts, e.g. "jürgen.gmach@apis.de" (by mistake).

I did not look at the issue and replied just to the comment mail. Thus, my reply may be partially inadequate.

The addresses in email messages can contain non ACII characters. Those addresses have roughly the general form "something" smtp_address or something <stmp_address> or smtp_address. The "something" above can contain almost arbitrary characters but those need to be get MIME compatible encoding before the message can be sent.

smtp_address may still be ASCII only. If so, I expect it to change in the not so far future because one part of the smtp_address is the domain and domain specifications have been internationalized some years ago. You can now have international characters in domain specifications -- expected to be encoded for underlaying protocols (like HTTP; maybe smtp). As it is now allowed to use international characters in domains, it would surprise me when the far more end user relevant email addresses would remain ASCII only.

... Instead of handle special characters and trying to massage them, I'd say we should not allow email addresses with e.g. umlauts at all.

This might still be appropriate for the time being (I have not checked whether there is already a standard for internationalized email addresses) -- but I assume this will no longer be the case in the future.

If we transform an email address, what happens with that email at any of the next smtp servers?

This is what should happen for wrong email addresses.

You can (possibly) detect some wrong email addresses (e.g. invalid syntax, including invalid characters) but you cannot detect all of them. The typical case "wrong email address" (misspelled, account deleted) can only be detected by mail transport system.

I suggested to convert formataddr s UnicodeEncodeError into a MailHostError, as then the developers of Zope applications simply can catch a MailHostError for all email related problems.

I am still unhappy with this.

In your szenario wrong user input was responsible. I would prefer that this case would be detected far earlier - to provide immediate feedback to the user preventing him to enter invalid data into the system.

... I am -1 on transforming the email address

As explained above, I expect that in the future such a transformation will become necessary (in analogy to the processing of internationalized domain names). Its purpose is to transform an international email address into its ASCII encoding.

The transformation would likely be implemented by Python's email package. MailHosts responsibility is to call the respective email function.

Even with internationalized email addresses, there will still be invalid email addresses. The processing function may raise an exception for those cases.

I have not looked recently at MailHost. It may already do the right thing. And it may be that one of the exceptions raised for invalid email addresses is UnicodeEncodeError. In this case, it the exception is is converted the new exception should be specific, e.g. InvalidEmailAddress not a general MailHostError. InvalidEmailAddress may derive from MailHostError and should contain sufficient information about the original problem.

and +-0 on keeping the current state, which is kinda surprising for a Zope developer. Maybe it would be also ok to just amend the docstrings of MailHost so we make clear that only ascii-safe email addresses are supported.

In my view, it is not MailHost's responsiblity to determine which email addresses are supported. It should delegate all email address handling to Python's email package and let this decide what it supports.

Then every user of MailHost has to do some input validation before calling any of MailHost's public methods.

The user will get an exception if the input is invalid (in an obvious way) -- even though it may not be a specific exception.

You might want to distinguish user errors from application (i.e. programming) errors. But, ideally, user errors are detected when the user provides the input. If this is done, then all remaining errors are either application errors or temporary errors (such as e.g. ConflictError).

jugmac00 commented 4 years ago

smtp_address may still be ASCII only. If so, I expect it to change in the not so far future because one part of the smtp_address is the domain and domain specifications have been internationalized some years ago.

I referenced the cpython issue, where it was tried to implement smtp_addreses with non ASCII characters in my first post: https://bugs.python.org/issue11783

It does not look like it will be fixed anytime soon.

In your szenario wrong user input was responsible. I would prefer that this case would be detected far earlier - to provide immediate feedback to the user preventing him to enter invalid data into the system.

While it is common to do some "regex" check on email addresses, I have not seen encoding checks that high in the stack.

Luckily, I have wrapped all email building in a wrapper class of my own, so I have no problem to do those checks myself. Though, my initial idea was to handle this in MailHost, as other errors are handled there, too.

Speaking of error handling...

InvalidEmailAddress may derive from MailHostError and should contain sufficient information about the original problem.

That was exactly my idea what to do, and I usually always build an exception hierarchy when I develop libraries. I did not propose this at first, as MailHost and many Zope products do not seem to follow this schema.

e.g.from MailHost source code

MailHostError('Message already encoded') and MailHostError('No message recipients designated')

You might want to distinguish user errors from application (i.e. programming) errors.

I did not take notice of this aspect - thanks for pointing out!

There are many options what possible could be done now.

Until there is a decision what the best way is (even keep current's state), I will catch this kind of invalid email address (e.g. Umlaut) in my build_email function, conveniently at one place.

If nothing will be changed on the MailHost code, at least the docstrings should be updated with a note of warning, that non-ascii characters are not allowed for the email address (the address part of it).

jugmac00 commented 3 years ago

The issue is open for half a year - as there maybe be some free time during christmas holidays...

Any further or final opinion how an appropriate fix should look like?

d-maurer commented 3 years ago

Jürgen Gmach wrote at 2020-12-15 05:21 -0800:

The issue is open for half a year - as there maybe be some free time during christmas holidays...

Any further or final opinion how an appropriate fix should look like?

To which RFC's does your initial comment refer? As far as I know, email addresses are still ASCII only. The Python issue you mention relates to IDNA (= "Internationalized Domain NAme") - a way to have special characters in host/domain names -- not the name part of email addresses (as in your example).

You explained that the email was entered wrongly. In my view, input verification should handle wrong input. In is not a backend task to verify input.

jugmac00 commented 3 years ago

Thank you for the feedback!

Closing this issue then.

zopefoundation / Products.MailHost

Using email addresses with German umlauts raises UnicodeEncodeError #29