Closed jugmac00 closed 3 years ago
Any opinion on this?
I'd wrap the three occurrences of formataddr
with a try except block, catching the UnicodeEncodeError
and raise a MailHostError
.
Is a pull request of this kind welcome?
Jürgen Gmach wrote at 2020-6-21 23:19 -0700:
Any opinion on this?
I'd wrap the three occurrences of
formataddr
with a try except block, catching theUnicodeEncodeError
and raise aMailHostError
.
I am in general unhappy with exception conversion (here "UnicodeEncodeError -> MailHostError"): it makes the analysis of problems significantly more complex because there is a risk that information is lost and it is difficult to analyse the state of the original exception in a debugger.
The information loss risk is less prominent with Python 3
because it supports nested exceptions but some components
(--> Products.SiteErrorLog
) do not yet fully support nested
exceptions.
The MIME standards documents describe how to handle
special characters in mail headers and I think
Python's email
package implement it well.
Thus, instead of a "UnicodeEncodeError -> MailHostError" conversion
I would prefer to use email
functionality to convert the incoming
data in a standard conform way and use the result. This should
avoid the UnicodeEncodeError
in the first place avoiding a need to
convert it into MailHostError
.
Thanks for your input, @d-maurer
Just to make sure we both speak about the same problem - my problem was one user entering an email address with German Umlauts, e.g. "jürgen.gmach@apis.de" (by mistake).
While handling other parts of the header seems to be no problem in general (name part of an email address or the subject), there seems to be no broad support for email addresses with non ascii characters in general.
Instead of handle special characters and trying to massage them, I'd say we should not allow email addresses with e.g. umlauts at all.
If we transform an email address, what happens with that email at any of the next smtp servers?
I suggested to convert formataddr
s UnicodeEncodeError
into a MailHostError
, as then the developers of Zope applications simply can catch a MailHostError
for all email related problems.
There is only one place where an uncaught UnicodeEncodeError
can happen in cpython's formataddr
, see
https://github.com/python/cpython/blob/36ff513f82e372ed3cea0bf7cbdf15a1ef6dab9e/Lib/email/utils.py#L91
more context: https://github.com/python/cpython/blob/36ff513f82e372ed3cea0bf7cbdf15a1ef6dab9e/Lib/email/utils.py#L76-L106 ... at least for Python 3.
I am -1 on transforming the email address, and +-0 on keeping the current state, which is kinda surprising for a Zope developer. Maybe it would be also ok to just amend the docstrings of MailHost
so we make clear that only ascii-safe email addresses are supported.
Then every user of MailHost has to do some input validation before calling any of MailHost's public methods.
Jürgen Gmach wrote at 2020-6-22 01:51 -0700:
... Just to make sure we both speak about the same problem - my problem was one user entering an email address with German Umlauts, e.g. "jürgen.gmach@apis.de" (by mistake).
I did not look at the issue and replied just to the comment mail. Thus, my reply may be partially inadequate.
The addresses in email messages can contain non ACII characters.
Those addresses have roughly the general form
"something" smtp_address
or something <stmp_address>
or
smtp_address
.
The "something" above can contain almost arbitrary characters
but those need to be get MIME compatible encoding before the message
can be sent.
smtp_address
may still be ASCII only. If so, I expect it to change
in the not so far future because one part of the smtp_address
is the domain and domain specifications have been internationalized
some years ago. You can now have international characters in
domain specifications -- expected to be encoded for underlaying
protocols (like HTTP; maybe smtp).
As it is now allowed to use international characters in domains,
it would surprise me when the far more end user relevant email addresses
would remain ASCII only.
... Instead of handle special characters and trying to massage them, I'd say we should not allow email addresses with e.g. umlauts at all.
This might still be appropriate for the time being (I have not checked whether there is already a standard for internationalized email addresses) -- but I assume this will no longer be the case in the future.
If we transform an email address, what happens with that email at any of the next smtp servers?
This is what should happen for wrong email addresses.
You can (possibly) detect some wrong email addresses (e.g. invalid syntax, including invalid characters) but you cannot detect all of them. The typical case "wrong email address" (misspelled, account deleted) can only be detected by mail transport system.
I suggested to convert
formataddr
sUnicodeEncodeError
into aMailHostError
, as then the developers of Zope applications simply can catch aMailHostError
for all email related problems.
I am still unhappy with this.
In your szenario wrong user input was responsible. I would prefer that this case would be detected far earlier - to provide immediate feedback to the user preventing him to enter invalid data into the system.
... I am -1 on transforming the email address
As explained above, I expect that in the future such a transformation will become necessary (in analogy to the processing of internationalized domain names). Its purpose is to transform an international email address into its ASCII encoding.
The transformation would likely be implemented by Python's email
package.
MailHost
s responsibility is to call the respective email
function.
Even with internationalized email addresses, there will still be invalid email addresses. The processing function may raise an exception for those cases.
I have not looked recently at MailHost
. It may already
do the right thing. And it may be that one of the exceptions
raised for invalid email addresses is UnicodeEncodeError
.
In this case, it the exception is is converted the new
exception should be specific, e.g. InvalidEmailAddress
not a general MailHostError
. InvalidEmailAddress
may
derive from MailHostError
and should contain sufficient information
about the original problem.
and +-0 on keeping the current state, which is kinda surprising for a Zope developer. Maybe it would be also ok to just amend the docstrings of
MailHost
so we make clear that only ascii-safe email addresses are supported.
In my view, it is not MailHost
's responsiblity to determine
which email addresses are supported. It should delegate
all email address handling to Python's email
package
and let this decide what it supports.
Then every user of MailHost has to do some input validation before calling any of MailHost's public methods.
The user will get an exception if the input is invalid (in an obvious way) -- even though it may not be a specific exception.
You might want to distinguish user errors from application (i.e.
programming) errors. But, ideally, user errors are detected
when the user provides the input. If this is done, then all
remaining errors are either application errors or temporary errors
(such as e.g. ConflictError
).
smtp_address
may still be ASCII only. If so, I expect it to change in the not so far future because one part of thesmtp_address
is the domain and domain specifications have been internationalized some years ago.
I referenced the cpython issue, where it was tried to implement smtp_addreses with non ASCII characters in my first post: https://bugs.python.org/issue11783
It does not look like it will be fixed anytime soon.
In your szenario wrong user input was responsible. I would prefer that this case would be detected far earlier - to provide immediate feedback to the user preventing him to enter invalid data into the system.
While it is common to do some "regex" check on email addresses, I have not seen encoding checks that high in the stack.
Luckily, I have wrapped all email building in a wrapper class of my own, so I have no problem to do those checks myself. Though, my initial idea was to handle this in MailHost, as other errors are handled there, too.
Speaking of error handling...
InvalidEmailAddress
may derive fromMailHostError
and should contain sufficient information about the original problem.
That was exactly my idea what to do, and I usually always build an exception hierarchy when I develop libraries. I did not propose this at first, as MailHost
and many Zope products do not seem to follow this schema.
e.g.from MailHost source code
MailHostError('Message already encoded')
and
MailHostError('No message recipients designated')
You might want to distinguish user errors from application (i.e. programming) errors.
I did not take notice of this aspect - thanks for pointing out!
There are many options what possible could be done now.
Until there is a decision what the best way is (even keep current's state), I will catch this kind of invalid email address (e.g. Umlaut) in my build_email
function, conveniently at one place.
If nothing will be changed on the MailHost code, at least the docstrings should be updated with a note of warning, that non-ascii characters are not allowed for the email address (the address part of it).
The issue is open for half a year - as there maybe be some free time during christmas holidays...
Any further or final opinion how an appropriate fix should look like?
Jürgen Gmach wrote at 2020-12-15 05:21 -0800:
The issue is open for half a year - as there maybe be some free time during christmas holidays...
Any further or final opinion how an appropriate fix should look like?
To which RFC's does your initial comment refer? As far as I know, email addresses are still ASCII only. The Python issue you mention relates to IDNA (= "Internationalized Domain NAme") - a way to have special characters in host/domain names -- not the name part of email addresses (as in your example).
You explained that the email was entered wrongly. In my view, input verification should handle wrong input. In is not a backend task to verify input.
Thank you for the feedback!
Closing this issue then.
Reproduction
Use an email address with a German umlaut, e.g.
jürgen.gmach@example.com
, and try to send an email viaMailHost.send
.This results in an
UnicodeEncodeError
.The error is thrown in
cpython
sformataddr
- which is only able to handle ascii safe email addresses. An attempt to implement modern RFCs have failed, see https://bugs.python.org/issue11783Suggestion
Instead of passing through the
UnicodeEncodeError
I suggest to catch the error inMailhost
and reraise anMailHostError
as it is done with other errors. That way one can catch all mail related errors more easily.P.S.: Hello from @ctheune