zbateson / mail-mime-parser

An email parser written in PHP
https://mail-mime-parser.org/
BSD 2-Clause "Simplified" License

Ignore unencoded special characters #130

Closed: ThomasLandauer closed this 4 years ago

ThomasLandauer commented 4 years ago

Continuing my homework ;-)

If I have this in the email (notice the unencoded tab and ä):

Subject: =?utf-8?Q?f=C3=B6=C3=B6    bär?=

...$message->getHeader('subject')->getValue() just returns the undecoded string:

=?utf-8?Q?f=C3=B6=C3=B6 bär?=

However, php-mime-mail-parser returns föö bär, since it just throws the string into quoted_printable_decode().

I don't know if leaving some characters unencoded is legal or not - didn't look in the RFCs.

But my question is: Why are you doing more work (i.e. somehow "validating" the string), instead of just throwing it into quoted_printable_decode() and taking whatever it returns? Where is this happening in your code? (I couldn't find it.)

zbateson commented 4 years ago

> Continuing my homework ;-)

Hahaha

> However, php-mime-mail-parser returns föö bär, since it just throws the string into quoted_printable_decode().

Well, that's easy... it's because I'm following the RFC :smile:

I wrote a parser to handle as much of the RFC as possible. That means whitespace has to be used as a delimiter, which is why RFC 2047 specifically prohibits whitespace inside an 'encoded-word'.

> An 'encoded-word' is defined by the following ABNF grammar. The notation of RFC 822 is used, with the exception that white space characters MUST NOT appear between components of an 'encoded-word'.
>
> encoded-word = "=?" charset "?" encoding "?" encoded-text "?="

I'm not specifically prohibiting whitespace. It's just that, in most cases, a header is split into its components before any RFC 2047 decoding happens (an exception had to be made for 'message-id', where decoding happens first, because of #109).
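To illustrate why that ordering matters, here is a minimal sketch. It is not the library's actual code: `decodeToken()` and `decodeHeaderValue()` are made-up names, the splitting is reduced to plain whitespace, charset conversion is skipped by assuming UTF-8, and the RFC 2047 rule about dropping whitespace between adjacent encoded-words is ignored.

```php
<?php
// Hypothetical sketch, NOT mail-mime-parser's real implementation.
// RFC 2047: encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
const ENCODED_WORD = '/^=\?([^?\s]+)\?([QqBb])\?([^?\s]+)\?=$/D';

function decodeToken(string $token): string
{
    if (!preg_match(ENCODED_WORD, $token, $m)) {
        return $token; // not a complete encoded-word: leave it untouched
    }
    [, $charset, $encoding, $text] = $m;
    // Assumption: the charset is UTF-8, so $charset is ignored here.
    return strtoupper($encoding) === 'B'
        ? base64_decode($text)
        : quoted_printable_decode(str_replace('_', ' ', $text)); // "_" means space in Q encoding
}

function decodeHeaderValue(string $value): string
{
    // Step 1: delimit on whitespace. Step 2: decode each token on its own.
    $tokens = preg_split('/\s+/', trim($value));
    return implode(' ', array_map('decodeToken', $tokens));
}

// The unencoded tab splits the value into two tokens, and neither half
// is a complete encoded-word, so nothing gets decoded:
echo decodeHeaderValue("=?utf-8?Q?f=C3=B6=C3=B6\tbär?="), PHP_EOL;
// prints: =?utf-8?Q?f=C3=B6=C3=B6 bär?=

// A properly encoded version is a single token and decodes fine:
echo decodeHeaderValue("=?utf-8?Q?f=C3=B6=C3=B6_b=C3=A4r?="), PHP_EOL;
// prints: föö bär
```

Once the split has happened, each token stands on its own, so a stray tab or space inside the encoded-text can't be "repaired" at the decoding stage.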

This way of doing things allows me to fully support 'valid' headers... with their comments, weird nested comments, quoted parts, escaped characters, address groups, RFC 2047, RFC 2231, and whatever other weird things are thrown at it.

I don't know specifically which of those are or aren't supported by php-mime-mail-parser, and it probably doesn't matter... the quirkier bits of the standards are so rarely encountered that it makes little difference for that project. My goals are different -- which is why I don't use that project as a gauge myself... but it's also why thinking up random scenarios and testing them might not be useful either... some standards need to be followed (you still put =?utf-8 in your test... what if the header had =&utf-8 instead? Point being, both are equally invalid) :stuck_out_tongue_closed_eyes:

ThomasLandauer commented 4 years ago

Here's the more relevant part of RFC 2047:

> encoded-text = 1*<Any printable ASCII character other than "?" or SPACE> (but see "Use of encoded-words in message headers", section 5)

... and neither tab nor ä is a "printable ASCII character".
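As a rough check of that rule, here is a small sketch. The helper name `isValidEncodedText()` is made up for illustration, and it only tests the character set from the grammar line above, not the additional restrictions of section 5.

```php
<?php
// encoded-text = 1*<Any printable ASCII character other than "?" or SPACE>
// Printable ASCII without SPACE is 0x21-0x7E; "?" is 0x3F,
// which leaves the two ranges 0x21-0x3E and 0x40-0x7E.
function isValidEncodedText(string $text): bool
{
    return preg_match('/^[\x21-\x3E\x40-\x7E]+$/D', $text) === 1;
}

var_dump(isValidEncodedText("f=C3=B6=C3=B6"));        // bool(true)
var_dump(isValidEncodedText("f=C3=B6=C3=B6\tbär"));   // bool(false): tab and ä
var_dump(isValidEncodedText("f=C3=B6=C3=B6_b=C3=A4r")); // bool(true)
```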

Still wondering why quoted_printable_decode() does decode it. A possible explanation might be that they don't follow our "email" RFC 2047, but a (maybe) more liberal general-purpose quoted printable specification.

Anyway, you can set this to "wontfix" and close it :-)

zbateson commented 4 years ago

Yeah, quoted_printable_decode isn't a header-specific function (could be the body of a mime part with Content-Transfer-Encoding set to quoted-printable).
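A quick sketch of that difference, using only PHP's built-in function (the variable names are just for illustration): bytes that aren't part of an `=XX` sequence are passed through unchanged, and the header-only "underscore means space" rule of RFC 2047's Q encoding isn't applied.

```php
<?php
// The encoded-text from the subject above, with the literal tab and ä.
$raw = "f=C3=B6=C3=B6\tbär";
$decoded = quoted_printable_decode($raw);
var_dump($decoded === "föö\tbär"); // bool(true): tab and ä pass straight through

// "_" is only special in RFC 2047 "Q" encoding (headers), not in
// quoted-printable bodies, so the function leaves it alone:
$withUnderscore = quoted_printable_decode("f=C3=B6=C3=B6_bar");
var_dump($withUnderscore === "föö_bar"); // bool(true)
```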