zbateson / mail-mime-parser

An email parser written in PHP
https://mail-mime-parser.org/
BSD 2-Clause "Simplified" License

double quotes " can break decoding #159

Open markusramsak opened 3 years ago

markusramsak commented 3 years ago

The following simplified version of a real message cannot be parsed correctly, because the closing quote in the "From:" line sits before the ?= that terminates the encoded part.

Delivered-To: notimportant@gmail.com
Date: Thu, 10 Sep 2020 09:29:57 -0400
To: <notimportantto@gmail.com>
From: "Amway =?utf-8?q?=C3=96sterreich"?= <amway@amwayemail.com>
Subject: Amway Newsletter Nr. 18 - 10. September 2020
Message-ID: <SEMA-CR-1-1EMM4DXI@amwayemail.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: Quoted-Printable

If I move the closing quote after ?=, it works:

Delivered-To: notimportant@gmail.com
Date: Thu, 10 Sep 2020 09:29:57 -0400
To: <notimportantto@gmail.com>
From: "Amway =?utf-8?q?=C3=96sterreich?=" <amway@amwayemail.com>
Subject: Amway Newsletter Nr. 18 - 10. September 2020
Message-ID: <SEMA-CR-1-1EMM4DXI@amwayemail.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: Quoted-Printable

Please fix this so the parser can handle such headers.

zbateson commented 3 years ago

Hi @markusramsak --

A quoted part takes precedence. Specifically, "An 'encoded-word' MUST NOT appear within a 'quoted-string'." See https://tools.ietf.org/html/rfc2047#section-5

I believe what you're trying to say is that the mime-encoded part isn't "decoded", but that's correct behaviour as far as I'm aware. It would be hard to build an exception for your case without breaking input that should be considered valid, because the quotes are supposed to take precedence, at least as far as I can tell.

Feel free to correct me with relevant examples to facilitate a discussion about it: handling by popular mail parsers or clients, RFCs, or other libraries that specifically handle your situation differently.

markusramsak commented 3 years ago

I know that it shouldn't happen, but I am the programmer of a mail client with more than 100,000 mails to parse and display, and the only thing I can say is: it happens. I simplified the mail, but the issue is real in every newsletter email from the company Amway (https://www.amway.at).

Other mail clients like Gmail or Apple Mail can decode this mail subject correctly, and I would like to as well.

Maybe it is just a matter of replacing "?=[space] with ?="[space], but I don't know if that would break anything.

zbateson commented 3 years ago

Unfortunately the way the parser works, the 'part looking for quotes' is separate from the 'part looking for mime encoded parts'. It's semantically okay for a mime-encoded part to have a quote in it, it just won't be handled as a 'control character' terminating (or starting) a quoted-part.

markusramsak commented 3 years ago

If it can't be done on your side, then I would implement a step on my side that replaces these wrong characters in the "From:" line before it is parsed by your parser. I would call it "preparsing" because it happens before your complex parsing.
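
A minimal sketch of such a pre-parsing step, written in Python purely for illustration (the thread contains no code, and nothing here is part of mail-mime-parser). The function name `fix_misplaced_quote` and the regular expression are assumptions. The pattern targets only the one malformation reported above, an encoded part whose closing ?= is directly preceded by a stray double quote, and moves that quote behind the ?=. It uses only constructs that PCRE also supports, so it should carry over to PHP's `preg_replace`, but that port is untested here.

```python
import re

# Hypothetical pattern: '=?charset?B-or-Q?text' immediately followed by '"?='.
# Group 1 captures the encoded part up to, but not including, the stray quote.
MISPLACED_QUOTE = re.compile(r'(=\?[^?\s]+\?[bBqQ]\?[^?\s"]*)"\?=')


def fix_misplaced_quote(header_line: str) -> str:
    """Move a double quote that sits just before '?=' to just after it."""
    return MISPLACED_QUOTE.sub(r'\g<1>?="', header_line)


broken = 'From: "Amway =?utf-8?q?=C3=96sterreich"?= <amway@amwayemail.com>'
print(fix_misplaced_quote(broken))
# From: "Amway =?utf-8?q?=C3=96sterreich?=" <amway@amwayemail.com>
```

Headers that are already well formed do not match the pattern and pass through unchanged. In practice the rewrite would be applied only to the header block of the raw message, before the message is handed to the parser.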

markusramsak commented 3 years ago

By the way, you did an excellent job with this library! On average, about 9995 out of 10000 emails in my web mail client (backed by your library) can be parsed without any issues.

zbateson commented 3 years ago

> If it can't be done on your side, then I would implement a step on my side that replaces these wrong characters in the "From:" line before it is parsed by your parser.

I'm not sure that it can't, but it would take effort. I'd have to change the precedence of how things are parsed, which would make some valid but extremely unlikely cases invalid, for example From: "My =?utf-8?Q?"weird"?= name" <blah@example.com> (i.e. a name that purposely contains what looks like a mime-encoded part). I can't imagine that would ever be an issue, but other things may be affected too because of how the parsers are built, so it would have to be investigated.
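
The precedence trade-off can be illustrated with a toy sketch in Python. This is not the library's parser: `quotes_first`, `words_first` and `decode_word` are hypothetical helpers, and the sketch ignores backslash escapes, folding, and the whitespace rules between adjacent encoded parts. It only shows that the two orders of operation give different results for the display names discussed in this thread.

```python
import base64
import quopri
import re

# Simplified matcher for '=?charset?B-or-Q?text?=' (an assumption, not a full
# RFC 2047 grammar).
ENCODED_WORD = re.compile(r'=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=')


def decode_word(match):
    """Decode one matched encoded part to text."""
    charset, encoding, text = match.group(1), match.group(2).lower(), match.group(3)
    if encoding == 'q':
        raw = quopri.decodestring(text.encode('ascii'), header=True)
    else:
        raw = base64.b64decode(text)
    return raw.decode(charset, errors='replace')


def quotes_first(value):
    """Cut the value at double quotes first, then decode within each piece."""
    pieces = re.split(r'("[^"]*")', value)
    out = []
    for i, piece in enumerate(pieces):
        if i % 2 == 1:  # odd indexes hold the captured quoted pieces
            piece = piece[1:-1]
        out.append(ENCODED_WORD.sub(decode_word, piece))
    return ''.join(out)


def words_first(value):
    """Decode encoded parts across the whole value, then drop the quotes."""
    return ENCODED_WORD.sub(decode_word, value).replace('"', '')


for name in ('"Amway =?utf-8?q?=C3=96sterreich"?=',
             '"Amway =?utf-8?q?=C3=96sterreich?="',
             '"My =?utf-8?Q?"weird"?= name"'):
    print(repr(quotes_first(name)), repr(words_first(name)))
```

With quotes handled first, the Amway name is decoded only when the closing quote follows the ?=. With encoded parts handled first, both Amway variants decode, but the "weird" example is read differently than under the quotes-first order, which is the kind of side effect that would need investigating.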

If you're able to sanitize for known exceptions like that, I think that would be the way to go, at least for now. We can leave this open and look at it when there's time or if it's affecting more people. You could also try emailing the folks at Amway to tell them there's an issue with their emails :) Maybe they're using an in-house system that needs to be patched, or maybe it's a huge commercial system, in which case handling this scenario should be prioritized.

> By the way, you did an excellent job with this library! On average, about 9995 out of 10000 emails in my web mail client (backed by your library) can be parsed without any issues.

Excellent, very happy to hear that!