zbateson / mail-mime-parser

An email parser written in PHP
https://mail-mime-parser.org/
BSD 2-Clause "Simplified" License
459 stars, 58 forks

Decoding error on separated base64_encoded multibyte characters #128

Closed: ThomasLandauer closed this issue 4 years ago

ThomasLandauer commented 4 years ago

I have this in the email:

Subject: =?utf-8?B?b2LD?=
 =?utf-8?B?pHI=?=

$message->getHeader('subject')->getValue() gives me

ob??r

But php-mime-mail-parser's $parser->getHeader('subject') gives me

obär

Why is that?

Here's how I created the two parts:

$p1 = base64_encode(substr('obär', 0, 3));
$p2 = base64_encode(substr('obär', 3));

ä is a 2-byte character in UTF-8, and with substr() I only extract its first byte. Now I base64_encode those 3 bytes to b2LD. If I decode it, I get ob plus the first half of ä (which doesn't make any sense in itself and is therefore represented by ?). Same for the second part.

Because you'll ask: Yes, I do have a real-world example for this :-) But it's impossible to estimate how widespread the "bug" is...

So what's the solution? Reassemble the two base64 strings to b2LDpHI=, then decode this:

var_dump(base64_decode($p1.$p2)); // => `obär`

Your library treats the two subject lines as separate HeaderParts, and therefore decodes each individually.
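To make the difference concrete, here is a minimal sketch of the two decoding strategies for the first example. isValidUtf8() is a hypothetical helper introduced only for this illustration; it is not part of either library.

```php
<?php
// "obär" as UTF-8 bytes; "ä" is the two-byte sequence 0xC3 0xA4.
$subject = "ob\xC3\xA4r";

// Split after byte 3, i.e. between the two bytes of "ä".
$p1 = base64_encode(substr($subject, 0, 3)); // "b2LD"
$p2 = base64_encode(substr($subject, 3));    // "pHI="

// Hypothetical helper: true if $bytes is a well-formed UTF-8 string.
function isValidUtf8(string $bytes): bool
{
    // With the /u flag, preg_match() fails on malformed UTF-8 input.
    return preg_match('//u', $bytes) === 1;
}

// Decoding each part on its own yields two byte strings that are
// both malformed UTF-8, because each holds only half of "ä".
$separately = [
    isValidUtf8(base64_decode($p1)),
    isValidUtf8(base64_decode($p2)),
];

// Decoding the concatenated base64 text restores the original bytes.
$joined = base64_decode($p1 . $p2);
```

Here $separately ends up as [false, false], while $joined is byte-for-byte identical to $subject. Joining the base64 text only works in this case because the first part has no padding.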

But read on, it gets worse ;-)

Let's add one character to the string:

$p1 = base64_encode(substr('fobär', 0, 4));
$p2 = base64_encode(substr('fobär', 4));

Now the first part is encoded to Zm9iww== (notice the two padding = at the end). But if you concatenate this with the second part (still pHI=), it doesn't work:

var_dump(base64_decode($p1.$p2)); // => `fob� G`

But somehow php-mime-mail-parser manages to get this to work:

$parser->getHeader('subject');

returns fobär!
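One way a parser could arrive at fobär here is to decode every chunk to raw bytes first and only then join the bytes, so the padding of each chunk is consumed before concatenation. The following is a sketch under that assumption; I have not checked that php-mime-mail-parser actually works this way, and decodeBase64ChunksToBytes() is a made-up name.

```php
<?php
/**
 * Hypothetical helper: decode each base64 chunk on its own and
 * concatenate the resulting bytes (not the base64 text).
 *
 * @param string[] $chunks base64 strings, each padded on its own
 */
function decodeBase64ChunksToBytes(array $chunks): string
{
    $bytes = '';
    foreach ($chunks as $chunk) {
        // Strict mode returns false on characters outside the alphabet.
        $decoded = base64_decode($chunk, true);
        if ($decoded === false) {
            throw new InvalidArgumentException("Invalid base64 chunk: $chunk");
        }
        $bytes .= $decoded;
    }
    // The caller interprets the joined bytes in the declared charset.
    return $bytes;
}

$subject = decodeBase64ChunksToBytes(['Zm9iww==', 'pHI=']);
```

Because the character set is applied only after all bytes are joined, it no longer matters whether a chunk boundary falls inside a multibyte character.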

zbateson commented 4 years ago

Hmm, interesting, but:

  1. Is that valid/legal? I.e. can you have a base64-encoded part split into two incomplete base64 parts? I've not encountered it, and I can't remember what the specs say.
  2. Is it otherwise widespread?

Thanks

ThomasLandauer commented 4 years ago
  1. I was hoping you could tell me that ;-) But those parts aren't incomplete, since base64 encodes bytes, not characters, see https://stackoverflow.com/a/2587144/1668200 So which spec should forbid it?
  2. Well, even if every software does it this way, it won't happen too often ;-) You need base64 (instead of quoted-printable), you need a subject containing multibyte characters, you need a subject longer than 1 line, and you need bad luck to reach your line length limit just in the middle of this character.

But I must say that, unlike https://github.com/zbateson/mail-mime-parser/issues/119 and https://github.com/zbateson/mail-mime-parser/issues/120 ;-), I see this as a systematic bug, and I think it should be fixed! All the more so as it works in php-mime-mail-parser.

zbateson commented 4 years ago

I was hoping you could tell me that ;-) But those parts aren't incomplete, since base64 encodes bytes, not characters, see https://stackoverflow.com/a/2587144/1668200 So which spec should forbid it?

If you use for instance mime_header_encode (or whatever the php function is), each part is a 'complete' base64 encoded part... yes, it encodes bytes... I'm aware of that, but a single byte encoded to base64 is followed by '==' to make it a complete 'base64 encoded part'... i.e. the string 'a' is encoded to 'YQ==', the string 'aa' is encoded to 'YWE='... if you give me a string 'YQ' and tell me it's base64, it's incomplete... it's missing something... but you could give me one part that's 'YQ==', and another part 'YQ==', or a single 'YWE=' to mean the same thing.

zbateson commented 4 years ago

here you go: https://tools.ietf.org/html/rfc2047

Each 'encoded-word' MUST encode an integral number of octets. The 'encoded-text' in each 'encoded-word' must be well-formed according to the encoding specified; the 'encoded-text' may not be continued in the next 'encoded-word'. (For example, "=?charset?Q?=?= =?charset?Q?AB?=" would be illegal, because the two hex digits "AB" must follow the "=" in the same 'encoded-word'.)

Again though, I'm a parser, so point 2 from my earlier comment still applies.

ThomasLandauer commented 4 years ago

This is not an answer to your comments yet.

While investigating https://github.com/zbateson/mail-mime-parser/issues/127 I just realized that RFC 5322 only allows folding at existing whitespace:

The general rule is that wherever this specification allows for folding white space (not simply WSP characters), a CRLF may be inserted before any WSP.

So the above subject does indeed look illegal at first glance, namely because of the space at the beginning of the second line (there is no space in obär). However, what's the legal procedure for folding a line that just doesn't contain any whitespace? Others are having the same problem: https://github.com/PHPMailer/PHPMailer/issues/1525#issuecomment-527225908

And without reading all of this, I'm guessing that adding that whitespace is the only reasonable solution ;-)

zbateson commented 4 years ago

@ThomasLandauer -- ooh man, sorry but you're going to have to do your own homework on these and present it if you find an issue -- I'm spending too much time on this, so I'll either be slowing down my responses or letting you investigate.

The spacing is not illegal in your example, only how it's split. Whitespace between two RFC 2047 parts is valid (and ignored), but each MIME-encoded header part (RFC 2047) needs to be complete.

zbateson commented 4 years ago

The PHPMailer example is saying "if the header's too long, use RFC 2047", because without that, a header that doesn't contain spaces can't be very long... but RFC 2047 will ignore whitespace between two encoded parts.
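As a rough illustration of "whitespace between two encoded parts is ignored", the sketch below removes the whitespace (including a folded CRLF) that sits between two adjacent encoded-words and leaves all other whitespace alone. dropWhitespaceBetweenEncodedWords() is a hypothetical name, and the pattern is a simplification: it only looks for "?=" followed by whitespace and "=?", without validating that both sides are real encoded-words.

```php
<?php
/**
 * Hypothetical helper: drop linear whitespace that separates two
 * adjacent RFC 2047 encoded-words in a raw header value.
 */
function dropWhitespaceBetweenEncodedWords(string $headerValue): string
{
    // "?=" closes one encoded-word and "=?" opens the next one.
    // The lookahead keeps the opening "=?" out of the match.
    $result = preg_replace('/(\?=)[ \t\r\n]+(?==\?)/', '$1', $headerValue);

    // preg_replace() returns null on error; fall back to the input.
    return $result ?? $headerValue;
}

$folded = "=?utf-8?B?YQ==?=\r\n =?utf-8?B?Yg==?=";
$unfolded = dropWhitespaceBetweenEncodedWords($folded);
```

Whitespace between an encoded-word and ordinary text is significant, so the helper does not touch it.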

zbateson commented 4 years ago

Sorry, I should also say that I do appreciate that you're reviewing things; it is helpful. It's just that my responses and investigations will be a bit more staggered for a while as I do some of my day-job work :)

ThomasLandauer commented 4 years ago

Just adding the even more relevant paragraph of RFC 2047:

Each 'encoded-word' MUST represent an integral number of characters. A multi-octet character may not be split across adjacent 'encoded-word's.
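For checking a header against that rule, something like the following could be used. It is only a sketch: findSplitCharacterWords() is a hypothetical name, and it handles just base64 ("B") encoded-words that declare utf-8, flagging every word whose decoded bytes are not well-formed UTF-8 on their own.

```php
<?php
/**
 * Hypothetical helper: return the utf-8 "B" encoded-words in a header
 * value that do not decode to a whole number of characters.
 *
 * @return string[] the offending encoded-words, in order of appearance
 */
function findSplitCharacterWords(string $headerValue): array
{
    // Capture the encoded-text of each =?utf-8?B?...?= word.
    preg_match_all(
        '/=\?utf-8\?B\?([A-Za-z0-9+\/=]*)\?=/i',
        $headerValue,
        $matches
    );

    $offending = [];
    foreach ($matches[1] as $i => $encodedText) {
        $bytes = base64_decode($encodedText, true);
        // With the /u flag, preg_match() fails on malformed UTF-8.
        if ($bytes === false || preg_match('//u', $bytes) !== 1) {
            $offending[] = $matches[0][$i];
        }
    }
    return $offending;
}

$bad = findSplitCharacterWords("=?utf-8?B?b2LD?=\r\n =?utf-8?B?pHI=?=");
$good = findSplitCharacterWords("=?utf-8?B?b2LDpHI=?=");
```

For the subject from this issue, both encoded-words are reported, while the single-word form b2LDpHI= passes.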