Closed by ThomasLandauer 4 years ago
Hmm, interesting, but:
1) Is that valid/legal? I.e., can you have a base64-encoded part split into two incomplete base64 parts? I've not encountered it, and I can't remember what the specs say.
2) Is it otherwise widespread?
Thanks
But I must say (other than https://github.com/zbateson/mail-mime-parser/issues/119 and https://github.com/zbateson/mail-mime-parser/issues/120 ;-)) I see this as a systematic bug, and I think it should be fixed! Even more so as it works in php-mime-mail-parser.
I was hoping you could tell me that ;-) But those parts aren't incomplete, since base64 encodes bytes, not characters, see https://stackoverflow.com/a/2587144/1668200. So which spec should forbid it?
If you use, for instance, mime_header_encode (or whatever the PHP function is), each part is a 'complete' base64-encoded part... yes, it encodes bytes... I'm aware of that, but a single byte encoded to base64 is followed by '==' to make it a complete 'base64 encoded part'... i.e. the string 'a' is encoded to 'YQ==', the string 'aa' is encoded to 'YWE='... if you give me a string 'YQ' and tell me it's base64, it's incomplete... it's missing something... but you could give me one part that's 'YQ==' and another part 'YQ==', or a single 'YWE=', to mean the same thing.
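To make the padding point concrete, here is a minimal sketch in Python (not the PHP used in this thread, and not code from either library). It only uses the standard `base64` module, and the variable names are made up for illustration.

```python
import base64

# Encoding pads the output so its length is always a multiple of 4.
one_byte = base64.b64encode(b"a").decode("ascii")    # one input byte
two_bytes = base64.b64encode(b"aa").decode("ascii")  # two input bytes

# Two complete parts decoded separately yield the same bytes as one part.
separately = base64.b64decode("YQ==") + base64.b64decode("YQ==")
together = base64.b64decode("YWE=")

print(one_byte, two_bytes, separately == together)
```

Each padded part is decodable on its own, which is what "complete" means here.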
here you go: https://tools.ietf.org/html/rfc2047
Each 'encoded-word' MUST encode an integral number of octets. The 'encoded-text' in each 'encoded-word' must be well-formed according to the encoding specified; the 'encoded-text' may not be continued in the next 'encoded-word'. (For example, "=?charset?Q?=?= =?charset?Q?AB?=" would be illegal, because the two hex digits "AB" must follow the "=" in the same 'encoded-word'.)
Again though, I'm a parser, so my # 2 still applies from my comment.
This is not an answer to your comments yet.
While investigating https://github.com/zbateson/mail-mime-parser/issues/127 I just realized that RFC 5322 only allows folding on existing whitespace:
The general rule is that wherever this specification allows for folding white space (not simply WSP characters), a CRLF may be inserted before any WSP.
So the above subject does indeed look illegal at first glance, namely for the space at the beginning of the second line (there is no space in obär). However, what's the legal procedure to fold a line that just doesn't contain any whitespace? Others are having the same problem: https://github.com/PHPMailer/PHPMailer/issues/1525#issuecomment-527225908
And without reading all of this, I'm guessing that adding that whitespace is the only reasonable solution ;-)
@ThomasLandauer -- ooh man, sorry but you're going to have to do your own homework on these and present it if you find an issue -- I'm spending too much time on this, so I'll either be slowing down my responses or letting you investigate.
The spacing is not what's illegal in your example, just how it's split. Whitespace between two RFC 2047 parts is valid (and ignored), but each MIME-encoded header part (RFC 2047) needs to be complete.
The PHPMailer example is saying "if the header's too long, use RFC 2047", because without that, a header that doesn't contain spaces can't be very long... but RFC 2047 will ignore whitespace between two encoded parts.
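As an illustration of folding a subject that has no spaces, here is a rough Python sketch. `encode_words` is a hypothetical helper written for this comment, not something PHPMailer or either parser provides. It splits on character boundaries so every encoded-word is complete, then decodes the result with Python's standard `email.header` module.

```python
import base64
from email.header import decode_header, make_header

def encode_words(text, chars_per_word):
    """Split text on character boundaries and B-encode each chunk separately."""
    chunks = [text[i:i + chars_per_word]
              for i in range(0, len(text), chars_per_word)]
    words = ["=?UTF-8?B?%s?=" % base64.b64encode(c.encode("utf-8")).decode("ascii")
             for c in chunks]
    # CRLF followed by a space folds the header between encoded-words.
    return "\r\n ".join(words)

folded = encode_words("obär", 2)
# The whitespace between the two encoded-words is dropped when decoding.
decoded = str(make_header(decode_header(folded)))
print(repr(folded), decoded)
```

Because the cut happens between characters rather than between bytes, each word decodes on its own and the fold adds no visible space.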
Sorry I should say also I do appreciate that you're reviewing things though, it is helpful, just that my responses and investigations will be a bit more staggered for a bit while I do some of my dayjob work :)
Just adding the even more relevant paragraph of RFC 2047:
Each 'encoded-word' MUST represent an integral number of characters. A multi-octet character may not be split across adjacent 'encoded-word's.
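A quick way to test a header against this rule is to decode every encoded-word on its own and see whether it yields whole characters. The Python sketch below assumes only "B" encoded-words and uses a simplified regular expression; `words_are_self_contained` is a made-up name, not an API of any library mentioned here.

```python
import base64
import re

# Simplified match for one encoded-word: =?charset?B?encoded-text?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([bB])\?([^?]*)\?=")

def words_are_self_contained(header_value):
    """Return True if every B-encoded word decodes to whole characters alone."""
    for charset, _encoding, text in ENCODED_WORD.findall(header_value):
        try:
            base64.b64decode(text, validate=True).decode(charset)
        except ValueError:  # covers bad base64 and truncated characters
            return False
    return True

print(words_are_self_contained("=?UTF-8?B?b2I=?= =?UTF-8?B?w6Ry?="))
print(words_are_self_contained("=?UTF-8?B?b2LD?= =?UTF-8?B?pHI=?="))
```

The second header is the one discussed in this issue: its first word ends in the middle of a two-byte character, so the check fails.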
I have this in the email:
$message->getHeader('subject')->getValue()
gives me one result, but php-mime-mail-parser's
$parser->getHeader('subject')
gives me another. Why is that?
Here's how I created the two parts: ä is a 2-byte character in UTF-8, and with substr() I only extract its first byte. Now I base64_encode those 3 bytes to b2LD. If I decode it, I get ob plus the first half of ä (which doesn't make any sense in itself and is therefore represented by ?). Same for the second part.

Because you'll ask: Yes, I do have a real-world example for this :-) But it's impossible to estimate how widespread the "bug" is...
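The split described above can be reproduced with a few lines of Python (a sketch for illustration, not the original PHP; slicing the byte string plays the role of substr()):

```python
import base64

raw = "obär".encode("utf-8")        # ä takes two bytes in UTF-8
first, second = raw[:3], raw[3:]    # the cut falls inside ä

part1 = base64.b64encode(first).decode("ascii")
part2 = base64.b64encode(second).decode("ascii")

# Decoding each part on its own cannot recover ä; the stray byte
# becomes the replacement character instead.
decoded1 = base64.b64decode(part1).decode("utf-8", errors="replace")
decoded2 = base64.b64decode(part2).decode("utf-8", errors="replace")

print(part1, part2, decoded1, decoded2)
```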
So what's the solution? Reassemble the two base64 strings to b2LDpHI=, then decode this.

Your library treats the two subject lines as separate HeaderParts, and therefore decodes each individually.

But read on, it gets worse ;-)
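For reference, the reassembly idea looks like this in Python (again only a sketch of the idea, not how either library is implemented). It works here only because the first part is exactly 3 bytes long and therefore carries no padding:

```python
import base64

# Join the base64 text of both parts first, then decode once.
joined = "b2LD" + "pHI="
subject = base64.b64decode(joined).decode("utf-8")

print(subject)
```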
Let's add one character to the string. Now the first part is encoded to Zm9iww== (notice the two padding = at the end). But if you concatenate this with the second part (still pHI=), it doesn't work.

But somehow php-mime-mail-parser manages to get this to work and returns fobär!
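One approach that handles the padded case is to decode each part to bytes first, join the bytes, and only then apply the charset. Whether php-mime-mail-parser actually works this way is an assumption I have not checked; the Python sketch below only shows that the approach yields the expected string.

```python
import base64
import binascii

parts = ["Zm9iww==", "pHI="]

# Joining the base64 text puts padding in the middle of the string,
# which a strict decoder rejects.
try:
    base64.b64decode("".join(parts), validate=True)
    joined_text_ok = True
except binascii.Error:
    joined_text_ok = False

# Decode each part to bytes, join the bytes, then decode the charset.
raw = b"".join(base64.b64decode(p) for p in parts)
subject = raw.decode("utf-8")

print(joined_text_ok, subject)
```

Deferring the charset conversion until all bytes are collected is what makes a character split across two parts recoverable.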