zbateson / mail-mime-parser

An email parser written in PHP
https://mail-mime-parser.org/
BSD 2-Clause "Simplified" License
455 stars 57 forks source link

`->getHeaderValue()` is normalizing whitespaces #127

Closed ThomasLandauer closed 3 years ago

ThomasLandauer commented 4 years ago

If I have this in the email (notice the two spaces):

Subject: foo  bar

...then $message->getHeader('subject')->getValue() and $message->getHeaderValue('subject') both give me this (notice the single space):

foo bar

I didn't look at your code yet, I wanted to ask you in advance: Are you doing this on purpose?

RFC 5322 says:

These are referred to as unstructured field bodies. Semantically, unstructured field bodies are simply to be treated as a single line of characters with no further processing (except for "folding" and "unfolding" as described in section 2.2.3).

zbateson commented 4 years ago

Without investigating I couldn't say either, haha :) let me know if you manage to have a look before I do here.

ThomasLandauer commented 4 years ago

I didn't read RFC 5322 from start to end. But as it says (see above): "with no further processing", I'm 99% sure that it's perfectly legal to have several whitespaces in the subject. After all, you certainly can have several whitespaces in the message's body ;-)

The problem comes from the fact that you split the subject on whitespaces into MimeLiteralParts. What's the purpose of that? Why don't you just keep it as a single string?

zbateson commented 4 years ago

It's because 'Subject' can actually contain RFC 2047 mime-encoded parts... that may be simply considered an 'extension' to 5322, and you may still be right regardless though (haven't looked for clues why I'm not preserving more than one whitespace, or if it's intentional).

zbateson commented 4 years ago

The reason this happens is this line here:

https://github.com/zbateson/mail-mime-parser/blob/1198b031303f9fdb521e5bad6ff2758d358dde47/src/Header/Consumer/AbstractConsumer.php#L228

Which as you've observed doesn't preserve multiple whitespaces in subjects (because the separator token is '\s+' as well). This works well for RFC 2047 encoding when they're next to text or next to each other, and how that's supposed to work.

I'm feeling torn on this one...

On the one hand, you're right -- the RFC doesn't specifically say they should be replaced by a single space as far as I can tell, but on the other hand I'm not sure most users would specifically write code to handle having the extra spaces either, and it may be more of an expectation that a subject wouldn't.

Open to hearing arguments on this one :).

zbateson commented 4 years ago

Also I'm not sure in this case that my usual Thunderbird test is valid... in this case I'm looking at their "display" which is different from their parsing also. Probably the Thunderbird 'positive' is more just that it wasn't handled specifically in that case, or that everywhere else it's being displayed as html anyway and supporting multiple spaces would be more work.

ThomasLandauer commented 4 years ago
zbateson commented 4 years ago

Yeah, I already agreed with all of those -- my remaining question is about the value in changing what's there and user expectation.

ThomasLandauer commented 4 years ago

Well, if you're asking me: You cannot be proud for following every RFC, and in this case ask for "user expectation"...

zbateson commented 4 years ago

Hahaha... well you got me there I guess :+1:

zbateson commented 3 years ago

Released in 1.3.0