zbateson / mail-mime-parser

An email parser written in PHP
https://mail-mime-parser.org/
BSD 2-Clause "Simplified" License
458 stars 58 forks source link

Detect wrongly indented header lines #120

Closed ThomasLandauer closed 4 years ago

ThomasLandauer commented 4 years ago

Following up on our discussion at https://github.com/zbateson/mail-mime-parser/issues/119, there's another problem I just encountered in the wild (notice the one-space indentation of To and From):

Date: 27 Jul 2020 09:13:49 +0200
 To: foo@example.com
 From: <no-reply@example.com>

Source: Misconfigured auto-responder of some Austrian company; they didn't fix it since June 10.

Your parser treats the 2 indented lines as part of the Date header - which is syntactically the correct behavior, see RFC 5322.

My first thought was: "This is hopeless. You cannot look for arbitrary header names in any other (folded) headers - that would become a mess!" But after thinking some more, I had two ideas how to deal with it after all:

  1. The Date header doesn't have soo many variations. So if a (valid) Date header is followed by a CRLF and a WSP, the following string is probably not part of the Date, but rather the next header.

  2. (Even better cause more general): There's a set of headers every email usually has (To, From, Subject, Message-ID,...). If the parser does not find all of them in their correct place, start looking for them in "weird" places. Rationale: If the email does not have a correct To header, but the string To: does appear in the Subject header, which is more probable:

    • That this is the To header
    • That the mail does not have a To header and the subject contains the string To:
zbateson commented 4 years ago

Sorry, but you'll have to demonstrate that this is widespread beyond an auto-responder from a single company, see the other thread :) I'm thinking I'll add this as a rule somewhere.

In this case, your fix will also result in breaking potentially valid emails, and as per the Moore's Law you mentioned, it could indeed at some point be a problem for someone somewhere, with what is syntactically valid and okay :)

Consider:

Date: 27 Jul 2020 09:13:49 +0200 (email is
  To: blah@blah.com and
  From: valid@still-valid.org
  Sent at some point)

Fixing every possible bad instance of an email, you can imagine an infinite number of "invalid" cases as mentioned in the other issue... but a valid email should (ideally) never fail... you have to be extra careful to not break what is actually valid (IMO).

ThomasLandauer commented 4 years ago

Sorry, but you'll have to demonstrate that this is widespread beyond an auto-responder from a single company

Well, this example nicely shows why I don't believe in your concept of "widespread" use: If I (or my client, or my users) are doing big business with that single company, their mails are important to me/us (even auto-replies can sometimes include some valuable information). So even if some issue is neglect-able from a global perspective, it might still be important locally.

In this case, your fix will also result in breaking potentially valid emails

No! In your code example: If there is a (valid) To: somewhere, it would take precedence! But if your 4 lines are the entire headers, I'd rather try to take blah@blah.com and as To:, rather than saying "This email doesn't have a To header".

zbateson commented 4 years ago

To: isn't a required header.

If I (or my client, or my users) are doing big business with that single company

But you're free to fix it for yourself in your code...

if ($message->getHeader('to') === null) {
    // do what needs to be done
}

Sorry, a library can't be written to accommodate edge-cases like that, it needs to exist outside :)

ThomasLandauer commented 4 years ago

To: isn't a required header.

Yeah, technically you're right, just Date and From are required, see RFC 5322. But 99.9% of all emails also have To, Message-ID, Subject ;-)

But you're free to fix it for yourself in your code

How could that look like in this case? Since it's a fundamental problem with header detection, your library wouldn't help me at all, and I'd have to parse the entire email string (the .eml file, so to say) from scratch?

Sorry, a library can't be written to accommodate edge-cases like that, it needs to exist outside :)

Well, those "community classes" I mentioned at https://github.com/zbateson/mail-mime-parser/issues/119#issuecomment-668548827 would be (sort of) outside. Besides, it would also make your own code cleaner if you'd refactor https://github.com/zbateson/mail-mime-parser/blob/master/src/Header/Part/DatePart.php#L42-L49 out of the main parser to some error-handling class.

But let's wait how many other "fixable" errors I find, to see if it's worth it ;-)

zbateson commented 4 years ago

Yeah, like I said I'm not against it being a framework to handle cases like that... I seem to remember something like that in php-mime-mail-parser being created a year or two ago to handle such things if I recall correctly.

In your case, you know the correct header is part of the Date header... instead of me looking at the Date header, you could:

// if to header is empty, then
$dateHeader = $message->getHeader('date')->getRawValue();
// parse out the To: line, you could even send it to my parser if you grab the whole line:
$toLine = preg_match(....., $dateHeader); //, or whatever needed to extract the line...
$message->setRawHeader('to', $toLine);
echo $message->getHeader('to')->getEmail();
zbateson commented 4 years ago

But 99.9% of all emails also have To

Hahaha, I'd dispute that one also, I think it's reasonably common to have all recipients be BCC'd (many list emails would do this, which would make up a huge number of emails) :smile:

ThomasLandauer commented 4 years ago

->setRawHeader() – cool! :-)

Only drawback is, that if I want this to fix general indentation errors, I don't know that it's in the Date header. So I'd have to loop over all headers. But that's certainly a good starting point! :-)

I seem to remember something like that in php-mime-mail-parser being created a year or two ago

Yeah, I saw your conversation at https://github.com/zbateson/mail-mime-parser/issues/87. What's the status there? To be honest, I don't understand what those 3 GitLab repos are meant for...

zbateson commented 4 years ago

I think the plan was to create a common interface for mail parsers (like psr7 for streams), so one could choose to use php-mime-mail-parser or zbateson/mail-mime-parser by just changing what they're importing... and then separate parts like my header parsing out of this library and into its own I think as well so php-mime-mail-parser could use it too... something like that.

We chatted about it a bit and it kinda died off, no idea if he's still planning on creating those interfaces or not.

zbateson commented 4 years ago

Only drawback is, that if I want this to fix general indentation errors

To be honest, I'd take my approach and I wouldn't try, hahaha -- even for your case, if you can identify a specific case where it's indented and misplaced in Date, I'd just go with that preferably... if you can identify that it jumps around then maybe bother with writing code looking at 'general indentation' problems for a to header, but I'd avoid doing it if I could... it'll be difficult to have it done 'well' anyway (imho).

zbateson commented 4 years ago

My first approach would be to tell the Austrian company their emails are messed up which is probably causing issues with other clients too :)