Closed ThomasLandauer closed 4 years ago
Sorry, but you'll have to demonstrate that this is widespread beyond an auto-responder from a single company, see the other thread :) I'm thinking I'll add this as a rule somewhere.
In this case, your fix will also result in breaking potentially valid emails, and as per the Moore's Law you mentioned, it could indeed at some point be a problem for someone somewhere, with what is syntactically valid and okay :)
Consider:
Date: 27 Jul 2020 09:13:49 +0200 (email is
To: blah@blah.com and
From: valid@still-valid.org
Sent at some point)
Fixing every possible bad instance of an email, you can imagine an infinite number of "invalid" cases as mentioned in the other issue... but a valid email should (ideally) never fail... you have to be extra careful to not break what is actually valid (IMO).
Sorry, but you'll have to demonstrate that this is widespread beyond an auto-responder from a single company
Well, this example nicely shows why I don't believe in your concept of "widespread" use: If I (or my client, or my users) are doing big business with that single company, their mails are important to me/us (even auto-replies can sometimes include some valuable information). So even if some issue is neglect-able from a global perspective, it might still be important locally.
In this case, your fix will also result in breaking potentially valid emails
No! In your code example: If there is a (valid) To:
somewhere, it would take precedence! But if your 4 lines are the entire headers, I'd rather try to take blah@blah.com and
as To:
, rather than saying "This email doesn't have a To
header".
To: isn't a required header.
If I (or my client, or my users) are doing big business with that single company
But you're free to fix it for yourself in your code...
if ($message->getHeader('to') === null) {
// do what needs to be done
}
Sorry, a library can't be written to accommodate edge-cases like that, it needs to exist outside :)
To: isn't a required header.
Yeah, technically you're right, just Date
and From
are required, see RFC 5322. But 99.9% of all emails also have To
, Message-ID
, Subject
;-)
But you're free to fix it for yourself in your code
How could that look like in this case? Since it's a fundamental problem with header detection, your library wouldn't help me at all, and I'd have to parse the entire email string (the .eml
file, so to say) from scratch?
Sorry, a library can't be written to accommodate edge-cases like that, it needs to exist outside :)
Well, those "community classes" I mentioned at https://github.com/zbateson/mail-mime-parser/issues/119#issuecomment-668548827 would be (sort of) outside. Besides, it would also make your own code cleaner if you'd refactor https://github.com/zbateson/mail-mime-parser/blob/master/src/Header/Part/DatePart.php#L42-L49 out of the main parser to some error-handling class.
But let's wait how many other "fixable" errors I find, to see if it's worth it ;-)
Yeah, like I said I'm not against it being a framework to handle cases like that... I seem to remember something like that in php-mime-mail-parser being created a year or two ago to handle such things if I recall correctly.
In your case, you know the correct header is part of the Date header... instead of me looking at the Date header, you could:
// if to header is empty, then
$dateHeader = $message->getHeader('date')->getRawValue();
// parse out the To: line, you could even send it to my parser if you grab the whole line:
$toLine = preg_match(....., $dateHeader); //, or whatever needed to extract the line...
$message->setRawHeader('to', $toLine);
echo $message->getHeader('to')->getEmail();
But 99.9% of all emails also have To
Hahaha, I'd dispute that one also, I think it's reasonably common to have all recipients be BCC'd (many list emails would do this, which would make up a huge number of emails) :smile:
->setRawHeader()
– cool! :-)
Only drawback is, that if I want this to fix general indentation errors, I don't know that it's in the Date
header. So I'd have to loop over all headers. But that's certainly a good starting point! :-)
I seem to remember something like that in php-mime-mail-parser being created a year or two ago
Yeah, I saw your conversation at https://github.com/zbateson/mail-mime-parser/issues/87. What's the status there? To be honest, I don't understand what those 3 GitLab repos are meant for...
I think the plan was to create a common interface for mail parsers (like psr7 for streams), so one could choose to use php-mime-mail-parser or zbateson/mail-mime-parser by just changing what they're importing... and then separate parts like my header parsing out of this library and into its own I think as well so php-mime-mail-parser could use it too... something like that.
We chatted about it a bit and it kinda died off, no idea if he's still planning on creating those interfaces or not.
Only drawback is, that if I want this to fix general indentation errors
To be honest, I'd take my approach and I wouldn't try, hahaha -- even for your case, if you can identify a specific case where it's indented and misplaced in Date, I'd just go with that preferably... if you can identify that it jumps around then maybe bother with writing code looking at 'general indentation' problems for a to header, but I'd avoid doing it if I could... it'll be difficult to have it done 'well' anyway (imho).
My first approach would be to tell the Austrian company their emails are messed up which is probably causing issues with other clients too :)
Following up on our discussion at https://github.com/zbateson/mail-mime-parser/issues/119, there's another problem I just encountered in the wild (notice the one-space indentation of
To
andFrom
):Source: Misconfigured auto-responder of some Austrian company; they didn't fix it since June 10.
Your parser treats the 2 indented lines as part of the
Date
header - which is syntactically the correct behavior, see RFC 5322.My first thought was: "This is hopeless. You cannot look for arbitrary header names in any other (folded) headers - that would become a mess!" But after thinking some more, I had two ideas how to deal with it after all:
The
Date
header doesn't have soo many variations. So if a (valid) Date header is followed by a CRLF and a WSP, the following string is probably not part of the Date, but rather the next header.(Even better cause more general): There's a set of headers every email usually has (To, From, Subject, Message-ID,...). If the parser does not find all of them in their correct place, start looking for them in "weird" places. Rationale: If the email does not have a correct
To
header, but the stringTo:
does appear in theSubject
header, which is more probable:To
headerTo
header and the subject contains the stringTo: