zbateson / mail-mime-parser

An email parser written in PHP
https://mail-mime-parser.org/
BSD 2-Clause "Simplified" License
441 stars 56 forks source link

feature request #224

Closed postme closed 8 months ago

postme commented 10 months ago

Hi Zaahid,

This is an issue I have tried to resolve for the last couple of years but have never been successful in finding a definitive solution for.

Consider the following display name (yes, this really happens, it's a test case from an actual email):

To: "'Test Recipient'" <"success1"@simulator.amazonses.com>, "'\"\\\"Test, Recipient\\\"\"'" <success5@simulator.amazonses.com>,

MailMimeParser will resolve this to:

One of them is a mangled, non-routeable address: <"'\"\"Test> which causes issues in the system when trying to send the email.

Right now I'm doing a cleanup before having the email addresses parsed/resolved by MailMimeParser but its using regexes and its not covering all the bases.

Is there a way to either cleanup the display names (its always issues in the display name) in the parser or can you throw an exception if the parsing fails? In the latter situation I can at least have an alternative action.

Right now I have silent failures of the parser and I only notice the errors downstream when I'm trying to send a mangled address.

Any way you see to resolve this issue?

Kind regards Meint

postme commented 10 months ago

Ok little update, I've been able to get good results by preparsing the email addresses with mmucklo/email-parse which is able to process all test cases correctly after which I'm setting the cleansed/validated email headers back into the email via setRawHeader. Maybe an idea to integrate this lib into mailmimeparser for the email address parsing part?

zbateson commented 10 months ago

Hi @postme --

Yeah, let me take a look... so long as it's valid it should be fine and it shouldn't give incorrect results, so first glance looks like there's a bug there.

zbateson commented 10 months ago

Hi @postme,

So this seems to work fine for me. I've created a few tests to confirm in https://github.com/zbateson/mail-mime-parser/commit/9f76ca7b9bbe7133a6399dba7887f9e1bf1f0017 -- I am able to get your results, but only if I test with a string that's incorrectly escaped.

So:

$s = '"\'\"\\\"Test, Recipient\\\"\"\'" <success5@simulator.amazonses.com>'

causes exactly the problem you're getting if I try to parse it in either AddressHeaderTest or QuotedStringConsumerTest, but if I correctly escape the backslashes (accounting for PHP needing them to be escaped), i.e.:

$s = '"\'\"\\\\\"Test, Recipient\\\\\"\"\'" <success5@simulator.amazonses.com>'

It works correctly. The same if I test in an actual email. On the test branch, I created tests/_data/emails/github-224.txt and that parses correctly and succeeds.

Anyways, hope that helps :) my guess is I'm just fixing your test, and the issue you found is something else probably! Anyways, will be watching for your reply and hopefully we can get to the bottom of it.

All the best, Zaahid

postme commented 10 months ago

Hi @zbateson,

Yeah that's the thing, I shouldn't be getting unbalanced/incorrectly escaped address strings in the emails but I'm getting them nonetheless :-) My guess its not so much the actual mail clients or mail servers generating this but downstream anti-virus/mailware/dlp scanners doing this (in my experience its the additional mail infra thats usually generating cr*p).

I've been trying to clean them up with regexes, but like the saying goes "you had one problem and you used regex to solve it, now you have two problems". Interestingly the mmucklo package is able to deal with the various messy test cases, thats why I suggested it. I'm more than willing to fund a special project to get to a error resistant, standards compliant rfc5322 email address parser if you would be so inclined :-)

Cheers Meint

zbateson commented 10 months ago

Hi @postme,

I think I confused the issue, sorry... I'm saying I too am parsing your example correctly. My parsing is also meant to be rfc-compliant, so if there's an issue I'm happy to fix it :)

Given the email:

test.eml:

To: "'Test Recipient'" <"success1"@simulator.amazonses.com>, "'\"\\\"Test, Recipient\\\"\"'" <success5@simulator.amazonses.com>,

test

This works:

use ZBateson\MailMimeParser\Message;
$message = Message::from(fopen('test.eml', 'r'), true);
$header = $message->getHeader('To');
$addrs = $header->getAddresses();

echo 'MMP: ', $addrs[0]->getName(), ' ', $addrs[0]->getEmail(), ' ', 
  $addrs[1]->getName(), ' ', $addrs[1]->getEmail(), "\n";

Result:

MMP: 'Test Recipient' "success1"@simulator.amazonses.com '"\"Test, Recipient\""' success5@simulator.amazonses.com
postme commented 9 months ago

Hi Zaahid,

  As promised I spent a bit of time with the issue today and on my side the example code fails. I use the following code, it’s all self contained so should work on your side as well

  $email = <<<TEXT

To: "'Test Recipient'" @.>, "'\"\\"Test, Recipient\\"\"'" @.>

From: "Test Sender" @.***>

Subject: Test

Message-ID: @.***>

Date: Fri, 17 Apr 2020 17:54:22 +0200

MIME-Version: 1.0

  This is a test message.

TEXT;

        $message = Message::from(resource: $email, attached: true);

        $header = $message->getHeader('To');

        $addrs = $header->getAddresses();

        echo 'MMP: ', $addrs[0]->getName(), ' ', $addrs[0]->getEmail(), ' ', $addrs[1]->getName(), ' ', $addrs[1]->getEmail(), "\n";

    }

    Gives output:

  MMP: 'Test Recipient' @.***  "'\"\"Test

  Can you reproduce this output or do we have a weird environment difference somehow?

  Kind regards

Meint

        From: Zaahid Bateson @.> Sent: donderdag 7 september 2023 17:29 To: zbateson/mail-mime-parser @.> Cc: Meint Post @.>; Mention @.> Subject: Re: [zbateson/mail-mime-parser] feature request (Issue #224)

  Hi @postme https://github.com/postme ,

I think I confused the issue, sorry... I'm saying I too am parsing your example correctly. My parsing is also meant to be rfc-compliant, so if there's an issue I'm happy to fix it :)

Given the email:

test.eml:

To: "'Test Recipient'" @. @.> >, "'\"\\"Test, Recipient\\"\"'" @. @.> >,

 

test

This works:

use ZBateson\MailMimeParser\Message;

$message = Message::from(fopen('test.eml', 'r'), true);

$header = $message->getHeader('To');

$addrs = $header->getAddresses();

 

echo 'MMP: ', $addrs[0]->getName(), ' ', $addrs[0]->getEmail(), ' ',

  $addrs[1]->getName(), ' ', $addrs[1]->getEmail(), "\n";

— Reply to this email directly, view it on GitHub https://github.com/zbateson/mail-mime-parser/issues/224#issuecomment-1710357902 , or unsubscribe https://github.com/notifications/unsubscribe-auth/AADQR2VP26UBVHW4RHLIIWDXZHR4ZANCNFSM6AAAAAA4NDRUUI . You are receiving this because you were mentioned.Message ID: @. @.> >

zbateson commented 9 months ago

Hi Meint,

I think github filtered out a bunch of what it thought were email addresses from your message, but I think it's effectively this:

$email = <<<TEXT
To: "'Test Recipient'" <"success1"@simulator.amazonses.com>, "'\"\\\"Test, Recipient\\\"\"'" <success5@simulator.amazonses.com>,

This is a test message.
TEXT;

$message = Message::from($email, true);
$header = $message->getHeader('To');
$addrs = $header->getAddresses();

echo 'MMP: ', $addrs[0]->getName(), ' ', $addrs[0]->getEmail(), ' ', $addrs[1]->getName(), ' ', $addrs[1]->getEmail(), "\n";

Output for it is: MMP: 'Test Recipient' "success1"@simulator.amazonses.com "'\"\\"Test

But again, this is where PHP's strings might be getting you -- it still allows escaping backslashes in HEREDOC (to escape variable names like \${})... so if you echo $email, it comes out like this:

To: "'Test Recipient'" <"success1"@simulator.amazonses.com>, "'\"\\"Test, Recipient\\"\"'" <success5@simulator.amazonses.com>,

This is a test message.

Notice the difference in the backslashes around the name of the second address -- original in php:

To: "'Test Recipient'" <"success1"@simulator.amazonses.com>, "'\"\\\"Test, Recipient\\\"\"'" <success5@simulator.amazonses.com>,

Output version:

To: "'Test Recipient'" <"success1"@simulator.amazonses.com>, "'\"\\"Test, Recipient\\"\"'" <success5@simulator.amazonses.com>,

The reason that's important and why I used an external file, is I need to be able to trust that a quoted comma is okay. If we have an email that's:

"Test\", Recipient" <blah@blah.com>

or some weird combination of backslashes/commas, I have to be able to trust the provided structure of quoted parts including any escaped characters within, otherwise things that should work will break.

I think if you go back to the original email that failed, assuming it changed a bit to create a test case, it might help to copy the contents of that original email to a text file and test with that to narrow down what the issue is? Then simplify the email using the text file to avoid any of the backslash escaping madness that might come with pasting in a php string.

Otherwise, if the issue is in fact what you've pasted, as in you've provided the string already escaped for php HEREDOC, I'm not sure what I could do on my end that wouldn't break other actually valid, but admittedly weird, cases :blush:. I'm all ears though if there's a solution you guys want me to consider!

postme commented 9 months ago

Hi Zaahid,

  Ok it certainly took its time before I grasped what you’ve been saying for the last couple of emails but it finally landed 😊

I will rewrite all my testcase to use an external email file rather than using HEREDOC, if there’s anything in there that is not explained by HEREDOC behavior I will let you know (but now that the penny has dropped I don’t expect any issues)

Thanks for taking the time to explain it, I never contemplated HEREDOC behavior as the culprit.

  Kind regards

Meint

    From: Zaahid Bateson @.> Sent: vrijdag 15 september 2023 18:23 To: zbateson/mail-mime-parser @.> Cc: Meint Post @.>; Mention @.> Subject: Re: [zbateson/mail-mime-parser] feature request (Issue #224)

  Hi Meint,

I think github filtered out a bunch of what it thought were email addresses from your message, but I think it's effectively this:

$email = <<<TEXT

To: "'Test Recipient'" @. @.> >, "'\"\\"Test, Recipient\\"\"'" @. @.> >,

This is a test message.

TEXT;

 

$message = Message::from($email, true);

$header = $message->getHeader('To');

$addrs = $header->getAddresses();

 

echo 'MMP: ', $addrs[0]->getName(), ' ', $addrs[0]->getEmail(), ' ', $addrs[1]->getName(), ' ', $addrs[1]->getEmail(), "\n";

Output for it is: MMP: 'Test Recipient' @. @.> "'\"\"Test

But again, this is where PHP's strings might be getting you -- it still allows escaping backslashes in HEREDOC (to escape variable names like ${})... so if you echo $email, it comes out like this:

To: "'Test Recipient'" @. @.> >, "'\"\"Test, Recipient\"\"'" @. @.> >,

This is a test message.

Notice the difference in the backslashes around the name of the second address -- original in php:

To: "'Test Recipient'" @. @.> >, "'\"\\"Test, Recipient\\"\"'" @. @.> >,

Output version:

To: "'Test Recipient'" @. @.> >, "'\"\"Test, Recipient\"\"'" @. @.> >,

The reason that's important and why I used an external file, is I need to be able to trust that a quoted comma is okay. If we have an email that's:

"Test\", Recipient" @. @.> >

or some weird combination of backslashes/commas, I have to be able to trust the provided structure of quoted parts including any escaped characters within, otherwise things that should work will break.

I think if you go back to the original email that failed, assuming it changed a bit to create a test case, it might help to copy the contents of that original email to a text file and test with that to narrow down what the issue is? Then simplify the email using the text file to avoid any of the backslash escaping madness that might come with pasting in a php string.

Otherwise, if the issue is in fact what you've pasted, as in you've provided the string already escaped for php HEREDOC, I'm not sure what I could do on my end that wouldn't break other actually valid, but admittedly weird, cases 😊. I'm all ears though if there's a solution you guys want me to consider!

— Reply to this email directly, view it on GitHub https://github.com/zbateson/mail-mime-parser/issues/224#issuecomment-1721541458 , or unsubscribe https://github.com/notifications/unsubscribe-auth/AADQR2U5YHD7JOKLPJSZYKDX2R6FXANCNFSM6AAAAAA4NDRUUI . You are receiving this because you were mentioned.Message ID: @. @.> >

zbateson commented 9 months ago

No problem at all, I'm around if you discover anything :)