zbateson / mail-mime-parser

An email parser written in PHP
https://mail-mime-parser.org/
BSD 2-Clause "Simplified" License
458 stars 58 forks source link

Bug in address parser #165

Closed postme closed 3 years ago

postme commented 3 years ago

Hi Zaahid, hope you're doing well! I found a bug in the address parser, I created a little test case:

<?php

declare(strict_types=1);

require __DIR__ . '/vendor/autoload.php';

use ZBateson\MailMimeParser\Message;

$emailString = <<<EOD
From: "Test Sender" <test@sender.org>
To: "Test Recipient" <test@recipient.org>, "Test Recipient" <test@recipient.org:>, "Test Recipient" <,test@recipient.org>
Subject: Test
Date: Tue, 23 Feb 2021 19:01:51 +0000
Message-ID:
 <AM8P189MB142828DEF87769DE5F072D7BFF809@AM8P189MB1428.EURP189.PROD.OUTLOOK.COM>
MIME-Version: 1.0

Test
EOD;

/* Parse the e-mail and create a ZBateson/MailMimeParser object */
$email = Message::from($emailString);

if (null !== $email->getHeader('To')) {
    $toAddresses = $email->getHeader('To')->getAddresses();
}

var_dump($toAddresses);

This yields:

Address 2 and 3 are incorrect, address 2 yields an empty string, address 3 yields a corrected string. Ideally the address parser should throw an error if it can't correctly process the address?

P.S. I have resolved the issue for now by getting the raw values of To, Cc and Bcc and parsing them with egulias/emailvalidator so no rush here.

Kind regards Meint

zbateson commented 3 years ago

Hi Meint,

Nice to hear from you! Yes, doing well thank you :)

The issue you're experiencing (off the top of my head) is because top level for an address line is the AddressGroupConsumer. This means ':' takes precedence over actual addresses, so something in the form 'To: mylist: addr1@example.com, addr2@example.com; another-recipient@example.com' correctly parses the 'mylist' part as a group. It's definitely a less commonly-used feature, but part of the standards nonetheless. It looks out for ':' characters to start a group, and ';' as a separator.

There may be a way to solve this by being more intuitive about ignoring the ':' part once in an address... I'll investigate this more thoroughly to figure out what can be done about it.

I've generally avoided throwing errors in favour of 'returning something' on a best-effort basis, kind of like an email client would do... I do like the idea proposed in #124 though. The trouble is, for something like this issue, without additional code to validate it probably wouldn't cause an error anyway -- and I'm not sure the additional validation is worth it.

All the best, Zaahid

postme commented 3 years ago

Hi Zaahid happy to hear you're doing well! Thanks for explaining the behaviour, I understand the reasoning, ticket #124 would definitely be a nice way to go about it, i.e. returning something sane but still have an opportunity to inspect parser issues and decide on a different interpretation.

What's not entirely clear to me is whether the domain part of the example I gave, i.e. test@recipient.org: (with a colon at the end) would qualify as a valid domain? RFC 5322 states that the domain part is of type dtext and that can mean a whole range of characters (including colon) however it also states that it needs to constitute a real address, ie. resolvable and from that perspective I don't think that recipient.org: qualifies?

Kind regards Meint

zbateson commented 3 years ago

I think you're right -- it wouldn't be a correct domain. For MMP's purposes though, I think that validation is out of scope -- as in it'll "return what's there" to the best of its abilities, but the validation for what constitutes correctness is up to the user. At least that's been my working policy so far.

zbateson commented 3 years ago

I see your point though too about the spec, the 'colon' should be handled normally. Thanks for pointing that out as well Meint :)

postme commented 3 years ago

Hi Zaahid, you're welcome, I'm closing the ticket.

For anybody running into the same (or similar) issue, the solution is to retrieve the raw values like this:


$addresses = '';
foreach ($email->getAllHeaders() as $header) {
    if (strtolower($header->getName()) === 'to' || strtolower($header->getName()) === 'cc' || strtolower($header->getName()) === 'bcc') {
        $addresses .= $header->getRawValue() . ', ';
    }
}
$addresses = trim($addresses, ', ');

This will yield a string containing all recipients addresses which can be parsed by an email address validator.

zbateson commented 2 years ago

A fix for this has been released in 2.1.0 that addresses this in MMP directly -- a separate consumer for the email address portion of an address solves it, and it should've been done that way all along :)