zbateson / mail-mime-parser

An email parser written in PHP
https://mail-mime-parser.org/
BSD 2-Clause "Simplified" License
456 stars 57 forks source link

Get decoded/unfolded semantically "full" header #122

Open ThomasLandauer opened 4 years ago

ThomasLandauer commented 4 years ago

I'm done with Date:, From: is next :-)

I have this:

From: =?ISO-8859-1?Q?f=F6=F6b=E4r?=
 <company@example.com>

And I want to get this:

fööbär <company@example.com>

Currently, there's no easy way to get it, right?

php-mime-mail-parser/php-mime-mail-parser has a getHeader() which does exactly what I mean.

So I'm suggesting that besides giving the raw header, or the deconstructed parts, you should offer an "intermediate" function that gives the semantically full header (i.e. full header decoded and unfolded).

zbateson commented 4 years ago

Yes, I think your examples are right -- there's also a convenience method in AddressHeader to get the email/name of the first header, so you could $message->getHeader('from')->getPersonName() or ->getEmail() directly if memory serves me right :stuck_out_tongue_closed_eyes: but that's besides the point you're making hehe.

Again, not entirely against it -- so would it display kind of like an email client shows email addresses? "Blah user@email, Second user@secondemail"?

Perhaps AddressHeader::getDisplayValue or something?

ThomasLandauer commented 4 years ago

Perhaps AddressHeader::getDisplayValue or something?

No, sorry, I didn't point that out clear enough: This is a general header issue, not just about From!

So if the email contains:

Subject: =?ISO-8859-1?Q?f=F6=F6b=E4r?=

There should be three functions to get:

  1. Subject: =?ISO-8859-1?Q?f=F6=F6b=E4r?=
  2. =?ISO-8859-1?Q?f=F6=F6b=E4r?=
  3. fööbär

For the naming, I'm proposing a new system:

All functions should follow the pattern ->getHeader()->getSomething()

ThomasLandauer commented 4 years ago

Just noticed that RFC 5322 uses the term "body" for what I called "content":

Header fields are lines beginning with a field name, followed by a colon (":"), followed by a field body, and terminated by CRLF.

So maybe ->getBody() insted of ->getContent()? But this could lead to confusion with the email's body...

zbateson commented 4 years ago

I still don't see the value in changing __toString or it's meaning (the "string" representation of the header).

There should be three functions to get:

Subject: =?ISO-8859-1?Q?f=F6=F6b=E4r?= =?ISO-8859-1?Q?f=F6=F6b=E4r?= fööbär

But isn't there? __toString, getRawValue, getValue

I get that there's some inconsistency in getHeaderValue/getValue with addresses, but often for AddressHeader's you might only be interested in the first address, so I made it inconsistent so you didn't have to worry about AddressHeader potentially representing more than one address for a 'from' header for instance. I'm not sure what's worse though, both have their drawbacks imo :(

But yeah, that could've been a mistake. I'll give it some though. Maybe a convenient very explicit "getFirstAddress" "getFirstName" or something would've made more sense.

And yeah, if we're to change it we'd have to deprecate getHeaderValue and getValue and change them to getHeaderContent and getContent (or Body).

ThomasLandauer commented 4 years ago

But isn't there? __toString, getRawValue, getValue

No, since for a To header, getValue() only returns (a) the first address, and (b) just the email part of that. This behavior is deceptive IMO, since it gives the false impression that the first address is all there is!

In which situations would somebody only be interested in the first address?

zbateson commented 4 years ago

I get what you're saying about AddressHeader being inconsistent with its getValue and acknowledge that. In your example though, Subject, getValue would give result 3.

I can think of quite a few examples where you might only care about a single address... the primary being if you're checking 'From' for an address, is it preferable to go if (count($fromHeader->getAddresses()) > 0 && strcasecmp($from->getAddresses()[0]->getEmail(), 'email@example.com')) or is there value in being able to do if (strcasecmp($from->getValue(), 'email@example.com')?

ThomasLandauer commented 4 years ago

Yeah, sure the second is shorter - but how can you know in advance that the address you're looking for must be the first one?

zbateson commented 4 years ago

I think you'd rarely care if 'from' contained more than one address, I think that's very uncommon.

ThomasLandauer commented 4 years ago

In your example though, Subject, getValue would give result 3.

Yeah, sorry, I missed that cause I was misled by the "errors" caused by https://github.com/zbateson/mail-mime-parser/issues/130

I think you'd rarely care if 'from' contained more than one address, I think that's very uncommon.

I was about to say that, but was afraid you'd come up with some mailing list (or whatever) example ;-)

Anyway: What about To and Cc?
The real point is: If getValue() (or getHeaderValue()) returns the entire content for most headers, it cannot be that it just returns just some part for some headers! That's really deceptive!

So what I'm mainly suggesting is to change the function's behavior. And since this is a BC break, you consequently need to change its name too. And this is a nice opportunity to rethink the entire naming concept.

zbateson commented 4 years ago

I get it and am not disagreeing on this one as well. I feel like 'getValue' is so standard that it might be preferred even if there's a problem with it though... and the deprecation and introduction of 'getContent' also seems at least as equally annoying to me as the problem it's trying to fix. Everyone will have to change their calls from getValue to getContent, and they're all already used to the existing functionality.

ThomasLandauer commented 4 years ago

Well, I for myself don't care how you handle the transition :-) Semantic versioning probably demands the deprecation since it's a BC break.

BTW: How could you "send" a deprecation warning message at all? And where/how could users see it?

About the very name: The more I think of it, the more I'm torn towards getBody(). Why? Because it's the term the RFC uses! And nobody was ever fired for sticking to the RFC... ;-)

zbateson commented 4 years ago

The only deprecation I've done so far I've put up on the main readme (and main page of mail-mime-parser.org). I'd also put it up in the release notes for the specific version it changed in.

ThomasLandauer commented 4 years ago

Here's the answer I was looking for: https://stackoverflow.com/a/194135/1668200

zbateson commented 4 years ago

Aah, I see what you're getting at -- yeah, I didn't do that. I only added an @deprecated annotation. That should be added to the deprecated functions I have too then :+1: even better in the comment on that one E_USER_DEPRECATED was mentioned as well.

ThomasLandauer commented 4 years ago

Two more points about the naming:

zbateson commented 4 years ago

so I think you're using getHeaderValue() for something like "give me the most important info from this header"

Yeah, actually I think in that case it's different though too... if you consider the type of header Content-Type is, it's a header with optional additional 'parameters', so the value of the header is "text/plain", but it may have additional parameters like 'charset', which has a default value of 'us-ascii'... I think this is also different from my processing of emails which I agree is confusing.

You can see what I'm getting at here: RFC 1341 (I'm sure this must be mentioned in RFC822 and/or derivatives somewhere but I couldn't find it with a quick glance lol... anyway, specifically:

Content-Type header field value is defined as follows: Content-Type := type "/" subtype *[";" parameter]

Note that the 'parameter' is optional, and can be different depending on type/subtype -- for multipart, it could be 'boundary', for content-type it could be charset, etc... the 'value' is one thing, and the 'parameters' are another. This is true for other parameterized headers too, like 'content-disposition'... it's 'value' could be 'attachment', and it could optionally have a 'filename' parameter for example.

Since you're parsing the comments in the headers anyway, what about something like getBodyWithoutComments()?

I think most people expect the comments to be removed in the value, they don't need to worry about that being explicitly stated imo.

ThomasLandauer commented 4 years ago

the 'value' is one thing, and the 'parameters' are another

Well, another argument for "getBody()" :-)

I think most people ...

What I definitely can say is that most people don't know the difference between "value", "parameter", "comment" etc. for an email header! If I see something like getValue() I read it as "give me the stuff that's in there".

Comments: A week ago I didn't even know that there is such a thing :-) The only encounter I had so far was the suffixed timezone name in the Date header (e.g. (GMT)) which sometimes causes problems, see https://github.com/php-mime-mail-parser/php-mime-mail-parser/issues/328
If that's the only occurrence of comments in the wild, there's indeed no need for a getBodyWithoutComments(). Anybody interested in this piece of information could use getRawBody().

BTW: Is your decoding engine usable stand-alone? I.e. can I throw =?ISO-8859-1?Q?f=F6=F6b=E4r?= in somewhere?

zbateson commented 4 years ago

What I definitely can say is that most people don't know the difference between "value", "parameter", "comment" etc. for an email header! If I see something like getValue() I read it as "give me the stuff that's in there".

I'd say you're wrong about most people. Consider these examples:

https://javaee.github.io/javamail/docs/api/javax/mail/Header.html#getValue--

https://swiftmailer.symfony.com/docs/headers.html :

// note that value is one part, parameter is another
$headers->addParameterizedHeader(
  'Header-Name', 'header value',
  ['foo' => 'bar']
  );

The only instance of 'getBody' I've seen is here:

http://james.apache.org/mime4j/apidocs/org/apache/james/mime4j/stream/Field.html#getBody()

And in this case it's like my getRawValue.

BTW: Is your decoding engine usable stand-alone? I.e. can I throw =?ISO-8859-1?Q?f=F6=F6b=E4r?= in somewhere?

You should be able to do something like this if you want:

$di = new \ZBateson\Container();
$mlpf = $di->getMimeLiteralPartFactory();
$mlpf->newInstance('=?ISO-8859-1?Q?f=F6=F6b=E4r?=');

You could also create an instance of \ZBateson\MailMimeParser\Header\Part\MimeLiteralPart, pass it a \ZBateson\MbWrapper\MbWrapper instance as first parameter, and encoded word as second parameter. My example here though gives you options with dependency injection (in case future readers are doing that).

ThomasLandauer commented 4 years ago

Yeah, sure "getBody" is uncommon. But what's your suggestion?? IMO getValue is ruined now, since you sometimes use it indifferently (just give me whatever is in there), and sometimes you use it specifically (value vs. parameter vs. comment). So I'm guessing it's easier to educate people to use a new function, rather than understand that an existing did change.

Besides: Since you follow the RFCs where ever possible, why not take the names from them too?

zbateson commented 4 years ago

I never "just give you whatever's in there"... it's never the equivalent of "getRawValue".

The only instance it's a little different is the way it gives the first email rather than 'all emails and names' in a string.

zbateson commented 4 years ago

(unless the header doesn't require any processing of course)

zbateson commented 4 years ago

How about I deprecate getValue only in AddressHeader, introduce a getCompleteValue (again just for AddressHeader) in a 1.3... then in 2.0 I deprecate getCompleteValue and restore getValue with the functionality of getCompleteValue.

It might affect a few people upgrading from an older version to the latest, but still less than deprecating all of 'getValue', introducing a new method like 'getBody' or 'getContent' which to me definitely implies the whole header instead of the value of it for something like a parameterized header, and also means most people can just be 'aware' of the change without changing every call to 'getValue' to something else.

I'd say I'm between that or living with the inconsistency, and I'll give it some thought.

ThomasLandauer commented 4 years ago

(unless the header doesn't require any processing of course)

This is exactly the problem! To me, a header is a header. I don't know which headers need processing...

How about I deprecate getValue only in AddressHeader

And what about Content-Type?

Seriously: If you want to get a clear, clean nomenclature, please set up a wiki page (is this still possible on GitHub?) with a big table: All headers and all your functions; and in each cell, what the function returns. This will give us an overview which cases we need to cover. This table will ultimately move to README, so it's no lost work but a great documentation for the future :-)

And maybe we should also get Vincent on board, cause you said that was the original idea behind https://gitlab.com/mailcare/parser ? Might take a while till we reach a consensus, but on the other hand: If you two agree on a common interface, this is going to be the de-facto standard for any PHP mail parsing project!

zbateson commented 4 years ago

This is exactly the problem! To me, a header is a header. I don't know which headers need processing...

No, you misunderstood me...

Subject: this doesn't need any processing
Subject: =?utf-8?Q=but_this_does?=

I'm always processing headers, some look the same regardless though (first example)... in that case there's no difference between the two.

Seriously: If you want to get a clear, clean nomenclature, please set up a wiki page

Yup, I can set it up for you if you want to undertake that :+1:

And maybe we should also get Vincent on board, cause you said that was the original idea behind

That's his project... he's not currently doing much with headers, I think he was more interested in using my header parser (when that was being discussed).

zbateson commented 4 years ago

You can have a look at what type of Header object is created for what type of header here:

https://github.com/zbateson/mail-mime-parser/blob/master/src/Header/HeaderFactory.php

The default is a "GenericHeader" which filters out comments and processes quoted parts.

ThomasLandauer commented 4 years ago

Yup, I can set it up for you if you want to undertake that

OK, please!

ThomasLandauer commented 4 years ago

I think most people expect the comments to be removed in the value

Some data I've just stumbled upon: For the Auto-Submitted header (RFC 3834), adding a comment seems to be quite common: 16% of the emails I'm analyzing are having one. And it indeed adds some value (pun!) to the information:

auto-generated (confirmation) auto-replied (vacation)

zbateson commented 4 years ago

The information in that case is still available via getRawValue if you want it that way of course, but parts are accessible separately too via getParts https://mail-mime-parser.org/api/1.2/classes/ZBateson.MailMimeParser.Header.GenericHeader.html --

That returns all the parsed HeaderPart objects, including any CommentPart objects... although having a 'getComments' defined in AbstractHeader might make that easier.

zbateson commented 3 years ago

Hi @ThomasLandauer --

I've had a chance to look at this some more and think about it and the behaviour isn't quite the way you described it.

'getValue' actually consistently returns the value of the first part of a header. Consider:

You can see that this is true in that AddressHeader doesn't override 'getValue', and neither does 'ParameterHeader'. The default implementation of getValue does this in AbstractHeader:

if (!empty($this->parts)) {
    return $this->parts[0]->getValue();
}
return null;

This actually does make sense to me and I think it's still my preferred way... it allows a user to not have to handle knowing if an address contains more than one address in some cases like with a From. If the user expects more than one address, they can specifically handle it.

Having said that, it may be useful to have a convenience method to get all the names/addresses in an email field after decoding, that displays them like Thunderbird would or something. I'm a bit hesitant though because it has to make some decisions for the user -- how should I display this email address for example: "My <weird> Name" <email@example.com> (or an address with mime-encoded non-ascii characters?)... I think what makes sense to me is to return the name always quoted, with non-ascii characters in mime encoded parts included in the quotes as well. It may not make sense or be useful to everyone though.

fkoyer commented 2 years ago

I know this is an old topic but I am having a similar issue. There seems to be no way to get the "full" decoded and unfolded contents of a header. For example:

Header

This is the header as it appears in the rfc2822 message:

Authentication-Results: cf04.mxguardian.net;
    dkim=pass (2048-bit key)
    header.d=effectwebagency-com.20150623.gappssmtp.com

Desired output

What I would like to retrieve is the full unfolded header without the header name, like so:

cf04.mxguardian.net; dkim=pass (2048-bit key) header.d=effectwebagency-com.20150623.gappssmtp.com

Code

$result = $message->getHeader('Authentication-Results');
echo $result->getValue(),"\n\n";
echo $result->getRawValue(),"\n\n";
echo $result->__toString(),"\n\n";

Output

cf04.mxguardian.net

cf04.mxguardian.net;
        dkim=pass (2048-bit key)
        header.d=effectwebagency-com.20150623.gappssmtp.com

Authentication-Results: cf04.mxguardian.net;
        dkim=pass (2048-bit key)
        header.d=effectwebagency-com.20150623.gappssmtp.com

As you can see, none of these methods returns the complete header on one line. It seems like my best option is to use getRawValue and do the unfolding myself, however, getRawValue doesn't do any MIME decoding. So if I'm working with a header that is MIME encoded then getRawValue doesn't work and I have to use a different workaround.

Can we have a method such as getDecodedAndUnfoldedValue that works with any header?

I'm using version 2.1.0.

Thanks Kent

zbateson commented 2 years ago

Hi @fkoyer,

Looks like you just need a quick regex for it: echo preg_replace('/\s+/', ' ', $result->getRawValue());.

Edit: should really read things properly before responding, sorry.

fkoyer commented 2 years ago

Hi @zbateson,

Thanks for the quick response. Yes, I can definitely do the unfolding myself. It would just be handy to have a function that returns the full decoded and unfolded header text so that we don't have to do it every time we access a header. TBH, that's what I expected getValue() to do. It's too late to change that now but another convenience method would be a welcome addition.

Thanks Kent