Open ThomasLandauer opened 4 years ago
Yes, I think your examples are right -- there's also a convenience method in AddressHeader to get the email/name of the first header, so you could $message->getHeader('from')->getPersonName()
or ->getEmail()
directly if memory serves me right :stuck_out_tongue_closed_eyes: but that's besides the point you're making hehe.
Again, not entirely against it -- so would it display kind of like an email client shows email addresses? "Blah user@email, Second user@secondemail"?
Perhaps AddressHeader::getDisplayValue or something?
Perhaps AddressHeader::getDisplayValue or something?
No, sorry, I didn't point that out clear enough: This is a general header issue, not just about From
!
So if the email contains:
Subject: =?ISO-8859-1?Q?f=F6=F6b=E4r?=
There should be three functions to get:
Subject: =?ISO-8859-1?Q?f=F6=F6b=E4r?=
=?ISO-8859-1?Q?f=F6=F6b=E4r?=
fööbär
For the naming, I'm proposing a new system:
All functions should follow the pattern ->getHeader()->getSomething()
->getHeader()->getRawLine()
for 1. The word "line" indicates that you'll get the full line, including Subject:
. This would replace all occurrences of just ->getHeader()
in README.md
(cause Header
is an object, and relying on __toString()
is always "risky"). This would elegantly solve https://github.com/zbateson/mail-mime-parser/issues/123 :-)->getHeader()->getRawContent()
for 2. Same as ->getRawValue()
, but I'd rename it for consistency - see below.->getHeader()->getContent()
for 3.->getHeaderValue()
cause it's inconsistent with this naming system. Besides, its current behavior is inconsistent between Subject
and From
headers.->getValue()
is IMO dangerous/misleading, since for multiple addresses it returns just the first. So this should be dropped completely. ->getHeader()->getContent()
should return the entire decoded content (as string). And for each single address (or other part), keep ->getParts()
Just noticed that RFC 5322 uses the term "body" for what I called "content":
Header fields are lines beginning with a field name, followed by a colon (":"), followed by a field body, and terminated by CRLF.
So maybe ->getBody()
insted of ->getContent()
? But this could lead to confusion with the email's body...
I still don't see the value in changing __toString or it's meaning (the "string" representation of the header).
There should be three functions to get:
Subject: =?ISO-8859-1?Q?f=F6=F6b=E4r?= =?ISO-8859-1?Q?f=F6=F6b=E4r?= fööbär
But isn't there? __toString
, getRawValue
, getValue
I get that there's some inconsistency in getHeaderValue/getValue with addresses, but often for AddressHeader's you might only be interested in the first address, so I made it inconsistent so you didn't have to worry about AddressHeader potentially representing more than one address for a 'from' header for instance. I'm not sure what's worse though, both have their drawbacks imo :(
But yeah, that could've been a mistake. I'll give it some though. Maybe a convenient very explicit "getFirstAddress" "getFirstName" or something would've made more sense.
And yeah, if we're to change it we'd have to deprecate getHeaderValue and getValue and change them to getHeaderContent and getContent (or Body).
But isn't there? __toString, getRawValue, getValue
No, since for a To
header, getValue()
only returns (a) the first address, and (b) just the email part of that. This behavior is deceptive IMO, since it gives the false impression that the first address is all there is!
In which situations would somebody only be interested in the first address?
I get what you're saying about AddressHeader being inconsistent with its getValue and acknowledge that. In your example though, Subject, getValue would give result 3.
I can think of quite a few examples where you might only care about a single address... the primary being if you're checking 'From' for an address, is it preferable to go if (count($fromHeader->getAddresses()) > 0 && strcasecmp($from->getAddresses()[0]->getEmail(), 'email@example.com'))
or is there value in being able to do if (strcasecmp($from->getValue(), 'email@example.com')
?
Yeah, sure the second is shorter - but how can you know in advance that the address you're looking for must be the first one?
I think you'd rarely care if 'from' contained more than one address, I think that's very uncommon.
In your example though, Subject, getValue would give result 3.
Yeah, sorry, I missed that cause I was misled by the "errors" caused by https://github.com/zbateson/mail-mime-parser/issues/130
I think you'd rarely care if 'from' contained more than one address, I think that's very uncommon.
I was about to say that, but was afraid you'd come up with some mailing list (or whatever) example ;-)
Anyway: What about To
and Cc
?
The real point is: If getValue()
(or getHeaderValue()
) returns the entire content for most headers, it cannot be that it just returns just some part for some headers! That's really deceptive!
So what I'm mainly suggesting is to change the function's behavior. And since this is a BC break, you consequently need to change its name too. And this is a nice opportunity to rethink the entire naming concept.
I get it and am not disagreeing on this one as well. I feel like 'getValue' is so standard that it might be preferred even if there's a problem with it though... and the deprecation and introduction of 'getContent' also seems at least as equally annoying to me as the problem it's trying to fix. Everyone will have to change their calls from getValue to getContent, and they're all already used to the existing functionality.
Well, I for myself don't care how you handle the transition :-) Semantic versioning probably demands the deprecation since it's a BC break.
BTW: How could you "send" a deprecation warning message at all? And where/how could users see it?
About the very name: The more I think of it, the more I'm torn towards getBody(). Why? Because it's the term the RFC uses! And nobody was ever fired for sticking to the RFC... ;-)
The only deprecation I've done so far I've put up on the main readme (and main page of mail-mime-parser.org). I'd also put it up in the release notes for the specific version it changed in.
Here's the answer I was looking for: https://stackoverflow.com/a/194135/1668200
Aah, I see what you're getting at -- yeah, I didn't do that. I only added an @deprecated
annotation. That should be added to the deprecated functions I have too then :+1: even better in the comment on that one E_USER_DEPRECATED
was mentioned as well.
Two more points about the naming:
Just noticed that $message->getHeaderValue('content-type')
only returns text/plain
if the full header is Content-Type: text/plain;charset=utf-8
. So I think you're using getHeaderValue()
for something like "give me the most important info from this header". It is OK to have a convenience function like that. But it shouldn't be called "getXValue" Can't think of a better name (yet), cause I don't know what it gives for other headers ;-)
So are there more headers (besides To
, From
, and Content-Type
) where getHeaderValue()
doesn't return everything? If yes, I'd suggest to set up an overview table listing each header and each function and display what it returns. This will make it easier to find better names.
Since you're parsing the comments in the headers anyway, what about something like getBodyWithoutComments()?
so I think you're using getHeaderValue() for something like "give me the most important info from this header"
Yeah, actually I think in that case it's different though too... if you consider the type of header Content-Type is, it's a header with optional additional 'parameters', so the value of the header is "text/plain", but it may have additional parameters like 'charset', which has a default value of 'us-ascii'... I think this is also different from my processing of emails which I agree is confusing.
You can see what I'm getting at here: RFC 1341 (I'm sure this must be mentioned in RFC822 and/or derivatives somewhere but I couldn't find it with a quick glance lol... anyway, specifically:
Content-Type header field value is defined as follows: Content-Type := type "/" subtype *[";" parameter]
Note that the 'parameter' is optional, and can be different depending on type/subtype -- for multipart, it could be 'boundary', for content-type it could be charset, etc... the 'value' is one thing, and the 'parameters' are another. This is true for other parameterized headers too, like 'content-disposition'... it's 'value' could be 'attachment', and it could optionally have a 'filename' parameter for example.
Since you're parsing the comments in the headers anyway, what about something like getBodyWithoutComments()?
I think most people expect the comments to be removed in the value, they don't need to worry about that being explicitly stated imo.
the 'value' is one thing, and the 'parameters' are another
Well, another argument for "getBody()" :-)
I think most people ...
What I definitely can say is that most people don't know the difference between "value", "parameter", "comment" etc. for an email header! If I see something like getValue()
I read it as "give me the stuff that's in there".
Comments:
A week ago I didn't even know that there is such a thing :-) The only encounter I had so far was the suffixed timezone name in the Date
header (e.g. (GMT)
) which sometimes causes problems, see https://github.com/php-mime-mail-parser/php-mime-mail-parser/issues/328
If that's the only occurrence of comments in the wild, there's indeed no need for a getBodyWithoutComments()
. Anybody interested in this piece of information could use getRawBody()
.
BTW: Is your decoding engine usable stand-alone? I.e. can I throw =?ISO-8859-1?Q?f=F6=F6b=E4r?=
in somewhere?
What I definitely can say is that most people don't know the difference between "value", "parameter", "comment" etc. for an email header! If I see something like getValue() I read it as "give me the stuff that's in there".
I'd say you're wrong about most people. Consider these examples:
https://javaee.github.io/javamail/docs/api/javax/mail/Header.html#getValue--
https://swiftmailer.symfony.com/docs/headers.html :
// note that value is one part, parameter is another
$headers->addParameterizedHeader(
'Header-Name', 'header value',
['foo' => 'bar']
);
The only instance of 'getBody' I've seen is here:
http://james.apache.org/mime4j/apidocs/org/apache/james/mime4j/stream/Field.html#getBody()
And in this case it's like my getRawValue.
BTW: Is your decoding engine usable stand-alone? I.e. can I throw =?ISO-8859-1?Q?f=F6=F6b=E4r?= in somewhere?
You should be able to do something like this if you want:
$di = new \ZBateson\Container();
$mlpf = $di->getMimeLiteralPartFactory();
$mlpf->newInstance('=?ISO-8859-1?Q?f=F6=F6b=E4r?=');
You could also create an instance of \ZBateson\MailMimeParser\Header\Part\MimeLiteralPart, pass it a \ZBateson\MbWrapper\MbWrapper instance as first parameter, and encoded word as second parameter. My example here though gives you options with dependency injection (in case future readers are doing that).
Yeah, sure "getBody" is uncommon. But what's your suggestion?? IMO getValue
is ruined now, since you sometimes use it indifferently (just give me whatever is in there), and sometimes you use it specifically (value vs. parameter vs. comment). So I'm guessing it's easier to educate people to use a new function, rather than understand that an existing did change.
Besides: Since you follow the RFCs where ever possible, why not take the names from them too?
I never "just give you whatever's in there"... it's never the equivalent of "getRawValue".
The only instance it's a little different is the way it gives the first email rather than 'all emails and names' in a string.
(unless the header doesn't require any processing of course)
How about I deprecate getValue only in AddressHeader, introduce a getCompleteValue (again just for AddressHeader) in a 1.3... then in 2.0 I deprecate getCompleteValue and restore getValue with the functionality of getCompleteValue.
It might affect a few people upgrading from an older version to the latest, but still less than deprecating all of 'getValue', introducing a new method like 'getBody' or 'getContent' which to me definitely implies the whole header instead of the value of it for something like a parameterized header, and also means most people can just be 'aware' of the change without changing every call to 'getValue' to something else.
I'd say I'm between that or living with the inconsistency, and I'll give it some thought.
(unless the header doesn't require any processing of course)
This is exactly the problem! To me, a header is a header. I don't know which headers need processing...
How about I deprecate getValue only in AddressHeader
And what about Content-Type
?
Seriously: If you want to get a clear, clean nomenclature, please set up a wiki page (is this still possible on GitHub?) with a big table: All headers and all your functions; and in each cell, what the function returns. This will give us an overview which cases we need to cover. This table will ultimately move to README
, so it's no lost work but a great documentation for the future :-)
And maybe we should also get Vincent on board, cause you said that was the original idea behind https://gitlab.com/mailcare/parser ? Might take a while till we reach a consensus, but on the other hand: If you two agree on a common interface, this is going to be the de-facto standard for any PHP mail parsing project!
This is exactly the problem! To me, a header is a header. I don't know which headers need processing...
No, you misunderstood me...
Subject: this doesn't need any processing
Subject: =?utf-8?Q=but_this_does?=
I'm always processing headers, some look the same regardless though (first example)... in that case there's no difference between the two.
Seriously: If you want to get a clear, clean nomenclature, please set up a wiki page
Yup, I can set it up for you if you want to undertake that :+1:
And maybe we should also get Vincent on board, cause you said that was the original idea behind
That's his project... he's not currently doing much with headers, I think he was more interested in using my header parser (when that was being discussed).
You can have a look at what type of Header object is created for what type of header here:
https://github.com/zbateson/mail-mime-parser/blob/master/src/Header/HeaderFactory.php
The default is a "GenericHeader" which filters out comments and processes quoted parts.
Yup, I can set it up for you if you want to undertake that
OK, please!
I think most people expect the comments to be removed in the value
Some data I've just stumbled upon: For the Auto-Submitted
header (RFC 3834), adding a comment seems to be quite common: 16% of the emails I'm analyzing are having one. And it indeed adds some value (pun!) to the information:
auto-generated (confirmation) auto-replied (vacation)
The information in that case is still available via getRawValue if you want it that way of course, but parts are accessible separately too via getParts
https://mail-mime-parser.org/api/1.2/classes/ZBateson.MailMimeParser.Header.GenericHeader.html --
That returns all the parsed HeaderPart objects, including any CommentPart objects... although having a 'getComments' defined in AbstractHeader might make that easier.
Hi @ThomasLandauer --
I've had a chance to look at this some more and think about it and the behaviour isn't quite the way you described it.
'getValue' actually consistently returns the value of the first part of a header. Consider:
You can see that this is true in that AddressHeader doesn't override 'getValue', and neither does 'ParameterHeader'. The default implementation of getValue does this in AbstractHeader:
if (!empty($this->parts)) {
return $this->parts[0]->getValue();
}
return null;
This actually does make sense to me and I think it's still my preferred way... it allows a user to not have to handle knowing if an address contains more than one address in some cases like with a From. If the user expects more than one address, they can specifically handle it.
Having said that, it may be useful to have a convenience method to get all the names/addresses in an email field after decoding, that displays them like Thunderbird would or something. I'm a bit hesitant though because it has to make some decisions for the user -- how should I display this email address for example: "My <weird> Name" <email@example.com>
(or an address with mime-encoded non-ascii characters?)... I think what makes sense to me is to return the name always quoted, with non-ascii characters in mime encoded parts included in the quotes as well. It may not make sense or be useful to everyone though.
I know this is an old topic but I am having a similar issue. There seems to be no way to get the "full" decoded and unfolded contents of a header. For example:
This is the header as it appears in the rfc2822 message:
Authentication-Results: cf04.mxguardian.net;
dkim=pass (2048-bit key)
header.d=effectwebagency-com.20150623.gappssmtp.com
What I would like to retrieve is the full unfolded header without the header name, like so:
cf04.mxguardian.net; dkim=pass (2048-bit key) header.d=effectwebagency-com.20150623.gappssmtp.com
$result = $message->getHeader('Authentication-Results');
echo $result->getValue(),"\n\n";
echo $result->getRawValue(),"\n\n";
echo $result->__toString(),"\n\n";
cf04.mxguardian.net
cf04.mxguardian.net;
dkim=pass (2048-bit key)
header.d=effectwebagency-com.20150623.gappssmtp.com
Authentication-Results: cf04.mxguardian.net;
dkim=pass (2048-bit key)
header.d=effectwebagency-com.20150623.gappssmtp.com
As you can see, none of these methods returns the complete header on one line. It seems like my best option is to use getRawValue and do the unfolding myself, however, getRawValue doesn't do any MIME decoding. So if I'm working with a header that is MIME encoded then getRawValue doesn't work and I have to use a different workaround.
Can we have a method such as getDecodedAndUnfoldedValue
that works with any header?
I'm using version 2.1.0.
Thanks Kent
Hi @fkoyer,
Looks like you just need a quick regex for it: echo preg_replace('/\s+/', ' ', $result->getRawValue());
.
Edit: should really read things properly before responding, sorry.
Hi @zbateson,
Thanks for the quick response. Yes, I can definitely do the unfolding myself. It would just be handy to have a function that returns the full decoded and unfolded header text so that we don't have to do it every time we access a header. TBH, that's what I expected getValue()
to do. It's too late to change that now but another convenience method would be a welcome addition.
Thanks Kent
I'm done with
Date:
,From:
is next :-)I have this:
And I want to get this:
Currently, there's no easy way to get it, right?
echo $message->getHeader('from');
gives:See
AbstractHeader::__toString()
$message->getHeader('from')->getRawValue()
gives:$message->getHeaderValue('from')
gives:So all that's left is to use something like
->getAddresses()[0]->getName()
and->getAddresses()[0]->getEmail()
and reassemble them manually, right?php-mime-mail-parser/php-mime-mail-parser
has agetHeader()
which does exactly what I mean.So I'm suggesting that besides giving the raw header, or the deconstructed parts, you should offer an "intermediate" function that gives the semantically full header (i.e. full header decoded and unfolded).