zbateson / mail-mime-parser

An email parser written in PHP
https://mail-mime-parser.org/
BSD 2-Clause "Simplified" License

Issue reading content encoded as Windows-1258 #141

Closed (Lepelley closed 3 years ago)

Lepelley commented 3 years ago

I retrieve mails from the Gmail API and parse them with your library. In some cases (fewer than 3%), it returns only some of the characters on PHP 5.4.16, but returns everything on 7.4.6.

<?php
use ZBateson\MailMimeParser\Message;

$decodedMail = "mime string"; // raw MIME message fetched from the Gmail API
$mime = Message::from($decodedMail);
echo $mime->getHtmlContent();

Do you have any idea what could cause that difference? We will have to upgrade our PHP version eventually, but I'm not sure I can push my boss to do that yet.

zbateson commented 3 years ago

Hi @Lepelley

Could you confirm which version of mail-mime-parser you're using? Some old versions had a similar issue with base64 decoding when using PHP's built-in decoding, so I switched to my own, based on PSR-7 streams with guzzlehttp... I don't think that issue was specific to 5.4, and unfortunately I can't think of anything else that may be causing that.

Otherwise -- if it's not a version issue... it would be very helpful if you could narrow it down to an email and see if a test could be written based on it so we can fix it.

Lepelley commented 3 years ago

I was using version 1.2.0, but I also tried with 1.2.3. The email I got the error with (with some data anonymised):

Deleted

zbateson commented 3 years ago

Can you confirm it's the base64 encoded image part that the issue happens on?

Lepelley commented 3 years ago

My problem is that the content of the mail is truncated; I'm not sure if it's because of the base64 image.

zbateson commented 3 years ago

Hmm, so it could be an issue with quoted-printable... the content as in specifically the text part or the html part or both?

Lepelley commented 3 years ago

Both

zbateson commented 3 years ago

Hi @Lepelley

Sorry for the delay looking at this. This is actually happening to me on PHP 7.4.3 as well, but what I've noticed is that it specifies a weird charset for the content: "windows-1258", which according to Wikipedia is "a code page used in Microsoft Windows to represent Vietnamese texts".

Using your attached example, if I manually update the charsets to iso-8859-1, I'm able to see the entire content for both the text/plain and text/html parts. I'm not sure if this is an issue on my end (or with zbateson/mb-wrapper), with php, or with the incorrect charset specified... any ideas?

Lepelley commented 3 years ago

Hello @zbateson, Well... I have no idea, but it's strange that PHP 7.4.6 returns the correct result while 7.4.3 does not. I'm getting the mail directly from the Gmail API, so the wrong encoding must have been set when the mail was sent.

zbateson commented 3 years ago

I've narrowed it down to an iconv function, so this could be system-specific, potentially down to the version of iconv being used (or a bug in PHP's implementation of the function calling iconv, lol).

In zbateson/mb-wrapper, I end up calling:

iconv_substr($decodedText, 0, 2037, 'CP1258');

Here $decodedText contains the html or text part after being quoted-printable decoded. Unfortunately iconv_substr is only returning 11 characters, and I'm not sure why. iconv seems to successfully convert from CP1258 to UTF-8, and calling iconv_strlen on $decodedText also returns 2037 in this case.
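
To check whether a given system shows the same mismatch, the three measurements described above can be compared side by side. This is only a diagnostic sketch: `charsetLengthReport` is a hypothetical helper name, not part of mb-wrapper, and the numbers it reports depend on the local iconv build.

```php
<?php
// Hypothetical diagnostic helper (not part of mb-wrapper or mail-mime-parser).
// Compares the character count reported by iconv_strlen, the length of a
// full-length iconv_substr slice, and the length after converting to UTF-8.
function charsetLengthReport($text, $charset)
{
    $full = iconv_strlen($text, $charset);
    if ($full === false) {
        // iconv could not interpret the input in this charset at all
        return array(
            'iconv_strlen'     => false,
            'iconv_substr_len' => false,
            'utf8_mb_strlen'   => false,
        );
    }

    // Ask for every character; a shorter result reproduces the truncation
    $slice = iconv_substr($text, 0, $full, $charset);
    $utf8  = iconv($charset, 'UTF-8', $text);

    return array(
        'iconv_strlen'     => $full,
        'iconv_substr_len' => $slice === false ? false : iconv_strlen($slice, $charset),
        'utf8_mb_strlen'   => $utf8 === false ? false : mb_strlen($utf8, 'UTF-8'),
    );
}
```

On an unaffected charset the three values should match; with the decoded part and 'CP1258', a smaller `iconv_substr_len` would indicate the same truncation.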

I noticed converting to UTF-8, then calling mb_substr seems to work (mb_substr doesn't support these Windows charsets and some others, hence why it's using iconv). Unfortunately that's additional work to get the correct results, but I've had to do that elsewhere too anyway.
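
A minimal sketch of that workaround, assuming the result should be returned in the original charset. `substrViaUtf8` is a hypothetical name used for illustration and is not the code that ended up in mb-wrapper.

```php
<?php
// Hypothetical helper illustrating the workaround; not the mb-wrapper code.
// Converts to UTF-8, takes the substring with mb_substr, then converts the
// slice back, so iconv_substr is never called on the problematic charset.
function substrViaUtf8($text, $charset, $start, $length = null)
{
    $utf8 = iconv($charset, 'UTF-8', $text);
    if ($utf8 === false) {
        throw new RuntimeException("Unable to convert from $charset to UTF-8");
    }

    // mb_substr counts characters rather than bytes when told the input is UTF-8
    $slice = mb_substr($utf8, $start, $length, 'UTF-8');

    // Assumption: the caller expects the slice in the source charset
    $result = iconv('UTF-8', $charset, $slice);
    if ($result === false) {
        throw new RuntimeException("Unable to convert from UTF-8 to $charset");
    }
    return $result;
}
```

With this, the failing call above would become something like `substrViaUtf8($decodedText, 'CP1258', 0, 2037)`, at the cost of two extra conversions per call.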

zbateson commented 3 years ago

Oh! I went in to create a test and it seems I was kind of aware of this:

https://github.com/zbateson/mb-wrapper/blob/718f357861735d463afd9ebf38c002b08d06dcea/tests/MbWrapper/MbWrapperTest.php#L156

I have a comment that reads "// seems to fail only on CP1258, returns incorrect number of characters with iconv_substr". Aah well, I guess time to work that out ;)

zbateson commented 3 years ago

This is fixed in zbateson/mb-wrapper 1.0.1. I released a new mail-mime-parser version 1.3.0 which requires that version, but just updating your dependencies in 1.x will also work.

If you get a chance, please have a look and make sure all is well for you now :)

Lepelley commented 3 years ago

Works perfectly, thank you !