ykarpovich / msg.reader

Outlook Item File (.msg) reader in JavaScript
Apache License 2.0
121 stars 43 forks

HTML support needed #5

Open nestoru opened 7 years ago

nestoru commented 7 years ago

Excellent work I would say. Supporting msg format from JS is a must!

Supporting html is very important. Since I could not reopen https://github.com/ykarpovich/msg.reader/issues/1 and I do not know if closed issues are monitored I am opening a new issue now. The demo page fails to parse the msg in full when html is included. Is it difficult to enhance the library to support that? Perhaps the author can share some ideas?

In the meantime here is a workaround:

  1. Use the msgconvert Linux command to go from msg to eml:
    sudo apt install -y libemail-outlook-message-perl
    cd /tmp
    msgconvert test\ with\ html\ content.msg # creates test\ with\ html\ content.eml
  2. Use https://github.com/nodemailer/mailparser to get the information from the eml, for example:
    git clone https://github.com/nodemailer/mailparser.git
    cd mailparser
    npm install
    cd examples
    node extractInfoFromEml.js /tmp/test\ with\ html\ content.eml
  3. Below is the code for extractInfoFromEml.js:
/* eslint no-console:0 */

'use strict';

const util = require('util');
const fs = require('fs');
const simpleParser = require('../lib/simple-parser.js');
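// The first command-line argument is the path to the .eml file.
const filePath = process.argv[2];

if (!filePath) {
    console.error('Usage: node extractInfoFromEml.js <file.eml>');
    process.exit(1);
}

const input = fs.createReadStream(filePath);

simpleParser(input)
    .then(mail => {
        // Print the whole parsed message, nested up to 22 levels deep.
        console.log(util.inspect(mail, false, 22));
    })
    .catch(err => {
        console.error(err);
    });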

Best regards, - Nestor

Aymkdn commented 6 years ago

HTML support would be fantastic :-)

0sander commented 6 years ago

Hi. HTML support can be easily added by adding the line:

'1013': 'htmlBody',

below line 78 in the msg.reader.js file

getFileData() will then also return htmlBody. To convert to String use:

new TextDecoder("utf-8").decode(fileData.htmlBody);

Maybe this can be added to the code?
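
A minimal sketch of the decoding step described above, assuming the '1013': 'htmlBody' mapping has already been added to msg.reader.js. The helper name decodeHtmlBody is hypothetical, and "utf-8" is only a default guess: a given message may use a different code page.

```javascript
// Hypothetical helper: turn the htmlBody value returned by getFileData()
// into a string. Assumes the '1013': 'htmlBody' mapping is in place.
function decodeHtmlBody(fileData, encoding = 'utf-8') {
    const raw = fileData && fileData.htmlBody;
    if (raw == null) {
        return null; // no HTML property was extracted from this .msg
    }
    if (typeof raw === 'string') {
        return raw; // already a string, nothing to decode
    }
    // Accept a Uint8Array or a plain array of byte values.
    const bytes = raw instanceof Uint8Array ? raw : Uint8Array.from(raw);
    return new TextDecoder(encoding).decode(bytes);
}
```

Usage would look like decodeHtmlBody(msgReader.getFileData()), where msgReader is whatever reader instance the page already has. It returns null when no htmlBody was found, which makes the "only works in some cases" situation easier to detect.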

Aymkdn commented 6 years ago

'1013': 'htmlBody'

I tried it already @0sander, and this is not working as expected in my tests. I thought it could be that simple, but it's not... :(

(I tried with HTML emails coming from Outlook)

0sander commented 6 years ago

I did some more tests and you are right: It only works in some cases - sometimes the htmlBody Array only contains the first 64 elements / characters, while in other cases it contains the whole HTML message.

I have tried with Outlook 2010.

ashsearle commented 5 years ago

@0sander See PR #7 for a fix affecting HTML embedded as binary.

matthiasg commented 5 years ago

is this fixed now that #7 is included ?

visgotti commented 5 years ago

my body html is appearing like - bodyHTML: "���9buT"

any ideas?

matthiasg commented 5 years ago

@visgotti same here. bodyHTML seems to not parse correctly. In my case it only reads 16 bytes from the corresponding block. The same would happen if #7 wasn't applied so that is not it.

maybe @ykarpovich has some idea where that goes wrong ?

Example mail with some attachments: https://drive.google.com/file/d/1Qt1I0w1TTP6H-Z4ZrGnbWcBoehEgEHtS/view?usp=sharing

...
bodyHTML: "yz�buT.��(  B"
...

ashsearle commented 5 years ago

Try changing this line: https://github.com/ykarpovich/msg.reader/blob/master/msg.reader.js#L396

From:

    if (fieldName) {

To:

    if (fieldName && !fields[fieldName]) {

There's something weird with the full HTML being extracted and then getting overwritten as the parser continues traversing through the .msg file's data structure.
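
To illustrate what the proposed guard does, here is a small stand-alone sketch of a "first write wins" assignment. The function setFieldOnce and the sample values are hypothetical and are not part of msg.reader.js; only the condition itself comes from the suggestion above.

```javascript
// Hypothetical helper: keep the first value seen for a field and ignore
// later ones, mirroring the `fieldName && !fields[fieldName]` condition.
function setFieldOnce(fields, fieldName, value) {
    if (fieldName && !fields[fieldName]) {
        fields[fieldName] = value;
        return true; // value was stored
    }
    return false; // unnamed field, or a value was already present
}

const fields = {};
setFieldOnce(fields, 'body', 'full message text');
setFieldOnce(fields, 'body', 'truncated');
console.log(fields.body); // prints "full message text"
```

One caveat of this condition: an empty string or other falsy first value does not count as "already set", so a later value would still replace it.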

matthiasg commented 5 years ago

@ashsearle I did try that just now. No change. The change only triggers on the body field anyway not on the bodyHtml field.

ykarpovich commented 5 years ago

At this moment the issue still exists. In some cases the .msg contains a valid HTML body, but that is really rare. It looks like the HTML body is stored as an RTF body attachment (or something like that).

Anyway, I need to investigate how to retrieve the HTML body in 100% of cases. Based on the specification, it doesn't work for now.

matthiasg commented 5 years ago

@ykarpovich it sounds likely that it is RTF (Outlook should use HTML by default, but maybe it converts internally in some cases). I will try to read the raw data as RTF.

PS: I see there are some JS converters already: https://github.com/iarna/rtf-to-html

PPS: Sometimes Outlook uses a compressed RTF body format, which might complicate things, but that should be stored in another property (https://docs.microsoft.com/en-us/office/client-developer/outlook/mapi/pidtagrtfcompressed-canonical-property)
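
As a rough aid for that investigation, the sketch below inspects the 16-byte header that, as far as I understand the compressed RTF format, precedes a PidTagRtfCompressed value: four little-endian 32-bit fields (compressed size, raw size, compression type, CRC). The function name readRtfHeader is hypothetical, and the sketch only classifies the data; it does not decompress it.

```javascript
// Compression type markers, read as little-endian 32-bit integers.
const RTF_COMPRESSED = 0x75465a4c;   // bytes "LZFu": body is compressed
const RTF_UNCOMPRESSED = 0x414c454d; // bytes "MELA": body is stored as-is

// Hypothetical helper: read the header of a compressed-RTF property value.
// Returns null when the input is too short to contain a header.
function readRtfHeader(bytes) {
    if (!(bytes instanceof Uint8Array) || bytes.length < 16) {
        return null;
    }
    const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
    const compType = view.getUint32(8, true);
    return {
        compSize: view.getUint32(0, true), // size of the data after this field
        rawSize: view.getUint32(4, true),  // size of the RTF once decompressed
        compressed: compType === RTF_COMPRESSED,
        uncompressed: compType === RTF_UNCOMPRESSED,
    };
}
```

If neither flag is true, the bytes are probably not a compressed RTF stream at all, which would be a hint that the wrong block was read.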

adjenks commented 4 years ago

In some cases Outlook doesn't even parse its own format correctly...

michael-barchy-cpasbxl commented 3 years ago

Any news on this one and RTF ? This might help https://github.com/HiraokaHyperTools/msgreader

ejaz-abhifinance commented 3 years ago

How can I use the commands from the msgconvert workaround in the first comment from code? Kindly help, @nestoru.
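
One possible way to drive the workaround from Node.js, sketched under a few assumptions: msgconvert is installed and on the PATH, mailparser has been installed from npm (so it is loaded as require('mailparser') rather than from a repository checkout), and the .eml file ends up next to the .msg file when msgconvert runs in that directory, as in the original steps. The function names emlPathFor and parseMsg are hypothetical.

```javascript
'use strict';

const path = require('path');
const fs = require('fs');
const { execFile } = require('child_process');
const { promisify } = require('util');

const execFileAsync = promisify(execFile);

// "mail.msg" becomes "mail.eml" in the same directory.
function emlPathFor(msgPath) {
    const full = path.resolve(msgPath);
    const base = path.basename(full, path.extname(full));
    return path.join(path.dirname(full), base + '.eml');
}

// Convert the .msg with msgconvert, then parse the resulting .eml.
async function parseMsg(msgPath) {
    const full = path.resolve(msgPath);
    // Run msgconvert inside the file's own directory, like `cd /tmp` above.
    await execFileAsync('msgconvert', [path.basename(full)], {
        cwd: path.dirname(full),
    });
    // Loaded lazily; assumes `npm install mailparser` has been run.
    const { simpleParser } = require('mailparser');
    return simpleParser(fs.createReadStream(emlPathFor(full)));
}
```

A call such as parseMsg('/tmp/test with html content.msg').then(mail => console.log(mail.html)) would then print the HTML body. Using execFile with an argument array avoids having to escape the spaces in the file name.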

bilal68 commented 3 years ago

@ykarpovich In my case I am reading the file locally on the server from a folder, and it didn't give me the HTML body. But if I parse the same file through the example you provided, the content shows up there. I have attached an image of the response after parsing on the server side and an image of the example as well. What am I doing wrong? I'd appreciate a little help. Thanks in advance.

[image: server-side response] [image: example page]

bilal68 commented 3 years ago

how to get the image having "cid" link src from the html body. any idea?

skalg commented 3 years ago

Is there any way to open a .msg file automatically with this HTML page, without having to browse for a file? Something like 'start firefox.exe http://localhost/example c:\file.msg', or a JavaScript command to show it in a browser?

Thank you
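
A page cannot read c:\file.msg directly from disk without user interaction, but if the file is served over HTTP the page could load it from a query parameter. The sketch below assumes a hypothetical "file" parameter, and assumes that msg.reader.js exposes a global MSGReader constructor that accepts an ArrayBuffer; getFileData() is the method mentioned earlier in this thread.

```javascript
// Hypothetical helper: extract the .msg location from a query string such
// as "?file=/files/mail.msg". Returns null if it is missing or not a .msg.
function msgUrlFromQuery(search) {
    const file = new URLSearchParams(search).get('file');
    if (!file || !file.toLowerCase().endsWith('.msg')) {
        return null;
    }
    return file;
}

// Browser-only part: download the file and hand the bytes to the reader.
// MSGReader is assumed to be provided by msg.reader.js on the page.
async function openFromQuery() {
    const url = msgUrlFromQuery(window.location.search);
    if (!url) {
        return null;
    }
    const response = await fetch(url);
    if (!response.ok) {
        throw new Error('Could not load ' + url);
    }
    const reader = new MSGReader(await response.arrayBuffer());
    return reader.getFileData();
}
```

With something like this in place, the page could be opened as http://localhost/example/?file=/files/mail.msg, provided the web server also serves that path.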

FROGGS commented 2 years ago

how to get the image having "cid" link src from the html body. any idea?

For that I added the following line to the NAME_MAPPING section:

'3701': 'embeddedImage',

EDIT: I just realized that this section already exists, though I don't know what it does:

CLASS_MAPPING: {
    ATTACHMENT_DATA: '3701'
},

I can access embedded images like this: [image]

Next step is to replace the cid URLs with image blobs.
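
The replacement step could be sketched roughly as follows. This version uses data: URLs instead of blob URLs so that it also runs outside a browser. The shape of the image entries (mimeType and content fields, keyed by content ID in a Map) is an assumption for illustration, not something msg.reader.js is known to return.

```javascript
// Encode a Uint8Array (or array of byte values) as base64.
// btoa is available in browsers and in recent Node.js versions.
function bytesToBase64(bytes) {
    let binary = '';
    for (const b of bytes) {
        binary += String.fromCharCode(b);
    }
    return btoa(binary);
}

// Replace src="cid:..." references with data: URLs.
// `images` is a Map from content ID to { mimeType, content } (hypothetical shape).
function replaceCidUrls(html, images) {
    const pattern = /(src\s*=\s*["'])cid:([^"']+)(["'])/gi;
    return html.replace(pattern, (match, pre, cid, post) => {
        const image = images.get(cid);
        if (!image) {
            return match; // unknown content ID: leave the reference untouched
        }
        const dataUrl = 'data:' + image.mimeType + ';base64,' +
            bytesToBase64(image.content);
        return pre + dataUrl + post;
    });
}
```

In a browser, URL.createObjectURL(new Blob([image.content], { type: image.mimeType })) could be used in place of the data: URL if blob URLs are preferred for large images.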

FROGGS commented 2 years ago

About RTF-encoded HTML: I was curious, so I looked at all the extracted properties that do not get a name and are therefore dropped. I stumbled upon 1009 and called it bodyRTF. After fighting for hours to get npm packages to work that are meant to run under node.js (I need a browser solution), I used the following code for stripping the RTF annotation: https://stackoverflow.com/a/188877/3989858

I did have to modify it a little to make it work, and I still have to fiddle with embedded images, but it looks promising so far.

I could provide this as an angular project if someone is interested.
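
Before stripping the RTF markup, it may help to check whether the RTF actually wraps an HTML body. As far as I know, RTF that encapsulates HTML carries the \fromhtml1 control word in its header. The sketch below is a heuristic check on already decompressed RTF text; the function name isEncapsulatedHtml is hypothetical.

```javascript
// Heuristic: RTF that encapsulates HTML contains the \fromhtml1 control
// word. The input must be the decompressed RTF as a string.
function isEncapsulatedHtml(rtf) {
    if (typeof rtf !== 'string') {
        return false;
    }
    // The lookahead rejects longer numeric parameters such as \fromhtml10.
    return /\\fromhtml1(?![0-9])/.test(rtf);
}
```

When the check returns false, the RTF is more likely a native rich-text body, and an RTF-to-HTML converter would be the better route than stripping annotations.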

eduardomb08 commented 1 year ago

I could provide this as an angular project if someone is interested.

@FROGGS , do you have a fork of this project with your changes/fixes?