Decrypting same file produces different output than Python tool

zurmokeeper / officecrypto-tool

officecrypto-tool is a JavaScript library for decrypting and encrypting Office (Excel/PPT/Word) files.

https://www.npmjs.com/package/officecrypto-tool

MIT License

Decrypting same file produces different output than Python tool #33

Closed: guillenotfound closed this issue 3 months ago

guillenotfound commented 4 months ago

This entry is not showing up in the Python document:

While the file seems to work well in MS Office, it won't open in https://demo.pdftron.com/demo/, whereas the file decrypted by the Python tool does. I understand that if the file is viewable in the official viewer we should assume that the decryption is correct, but since both tools do the same job I was expecting them to produce the same output file.

This is the original file I'm using (pwd: Password1234_): https://github.com/nolze/msoffcrypto-tool/blob/master/tests/inputs/rc4cryptoapi_password.doc

The zip contains the original file (the one I mention above) and the two decrypted files, one produced by this tool and one by the Python tool: files.zip

Additionally this is how it looks in WPS Office:

zurmokeeper commented 3 months ago

@guillenotfound

The rc4cryptoapi_password_node.doc file you provided displays fine for me in both MS Office and WPS, as shown below:

I wonder if it's your version of WPS?

The files decrypted by node.js and by Python are indeed not exactly the same: the node.js output contains an extra "\u0001Sh33tJ5" stream. It now seems that PDFTron does not recognize this stream and so cannot parse the file, whereas WPS and MS Office can.

I'll see if I can get node.js to produce its output without the "\u0001Sh33tJ5" stream.
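
One way to confirm which streams differ is to list the entry paths of both decrypted containers and diff them. The sketch below assumes each file has already been parsed into a list of path strings (for example the `FullPaths` array that the `cfb` package exposes on a parsed container). The helper name `diffStreams` and the sample paths are hypothetical and only illustrate the comparison.

```javascript
// Hypothetical helper: compare the entry paths of two parsed CFB containers.
// Each argument is assumed to be an array of path strings, such as the
// `FullPaths` array of a container parsed by the `cfb` package.
function diffStreams(nodePaths, pythonPaths) {
  const nodeSet = new Set(nodePaths);
  const pythonSet = new Set(pythonPaths);
  return {
    onlyInNode: nodePaths.filter((p) => !pythonSet.has(p)),
    onlyInPython: pythonPaths.filter((p) => !nodeSet.has(p)),
  };
}

// Illustrative paths only, not taken from the actual files in this issue.
const nodePaths = ['Root Entry/', 'Root Entry/1Table', 'Root Entry/\u0001Sh33tJ5'];
const pythonPaths = ['Root Entry/', 'Root Entry/1Table'];

console.log(diffStreams(nodePaths, pythonPaths));
```

Any path reported under `onlyInNode` is a candidate for what a stricter parser might be rejecting.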

guillenotfound commented 3 months ago

I've just updated WPS Office from 3.2.0 to 5.7.3 and it seems to be working fine 👍

Based on that it would be fair to close this issue, since I already opened one on PDFTron's end. But before doing that I ran one more test by uploading the provided files to Google Drive.

I'm able to open the Python one but not the Node one, so there are some differences that prevent those files from being rendered in other tools. I'd understand if you didn't want to invest time in this, since the official tools work just fine and these are generally very old files. If that's the case, please give me a few pointers and I'll try to figure out what other differences there are besides the extra stream.

Thanks again!

zurmokeeper commented 3 months ago

@guillenotfound

The key point is this "\u0001Sh33tJ5" stream. The node.js output has it and the Python output does not, which is probably what makes PDFTron fail to parse the file. I investigated briefly, and it may be a bug in how cfb handles the "\u0001Sh33tJ5" stream. The officecrypto-tool source code does try to delete this stream, but it ends up not being deleted. If you want to handle it yourself, just delete the "\u0001Sh33tJ5" stream.

guillenotfound commented 3 months ago

I saw that you have this in place: https://github.com/zurmokeeper/officecrypto-tool/blob/1e0bb680d2a54c1946f9f95083c85d4442ae3633/src/util/doc97.js#L335

But the problem is that the write step itself adds that stream: if you parse the output right after calling CFB.write, the stream will be present.

zurmokeeper commented 3 months ago

@guillenotfound

Yes, that's the problem: CFB.write always adds the '\u0001Sh33tJ5' stream, when that should be up to the caller to decide. I'm still trying to figure out how to get rid of it.

guillenotfound commented 3 months ago

Interestingly, I removed that stream manually and tried to open the file again, without success. So we can rule this one out.

It looks like this change fixes the issue for both PDFTron and Google Docs. I'm guessing they look for WordDocument and never fall back to wordDocument.

Do you think this could be a valid fix? I don't have enough knowledge on the matter.
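
The guess above is that some readers look up the stream by its exact name and do not tolerate a different letter case. A minimal sketch of a check for that kind of mismatch follows, assuming the expected stream name is `WordDocument` as the comment suggests. The function `findCaseMismatches` and the sample names are hypothetical, not part of officecrypto-tool.

```javascript
// Hypothetical check: report entries whose name matches an expected stream
// name only when letter case is ignored (e.g. "wordDocument" vs "WordDocument").
function findCaseMismatches(entryNames, expectedNames) {
  const mismatches = [];
  for (const expected of expectedNames) {
    if (entryNames.includes(expected)) continue; // exact match, nothing to report
    const lower = expected.toLowerCase();
    const found = entryNames.find((n) => n.toLowerCase() === lower);
    if (found !== undefined) mismatches.push({ expected, found });
  }
  return mismatches;
}

// Illustrative entry names; a real list would come from the parsed container.
const entryNames = ['Root Entry', '1Table', 'wordDocument'];
console.log(findCaseMismatches(entryNames, ['WordDocument', '1Table']));
```

Running a check like this on the written output could flag a wrongly cased stream name before the file reaches a stricter viewer.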

zurmokeeper commented 3 months ago

@guillenotfound

Good, that's the problem. I can't believe I got the case of the "w" wrong. Feel free to open a PR to fix it!

zurmokeeper commented 3 months ago

By the way, how exactly did you delete the stream manually? Can I see your code? Or did you use another tool to delete it?

guillenotfound commented 3 months ago

I used the debugger inside CFB.write(output), then just used .splice(pos, 1) to remove it from both. It was right after rebuild_cfb(cfb);, which is the one adding that stream, IIRC. But from my understanding that stream should just be ignored.
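
The manual removal described above can be sketched as a small helper. This assumes "both" refers to the two parallel arrays that a `cfb` container keeps, `FullPaths` and `FileIndex`. The helper name `removeEntryByName` and the sample container are hypothetical, and the sketch only edits an in-memory object; it does not rewrite a file.

```javascript
// Hypothetical helper: remove one entry from a container-like object that
// keeps two parallel arrays, `FullPaths` (path strings) and `FileIndex`
// (entry objects with a `name`), assumed to mirror the `cfb` package's layout.
function removeEntryByName(container, name) {
  const pos = container.FileIndex.findIndex((entry) => entry.name === name);
  if (pos === -1) return false; // nothing to remove
  // Remove the same position from both arrays so they stay aligned.
  container.FileIndex.splice(pos, 1);
  container.FullPaths.splice(pos, 1);
  return true;
}

// Illustrative container, not a real parsed file.
const sampleContainer = {
  FullPaths: ['Root Entry/', 'Root Entry/WordDocument', 'Root Entry/\u0001Sh33tJ5'],
  FileIndex: [{ name: 'Root Entry' }, { name: 'WordDocument' }, { name: '\u0001Sh33tJ5' }],
};
removeEntryByName(sampleContainer, '\u0001Sh33tJ5');
console.log(sampleContainer.FullPaths);
```

As noted earlier in the thread, removing this stream did not by itself make the file open in PDFTron, so this is mainly useful for ruling the stream out.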

We are also losing metadata such as dates because we recreate the doc. Maybe creating a copy and then overriding only the streams that need to be decrypted would do the trick.
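
The copy-and-override idea could look roughly like the sketch below. It assumes a container-like object with `FullPaths` and `FileIndex` arrays whose entries carry `name`, `content` and `size`, plus other fields such as a modification time (shown here as `mt`, an assumed field name). The function `overrideStreams` is hypothetical and is not how officecrypto-tool currently works.

```javascript
// Hypothetical sketch: copy every entry of the original container and replace
// only the content of the streams that were decrypted, leaving all other
// fields (for example timestamps) as they were.
function overrideStreams(original, decryptedByName) {
  return {
    ...original,
    FullPaths: [...original.FullPaths],
    FileIndex: original.FileIndex.map((entry) => {
      const content = decryptedByName.get(entry.name);
      if (content === undefined) return { ...entry }; // untouched stream
      return { ...entry, content, size: content.length };
    }),
  };
}

// Illustrative data only; `mt` stands in for whatever metadata the entry holds.
const encryptedDoc = {
  FullPaths: ['Root Entry/', 'Root Entry/WordDocument'],
  FileIndex: [
    { name: 'Root Entry', mt: new Date(0) },
    { name: 'WordDocument', mt: new Date(0), content: Uint8Array.from([9, 9]), size: 2 },
  ],
};
const decryptedStreams = new Map([['WordDocument', Uint8Array.from([1, 2, 3])]]);
const decryptedDoc = overrideStreams(encryptedDoc, decryptedStreams);
console.log(decryptedDoc.FileIndex[1].size, decryptedDoc.FileIndex[1].mt);
```

Because entries are copied rather than rebuilt, fields that the decryption step does not touch carry over unchanged.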