Closed: guillenotfound closed this issue 3 months ago
@guillenotfound
The rc4cryptoapi_password_node.doc file you provided displays fine for me in both MS Office and WPS, as shown below:
I wonder if it's your version of WPS?
The files decrypted by Node.js and by Python are indeed not exactly the same: the Node.js output contains one extra stream, "\u0001Sh33tJ5". It now seems that PDFTron does not recognize this stream and so cannot parse the file, while WPS and MS Office can.
I'll see if I can get Node.js to return the file without the "\u0001Sh33tJ5" stream.
I've just updated WPS Office from 3.2.0 to 5.7.3 and it seems to be working fine 👍
Based on that it would be fair to close this issue since I already opened one on PDFTron's end. But before doing that I did yet another test by uploading provided files to Google Drive.
I'm able to open the Python one but not the Node one, so there are some differences that prevent those files from being rendered in other tools. I'd understand if you don't want to invest time in this, since the official tools work just fine and these are generally very old files. If that's the case, please give me a few pointers and I'll try to figure out what other differences there are besides the extra stream.
Thanks again!
@guillenotfound
The key point is this "\u0001Sh33tJ5" stream.
The Node.js output has it and the Python output does not, which is probably what makes PDFTron fail to parse the file. I investigated briefly, and it may be a bug in how cfb handles the "\u0001Sh33tJ5" stream. The officecrypto-tool source code does try to delete this stream, but it ends up not being deleted. If you want to handle it yourself, just delete the "\u0001Sh33tJ5" stream.
I saw that you have this in place: https://github.com/zurmokeeper/officecrypto-tool/blob/1e0bb680d2a54c1946f9f95083c85d4442ae3633/src/util/doc97.js#L335
But the problem is that CFB.write adds that stream when writing: if you parse the output right after calling CFB.write, the stream will be present.
@guillenotfound
Yes, that's the problem: CFB.write will always add the '\u0001Sh33tJ5' stream, although that should be up to the caller to decide. I'm still trying to figure out how to get rid of it.
Interestingly, I removed that stream manually and tried to open the file again, without success, so we can rule this one out.
It looks like this change fixes the issue for both PDFTron and Google Docs. I'm guessing they look for WordDocument and never fall back to wordDocument.
Do you think this could be a valid fix? I don't have enough knowledge on the matter.
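To illustrate the guess about WordDocument versus wordDocument, here is a minimal sketch of how a strict reader and a lenient reader could differ. Both lookup functions and the list of stream names are made up for the example; this is not how PDFTron, Google Docs or MS Office are known to work.

```javascript
// Hypothetical lookups, for illustration only; not taken from any real tool.

// A strict reader: the stream name must match exactly, including case.
function findStreamStrict(names, wanted) {
  return names.includes(wanted) ? wanted : null;
}

// A lenient reader: compares names without regard to case.
function findStreamLenient(names, wanted) {
  const lower = wanted.toLowerCase();
  const hit = names.find((n) => n.toLowerCase() === lower);
  return hit === undefined ? null : hit;
}

// Made-up stream list standing in for a file written with the lowercase "w".
const nodeOutput = ["wordDocument", "1Table"];

const strictHit = findStreamStrict(nodeOutput, "WordDocument");
const lenientHit = findStreamLenient(nodeOutput, "WordDocument");
console.log(strictHit, lenientHit);
```

If some viewers behave like the strict lookup, writing the stream as WordDocument would be enough for them to find it, which would match what the change above seems to show.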
@guillenotfound
Good, that's the problem. I can't believe I got the case of the "w" wrong. Feel free to open a PR to fix it!
By the way, how exactly did you delete the stream manually? Can I see your code? Or did you use another tool to delete it?
I used the debugger inside CFB.write(output), then just called .splice(pos, 1) to remove the entry from both. It was right after rebuild_cfb(cfb);, which is the one adding that stream IIRC. But from my understanding that stream should just be ignored.
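For reference, the manual removal could be sketched roughly like this outside the debugger. It assumes that "both" means the two parallel arrays FullPaths and FileIndex on the container, and that each FileIndex entry has a name field; removeEntry is a made-up helper and the container below is a hand-written stand-in, not real parser output.

```javascript
// Hypothetical helper: drop every entry called `name` from a CFB-style container.
// Assumption: `FullPaths` and `FileIndex` are parallel arrays (same index = same entry).
function removeEntry(container, name) {
  let removed = 0;
  // Walk backwards so that splicing does not shift indices still to be visited.
  for (let pos = container.FileIndex.length - 1; pos >= 0; pos--) {
    if (container.FileIndex[pos].name === name) {
      container.FullPaths.splice(pos, 1);
      container.FileIndex.splice(pos, 1);
      removed++;
    }
  }
  return removed;
}

// Hand-written stand-in for a parsed container.
const container = {
  FullPaths: ["Root Entry/", "Root Entry/WordDocument", "Root Entry/\u0001Sh33tJ5"],
  FileIndex: [{ name: "Root Entry" }, { name: "WordDocument" }, { name: "\u0001Sh33tJ5" }],
};

const removedCount = removeEntry(container, "\u0001Sh33tJ5");
console.log(removedCount, container.FullPaths);
```

Note that this only helps if it runs after the step that adds the stream; doing it before the write would just let the stream be added again.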
We are also losing metadata such as dates because we recreate the document. Maybe creating a copy and then overriding only the streams that need to be decrypted would do the trick.
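A rough sketch of that copy-and-override idea, under several assumptions: the container keeps its entries in a FileIndex array, each entry carries name, content, size and timestamp fields (called ct and mt here), and the decrypted bytes are available per stream name. overrideStreams is a made-up helper, not part of officecrypto-tool or cfb.

```javascript
// Hypothetical helper: copy a container and replace only the listed streams,
// so every other field on each entry (timestamps included) is carried over.
function overrideStreams(original, decryptedByName) {
  const copy = {
    ...original,
    FullPaths: [...original.FullPaths],
    // Shallow-copy each entry so the original container is left untouched.
    FileIndex: original.FileIndex.map((entry) => ({ ...entry })),
  };
  for (const entry of copy.FileIndex) {
    if (Object.prototype.hasOwnProperty.call(decryptedByName, entry.name)) {
      entry.content = decryptedByName[entry.name];
      entry.size = entry.content.length;
    }
  }
  return copy;
}

// Hand-written stand-in for an encrypted document; field names are assumptions.
const created = new Date("2001-02-03T04:05:06Z");
const encrypted = {
  FullPaths: ["Root Entry/", "Root Entry/WordDocument"],
  FileIndex: [
    { name: "Root Entry", ct: created, mt: created, size: 0 },
    { name: "WordDocument", ct: created, mt: created, size: 3, content: Buffer.from([9, 9, 9]) },
  ],
};

const decryptedDoc = overrideStreams(encrypted, {
  WordDocument: Buffer.from([1, 2, 3, 4]),
});
console.log(decryptedDoc.FileIndex[1]);
```

Whether the dates then survive the write step is a separate question that would need checking against the library.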
This entry is not showing up in the Python document:
While the file seems to work well in MS Office, it won't work when I try it with https://demo.pdftron.com/demo/, whereas the Python one does. I understand that if the file is viewable in the official viewer we should assume the decryption is correct, but since both tools do the same thing I was expecting them to produce the same output file.
This is the original file I'm using (pwd: Password1234_): https://github.com/nolze/msoffcrypto-tool/blob/master/tests/inputs/rc4cryptoapi_password.doc
The zip contains the original file (the one I mention above) and the two decrypted files, one produced by this tool and one by the Python one: files.zip
Additionally this is how it looks in WPS Office: