zelon88 / HRConvert2

A self-hosted, drag-and-drop & nosql file conversion server & share tool that supports 445 file formats in 13 languages.
https://github.com/zelon88/HRConvert2
GNU General Public License v3.0
1.11k stars 65 forks

Feature Request: Decrypting a PDF file without OWNER password #10

Open AlexanderSch90 opened 3 years ago

AlexanderSch90 commented 3 years ago

Briefly, a secured PDF file has two types of password: OWNER and USER. The OWNER password is used to enforce permissions. The USER password is used to open the PDF file.

Sometimes downloaded PDFs (such as bank statements) are secured/encrypted by default. I want to decrypt these with HRConvert2 even without the OWNER password.

Please add this feature.

zelon88 commented 2 years ago

Sounds fun! I'll give it a shot.

zelon88 commented 2 years ago

Ok, I have an update on this.

I have scoured GitHub for recent POCs of this being done. Most of the results seem to be from several years ago. The most promising method seems to be: http://pdfcrack.sourceforge.net/

You can install it with sudo apt-get install pdfcrack. It uses a CPU-based brute-force method to try to crack the USER password. One fault I noticed is that it takes an insanely long time to guess a password. When I tested it with a PDF whose password was 123456 and a wordlist containing 123456, it was fast, but that is the best case. I think if we were to implement a feature like this we would need to do some heuristics beforehand: try only numbers, or wordlists of common passwords, before either moving on to a full brute-force attempt or simply giving up. One option would be to start the scan and then instruct the user to come back later to check on the status. We would give them a unique 32-digit code they can enter to check the status of their file later, and then defer automatic file deletion until they come back (or a minimum threshold has elapsed). Either that, or the user keeps the page open and we keep refreshing the status.
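The "cheap guesses first" idea and the status code can be sketched roughly as follows. This is only an illustration in Python, not HRConvert2 code (which is PHP): the function names, the ordering (common passwords before digit-only strings), and the use of 32 hexadecimal characters for the code are all assumptions.

```python
import itertools
import secrets
import string


def staged_candidates(common_passwords, max_digits=6):
    """Yield cheap guesses first: common passwords, then digit-only strings.

    Hypothetical helper; the ordering and the max_digits cutoff are
    assumptions, not something pdfcrack or HRConvert2 defines.
    """
    seen = set()
    # Stage 1: a short list of common passwords, de-duplicated.
    for word in common_passwords:
        if word not in seen:
            seen.add(word)
            yield word
    # Stage 2: every digit-only string from 1 to max_digits characters.
    for length in range(1, max_digits + 1):
        for combo in itertools.product(string.digits, repeat=length):
            candidate = "".join(combo)
            if candidate not in seen:
                yield candidate
    # After this the caller either falls back to full brute force or gives up.


def new_status_code():
    """Return a random 32-character hex code for checking on a job later."""
    return secrets.token_hex(16)
```

The generator could be written out to a temporary wordlist file and handed to a cracking tool, so the expensive full brute-force stage only starts once these stages are exhausted.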

However, this opens the server up to a potential denial-of-service attack. Cracking eats up a ton of CPU and might realistically still never find a password. One user could keep submitting these requests until the server has no more resources left. It looks like there's no way to set an execution cap in pdfcrack, which would have bought us a little more time. The alternative is to create a queue with a limited number of workers. Either way, we would have to cap execution for each request at some point, even if we haven't found the password.
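Since the cap would have to be enforced from outside the cracking tool, one way to picture it is a wrapper that kills the process after a deadline, combined with a small fixed-size worker pool. This is a minimal sketch in Python under stated assumptions: run_capped, submit_job, and the pool size of 2 are hypothetical names and values chosen for illustration.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_capped(cmd, cap_seconds):
    """Run cmd and return its stdout, or None if it ran past cap_seconds.

    subprocess.run kills the child process when the timeout expires,
    so a job that never finds a password cannot run forever.
    """
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=cap_seconds
        )
    except subprocess.TimeoutExpired:
        return None
    return done.stdout


# At most two jobs run at once (assumed value); extra submissions wait in the queue.
POOL = ThreadPoolExecutor(max_workers=2)


def submit_job(cmd, cap_seconds):
    """Queue a capped job and return a Future for its result."""
    return POOL.submit(run_capped, cmd, cap_seconds)


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; a real job might be
    # ["pdfcrack", "-f", "locked.pdf", "-w", "common.txt"].
    future = submit_job([sys.executable, "-c", "print('done')"], 10)
    print(future.result())
```

A bounded pool limits how much CPU the feature can consume at once, but it does not stop one user from filling the queue, so per-user rate limiting would still be needed.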

Ultimately this is a TON of programming and debugging for a feature that is ill-placed in HRConvert2. If it were a fast process that the user didn't have to wait for, then I would say let's go for it, but there is no guarantee the operation will succeed (in fact most requests would probably fail or time out) and there's no good core mechanism for making the user wait. HRConvert2 was meant to create temporary scratch space for anonymized users. This feature would be better suited to HRCloud3, which is in progress. In fact, the recent refactor of HRConvert2 will probably end up serving as the basis for the HRCloud3 cloudCore. When this happens I will experiment with PDF cracking some more, because in that environment it makes more sense to ask the user to wait for the operation to complete.

Then I tried: https://github.com/machine1337/pdfcrack

This looked really flashy and promising. It was obviously made to work on Kali Linux, as the pdfid and pdf-parser packages are available on Kali and this script tries to install them using aptitude. No worries, we just remove the dependency installer code, download the pdfid.py and pdf-parser.py scripts from https://blog.didierstevens.com/programs/pdf-tools/ and hard-code the paths. Now we get to see that this is just another CPU-based brute-force approach. It actually just supplies its own wordlist to regular PDFCrack. It can also generate passwords with Hashcat and PDF2John (both of which utilize the GPU), but then it just supplies those back to PDFCrack to see if they are valid. It basically just combines several of the methods one would use to brute-force a PDF password into one script. I like the methodology, but if this is just calling a bunch of dependencies we can do the job better in PHP and cut out the middleman. At this point I stopped testing this program because I know what the results will be. On its best day this program will be able to crack a PDF password somewhat faster than pure PDFCrack, and probably comparable to whatever heuristics we apply using PHP.

Next I looked at: https://github.com/philpem/cuda_pdfcrack

I reviewed the code for cuda_pdfcrack first and found that it obviously requires NVIDIA CUDA support with a full graphics stack on the server. This would be problematic to add to HRConvert2 as a dependency because many home servers do not have the hardware support for something like this. Even I am developing this stuff in virtual machines where there is no VGA passthrough. But I kept going a little bit because I was curious. Usage is somewhat hacky. You first need to run vanilla PDFCrack to get the password hashes. Then you submit those to cuda_pdfcrack and it uses the GPU to brute-force the password. If we used this method we would get that information using the pdfid and pdf-parser tools instead of PDFCrack.

Some other notable tools, which specifically state that they cannot bypass a Document Open password:
https://github.com/SeppPenner/PdfPasswordRemover
https://github.com/jakepetroules/littlebirdy

An important note about ALL of these tools is their age. These tools all assume a 4-digit minimum password length, which was changed to 6 digits in Acrobat DC version 21.005.20048. This seems to have been a client-side change, meaning it's impossible to tell by version number which files have a 4-digit minimum password length and which have a 6-digit minimum.
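To put the minimum-length point in numbers, the size of the search space can be worked out directly. The sketch below is illustrative Python; the alphabet sizes (10 digits, 95 printable ASCII characters) and any guess rate passed to worst_case_hours are assumptions, not measurements of PDFCrack.

```python
def keyspace(alphabet_size, min_len, max_len):
    """Count every string with length between min_len and max_len inclusive."""
    return sum(alphabet_size ** n for n in range(min_len, max_len + 1))


def worst_case_hours(candidates, guesses_per_second):
    """Hours needed to try every candidate at a given (assumed) guess rate."""
    return candidates / guesses_per_second / 3600


if __name__ == "__main__":
    # Digits only versus all 95 printable ASCII characters, exactly 6 long.
    print(keyspace(10, 6, 6))
    print(keyspace(95, 6, 6))
    # The 50,000 guesses per second figure is a placeholder, not a benchmark.
    print(worst_case_hours(keyspace(95, 6, 6), 50_000))
```

Moving from 4 to 6 characters multiplies the exhaustive search by the square of the alphabet size, which is why a tool tuned for short passwords degrades so quickly.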

In conclusion, it is possible to crack the passwords in a PDF, although it is extremely time and resource consuming. Because a cracking operation runs so long, I would need to write at least 500 lines of additional code just to perform heuristics on the PDF, then another 500 to try to crack it using whatever hardware is available, then another 500 lines to handle the user waiting for the operation to complete or leaving and coming back. Even then the success rate might be 15% in cases where the server has no GPU and maybe... MAYBE 50% on servers that have GPU capabilities. I suspect the only passwords we would ever discover would be generic ones: zip codes, common words, etc. If the PDF is using modern 128 or 256 bit AES encryption you can just forget about opening it.

This research was very valuable to the HonestRepair product line, even if the feature isn't a good fit for HRConvert2 at the moment. Thank you for your suggestion!

zelon88 commented 2 years ago

More reading...
https://blog.didierstevens.com/2017/12/26/cracking-encrypted-pdfs-part-1/
https://blog.didierstevens.com/2017/12/27/cracking-encrypted-pdfs-part-2/
https://blog.didierstevens.com/2017/12/28/cracking-encrypted-pdfs-part-3/
https://blog.didierstevens.com/2017/12/29/cracking-encrypted-pdfs-conclusion/

Oredna commented 2 years ago

@zelon88 I don't know how they do it, but https://smallpdf.com/unlock-pdf unlocks files much faster. Also, if you open a locked PDF in Firefox you can bypass the printing restriction and then just save it as a PDF. Is this something that can be replicated?

zelon88 commented 1 year ago

Thanks for supporting the project with your suggestion.

I will give this a test and report back.

Oredna commented 1 year ago

Any update? Can we expect it to be included in the 3.2 version? Is there any way we can support the development, through money or other means?