Closed SailingSteve closed 1 year ago
(WeVoteServerPy3.7) stevepodell@Steves-MacBook-Pro-32GB-Oct-2018 pdfminer % pwd
/Users/stevepodell/PycharmProjects/pdfminer
(WeVoteServerPy3.7) stevepodell@Steves-MacBook-Pro-32GB-Oct-2018 pdfminer %
https://www.idrsolutions.com/online-pdf-to-html-converter does the conversion perfectly, but it is a paid service: $400 for 500 pages per month.
Try https://github.com/HazyResearch/pdftotree
Load https://www.activestate.com/ to get access to python3-tk
Skip: unmaintained warning
The first one I tried was https://pypi.org/project/pdfminer/
https://github.com/euske/pdfminer
Stars/forks 4.2k/1.4k, 177 issues, updated 2 months ago
Output: https://wevote.s3-us-west-1.amazonaws.com/output2.html
I did a pip uninstall in between, since the two packages seem intertwined.
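As far as I know, both projects install a top-level package named pdfminer, which is likely why they look intertwined. A minimal sketch for checking which distribution is actually present before and after the uninstall, assuming Python 3.8+ for importlib.metadata (the 3.7 environment here would need the importlib_metadata backport). The helper name installed_versions is hypothetical:

```python
from importlib import metadata


def installed_versions(dist_names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in dist_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions
```

Calling installed_versions(["pdfminer", "pdfminer.six"]) in the virtualenv should show at a glance whether one, both, or neither is installed.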
The second one I tried (and the winner) was https://github.com/pdfminer/pdfminer.six
Stars/forks 1.7k/1.4k, 65 issues, updated 6 days ago
Now handles Py 3
(WeVoteServerPy3.7) stevepodell@Steves-MacBook-Pro-32GB-Oct-2018 pdfminerSix % /Users/stevepodell/PycharmEnvironments/WeVoteServerPy3.7/bin/pdf2txt.py
/Users/stevepodell/PycharmEnvironments/WeVoteServerPy3.7/bin/pdf2txt.py: line 2: A command line tool for extracting text and images from PDF and
output it to plain text, html, xml or tags.: command not found
/Users/stevepodell/PycharmEnvironments/WeVoteServerPy3.7/bin/pdf2txt.py: line 3: import: command not found
/Users/stevepodell/PycharmEnvironments/WeVoteServerPy3.7/bin/pdf2txt.py: line 4: import: command not found
/Users/stevepodell/PycharmEnvironments/WeVoteServerPy3.7/bin/pdf2txt.py: line 5: import: command not found
/Users/stevepodell/PycharmEnvironments/WeVoteServerPy3.7/bin/pdf2txt.py: line 7: import: command not found
/Users/stevepodell/PycharmEnvironments/WeVoteServerPy3.7/bin/pdf2txt.py: line 8: import: command not found
/Users/stevepodell/PycharmEnvironments/WeVoteServerPy3.7/bin/pdf2txt.py: line 12: syntax error near unexpected token OUTPUT_TYPES'
/Users/stevepodell/PycharmEnvironments/WeVoteServerPy3.7/bin/pdf2txt.py: line 12: OUTPUT_TYPES = ((".htm", "html"),'
(WeVoteServerPy3.7) stevepodell@Steves-MacBook-Pro-32GB-Oct-2018 pdfminerSix %
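The "import: command not found" lines are what a shell prints when it executes a Python file as a shell script. That usually means the installed file has no usable "#!" line; this is a guess at the cause, not something confirmed in this thread. A small sketch to check it, with the hypothetical helper names first_line and has_python_shebang:

```python
def first_line(path):
    """Return the first line of a file as text, without the trailing newline."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return handle.readline().rstrip("\n")


def has_python_shebang(path):
    """True if the file starts with a #! line that mentions python."""
    line = first_line(path)
    return line.startswith("#!") and "python" in line
```

If has_python_shebang returns False for the installed pdf2txt.py, running it through the interpreter (python pdf2txt.py ...) avoids the problem.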
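For https://github.com/pdfminer/pdfminer.six, running the installed script directly failed because the shell tried to interpret it as a shell script (see the errors above), so I had to copy it to the cwd:
cp /Users/stevepodell/PycharmEnvironments/WeVoteServerPy3.7/bin/pdf2txt.py .
and then run it explicitly with python:
python pdf2txt.py -o outputSix.html 2020_lu399_primary_endorsement.pdf
This worked, with seemingly identical output to https://github.com/euske/pdfminer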
https://wevote.s3-us-west-1.amazonaws.com/outputSix.html
And I did not have to change any source to get it going
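For running the same conversion from the server code rather than by hand, here is a rough sketch that invokes pdf2txt.py through the current Python interpreter, so it does not depend on the shell being able to execute the script. The function names build_pdf2txt_command and convert are hypothetical; the only flag used is the -o option from the command above.

```python
import subprocess
import sys


def build_pdf2txt_command(script_path, pdf_path, output_path):
    """Build an argv list that runs pdf2txt.py through the current interpreter."""
    return [sys.executable, script_path, "-o", output_path, pdf_path]


def convert(script_path, pdf_path, output_path):
    """Run the conversion; raises CalledProcessError on a non-zero exit."""
    command = build_pdf2txt_command(script_path, pdf_path, output_path)
    subprocess.run(command, check=True)
```

For example, convert("pdf2txt.py", "2020_lu399_primary_endorsement.pdf", "outputSix.html") mirrors the manual command. Passing the script's full path inside the virtualenv's bin directory should make the copy to the cwd unnecessary, though I have not verified that here.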
PDFs are being converted and highlighted at this time.
For example
https://www.iuoe399.org/media/filer_public/45/77/457700c9-dd70-4cfc-be49-a81cb3fba0a6/2020_lu399_primary_endorsement.pdf
This issue requires the installation of pdfminer.six on the Python host:
https://github.com/pdfminer/pdfminer.six
Stars/forks 1.7k/1.4k, 65 issues, updated 6 days ago
Now handles Py 3