wevote / EndorsementExtension

We Vote Endorsements Chrome Extension

Handle endorsements in PDFs #70

Closed SailingSteve closed 1 year ago

SailingSteve commented 4 years ago

For example:

https://www.iuoe399.org/media/filer_public/45/77/457700c9-dd70-4cfc-be49-a81cb3fba0a6/2020_lu399_primary_endorsement.pdf

This issue requires installing pdfminer.six on the Python host: https://github.com/pdfminer/pdfminer.six (stars/forks 1.7k/1.4k, 65 issues, updated 6 days ago, now handles Python 3).

SailingSteve commented 4 years ago

(WeVoteServerPy3.7) stevepodell@Steves-MacBook-Pro-32GB-Oct-2018 pdfminer % pwd
/Users/stevepodell/PycharmProjects/pdfminer
(WeVoteServerPy3.7) stevepodell@Steves-MacBook-Pro-32GB-Oct-2018 pdfminer % pip install pdfminer
/Users/stevepodell/PycharmEnvironments/WeVoteServerPy3.7/bin/pdf2txt.py

https://www.idrsolutions.com/online-pdf-to-html-converter converts it perfectly, but it is a paid service: $400 for 500 pages per month.

Try https://github.com/HazyResearch/pdftotree. Load https://www.activestate.com/ to get access to python3-tk. Skipped because of an unmaintained warning.

SailingSteve commented 4 years ago

The first one I tried was pdfminer: https://pypi.org/project/pdfminer/ and https://github.com/euske/pdfminer (stars/forks 4.2k/1.4k, 177 issues, updated 2 months ago). Output: https://wevote.s3-us-west-1.amazonaws.com/output2.html

I did a pip uninstall in between, since the two packages seem intertwined.

The second one I tried (and the winner) was https://github.com/pdfminer/pdfminer.six (stars/forks 1.7k/1.4k, 65 issues, updated 6 days ago, now handles Python 3).
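The two projects are published as separate PyPI distributions (pdfminer and pdfminer.six), but both install a top-level package named pdfminer, which is why uninstalling one before trying the other is a sensible precaution. A minimal sketch of how one might check which of the two is present in the active environment follows. The helper names are hypothetical, and it assumes Python 3.8 or later for importlib.metadata (the virtualenv in this thread is Python 3.7, where the importlib-metadata backport would be needed instead).

```python
from importlib import metadata  # standard library from Python 3.8 onward

# Distribution names as published on PyPI.
CANDIDATES = ("pdfminer", "pdfminer.six")


def installed_pdfminer_distributions():
    """Map each candidate distribution name to its installed version, or None."""
    found = {}
    for name in CANDIDATES:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def has_conflict(found):
    """Return True when both distributions are installed at the same time."""
    return all(version is not None for version in found.values())
```

Calling has_conflict(installed_pdfminer_distributions()) before a test run would flag an environment where both are installed and the import could resolve to either one.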

SailingSteve commented 4 years ago

(WeVoteServerPy3.7) stevepodell@Steves-MacBook-Pro-32GB-Oct-2018 pdfminerSix % /Users/stevepodell/PycharmEnvironments/WeVoteServerPy3.7/bin/pdf2txt.py
/Users/stevepodell/PycharmEnvironments/WeVoteServerPy3.7/bin/pdf2txt.py: line 2: A command line tool for extracting text and images from PDF and output it to plain text, html, xml or tags.: command not found
/Users/stevepodell/PycharmEnvironments/WeVoteServerPy3.7/bin/pdf2txt.py: line 3: import: command not found
/Users/stevepodell/PycharmEnvironments/WeVoteServerPy3.7/bin/pdf2txt.py: line 4: import: command not found
/Users/stevepodell/PycharmEnvironments/WeVoteServerPy3.7/bin/pdf2txt.py: line 5: import: command not found
/Users/stevepodell/PycharmEnvironments/WeVoteServerPy3.7/bin/pdf2txt.py: line 7: import: command not found
/Users/stevepodell/PycharmEnvironments/WeVoteServerPy3.7/bin/pdf2txt.py: line 8: import: command not found
/Users/stevepodell/PycharmEnvironments/WeVoteServerPy3.7/bin/pdf2txt.py: line 12: syntax error near unexpected token `OUTPUT_TYPES'
/Users/stevepodell/PycharmEnvironments/WeVoteServerPy3.7/bin/pdf2txt.py: line 12: `OUTPUT_TYPES = ((".htm", "html"),'
(WeVoteServerPy3.7) stevepodell@Steves-MacBook-Pro-32GB-Oct-2018 pdfminerSix %
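The "import: command not found" messages indicate that the shell, not Python, was interpreting pdf2txt.py when the script was executed directly. A small sketch of how one might diagnose and sidestep this from Python follows. Both helper names are hypothetical and nothing here comes from pdfminer.six itself: the first checks whether a script starts with a Python shebang line, the second builds a command that hands the script to the current interpreter explicitly.

```python
import sys


def has_python_shebang(script_path):
    """Return True if the file's first line is a '#!' line that mentions python."""
    with open(script_path, "rb") as handle:
        first_line = handle.readline()
    return first_line.startswith(b"#!") and b"python" in first_line.lower()


def python_command(script_path, *args):
    """Build an argv list that runs the script under the current interpreter."""
    return [sys.executable, script_path, *args]
```

The list returned by python_command(...) can be passed to subprocess.run(..., check=True), which has the same effect as typing "python pdf2txt.py ..." in the activated virtualenv and does not depend on the script having a usable shebang.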

SailingSteve commented 4 years ago

For https://github.com/pdfminer/pdfminer.six I had to copy the script to the cwd:

cp /Users/stevepodell/PycharmEnvironments/WeVoteServerPy3.7/bin/pdf2txt.py .

and then run it with Python:

python pdf2txt.py -o outputSix.html 2020_lu399_primary_endorsement.pdf

This worked, with seemingly identical output to https://github.com/euske/pdfminer: https://wevote.s3-us-west-1.amazonaws.com/outputSix.html

I did not have to change any source to get it going.
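On a server, the same conversion could probably be done without copying pdf2txt.py around, by calling pdfminer.six's high-level API directly. The following is a sketch under stated assumptions, not what the We Vote server actually runs: convert_pdf and output_type_for are hypothetical names, the extension table is modeled on the OUTPUT_TYPES tuple glimpsed in the shell error earlier in this thread (only its first entry is visible there, the remaining entries are assumed), and default LAParams() layout analysis is assumed to match what the command-line run used.

```python
import os

# Extension-to-format table modeled on the OUTPUT_TYPES tuple in pdf2txt.py.
# Only the (".htm", "html") entry is visible in this thread; the rest are assumed.
OUTPUT_TYPES = {".htm": "html", ".html": "html", ".xml": "xml", ".tag": "tag"}


def output_type_for(out_path):
    """Pick a pdfminer output type from the output file's extension."""
    ext = os.path.splitext(out_path)[1].lower()
    return OUTPUT_TYPES.get(ext, "text")


def convert_pdf(pdf_path, out_path):
    """Convert pdf_path to out_path using pdfminer.six's high-level API."""
    # Imported lazily so this module still loads where pdfminer.six is absent.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    # The output file is opened in binary mode because the converter
    # encodes its output with the given codec before writing.
    with open(pdf_path, "rb") as pdf_file, open(out_path, "wb") as out_file:
        extract_text_to_fp(
            pdf_file,
            out_file,
            output_type=output_type_for(out_path),
            codec="utf-8",
            laparams=LAParams(),
        )
```

With pdfminer.six installed, convert_pdf("2020_lu399_primary_endorsement.pdf", "outputSix.html") would be the rough equivalent of the command line above. Whether the resulting HTML matches the pdf2txt.py output exactly has not been checked here.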

SailingSteve commented 1 year ago

PDFs are being converted and highlighted at this time.
