welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

reOCR motions #337

Closed MansMeg closed 7 months ago

MansMeg commented 1 year ago

We need to re-OCR the motions (motioner), since the current OCR from the Swedish National Library is of relatively poor quality.

What we need to do is:

fredrik1984 commented 1 year ago

Sounds good to me! Regarding downloading motions, you can also find all motions since 1867 on the Riksdag webpage: https://www.riksdagen.se/sv/sok/?avd=dokument&doktyp=mot

liamtabib commented 1 year ago

Sounds good

liamtabib commented 1 year ago

I have retrieved the PDF of each motion from 1867 to 1970. One problem is finding the remaining PDFs on the Riksdag website: some years have PDFs and some do not, and it is very inconsistent. We should ask someone there for help.

Another issue is that converting from PDF to JPG/PNG before applying Tesseract (which only takes images) may reduce the quality. I assume the original documents were scanned to JPG/PNG, so getting access to those images would be ideal.

ninpnin commented 1 year ago

AFAIK most scanned PDF files have the full original image in them, so scraping that from them should work fine. I have used the command line utility 'pdfimages' to do that.

Here's my old fish script for that:

# Pass the folder containing the PDFs as the first argument
set folder $argv[1]
for firstname in $folder/*
    echo "Convert $firstname"
    # Strip the directory and the .pdf extension to get the document name
    set plainname (basename $firstname .pdf)

    echo "($plainname)"
    mkdir -p images/$plainname
    # Extract the embedded page images; -j writes JPEG-encoded images as .jpg
    pdfimages -j $firstname images/$plainname/$plainname
    fish ocr.fish $plainname
    # Remove the extracted images once they have been OCRed
    rm -rf images/$plainname
end

and ocr.fish

set plainname $argv[1]
for filename in images/$plainname/*.*
    set plain_filename (string replace .jpg "" $filename)
    set plain_filename (string replace .pbm "" $plain_filename)
    set plain_filename (string replace images/$plainname/ "" $plain_filename)
    echo $filename
    echo $plain_filename
    mkdir -p altofiles/$plainname

    # XML output:
    #tesseract -l eng $filename altofiles/$plainname/$plain_filename -c tessedit_create_alto=1

    # TXT output
    tesseract -l eng $filename altofiles/$plainname/$plain_filename
end

The pdfimages line should be usable with bash etc. too.

ninpnin commented 1 year ago

@liamtabib I guess you can visually check whether 'pdfimages' yields good enough quality.

liamtabib commented 1 year ago

I will start running the OCR engine on the motions from 1867-1970 on a computer at KBLab. Where should we eventually store the files?

BobBorges commented 1 year ago

I think it would be best if they were somewhere we can all access them remotely (e.g. by rsync). Maybe the 'dolan' computer at Ekonomikum. I don't know whether it has enough space, but something like that would be a good option until we get the text into an XML schema and onto GitHub.

MansMeg commented 1 year ago

Yes, I agree with @BobBorges. The easiest is to create a new repo in the swerik project for these files. That's better than having them stored locally somewhere.

I suggest a subfolder called "alto", with each year stored under it.

Similarly, we should probably store the PDFs in a second repo.

Let's discuss this tomorrow.

liamtabib commented 1 year ago

The PDFs are 110 GB in total. That feels like too much for GitHub? @MansMeg @BobBorges

MansMeg commented 1 year ago

I don't think it is a problem in the short term. We just need to use Git LFS: https://docs.github.com/en/billing/managing-billing-for-git-large-file-storage/about-billing-for-git-large-file-storage

It will cost roughly $15 a month, so it is manageable for now. I think it would be good if @ninpnin were involved when adding this to git, since we need to get this right the first time.

MansMeg commented 1 year ago

We could also zip the pdfs before we upload them. That can save some space.

BobBorges commented 1 year ago

What's the size range for individual files? I know GH limits individual files (50MB), but I don't find any info about repository size limits.

MansMeg commented 1 year ago

They want repositories to be smaller (1-5 GB). For larger corpora they recommend Git Large File Storage (Git LFS), where the whole version history is only stored in the cloud. Then you pay $5 per 50 GB.
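For a rough idea of the monthly bill, here is a minimal sketch of the arithmetic. It assumes the pricing quoted in this thread ($5 per 50 GB data pack) and ignores any free quota; the function name `lfs_monthly_cost` is only illustrative.

```python
import math

PACK_SIZE_GB = 50    # size of one data pack, as quoted in this thread
PACK_PRICE_USD = 5   # monthly price of one data pack, as quoted in this thread


def lfs_monthly_cost(total_gb: float) -> int:
    """Monthly cost in USD for enough whole data packs to hold total_gb."""
    packs = math.ceil(total_gb / PACK_SIZE_GB)
    return packs * PACK_PRICE_USD


if __name__ == "__main__":
    # 110 GB of PDFs needs three 50 GB packs.
    print(lfs_monthly_cost(110))
```

With the 110 GB mentioned above this gives three packs, which matches the rough $15 a month estimate.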

liamtabib commented 1 year ago

A few individual PDFs exceed 50 MB.

MansMeg commented 1 year ago

That's no problem with Git LFS: https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-git-large-file-storage

Then the limit is 2 GB. If we have PDF files that large, we probably want to think twice about them.

Also, could you check how much we gain by compressing the PDF files using zip? I.e. we would store motion1.pdf.zip instead of each PDF file.
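One way to measure the gain per file is sketched below. It uses Python's standard `zipfile` module at the highest deflate level rather than the zip command line tool, so the numbers may differ slightly; the function name `zip_saving` is only illustrative.

```python
import zipfile
from pathlib import Path


def zip_saving(pdf_path: Path) -> float:
    """Zip one file on its own and return the fraction of bytes saved (0.15 = 15%)."""
    zip_path = pdf_path.with_name(pdf_path.name + ".zip")  # motion1.pdf -> motion1.pdf.zip
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED, compresslevel=9) as zf:
        # arcname keeps only the file name, so no parent folders end up in the archive
        zf.write(pdf_path, arcname=pdf_path.name)
    original = pdf_path.stat().st_size
    if original == 0:
        return 0.0
    return 1 - zip_path.stat().st_size / original
```

Scanned PDFs mostly contain already-compressed images, so a small saving would not be surprising.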

ninpnin commented 1 year ago

The OCR'd pages are mostly going to be static themselves, so git LFS or github releases are better options than storing them in the git history.

liamtabib commented 1 year ago

The zipped PDF files take approximately 20 GB.

MansMeg commented 1 year ago

That's great! Then what we should do is zip each PDF separately. I'll try to set up the repos asap.

Then I think it would be good if you, @liamtabib, could submit just one or two zipped PDF documents so we can see the structure, and @ninpnin, @BobBorges and I could take a look before you start to submit everything. We are also going to need @fredrik1984 to open up the wallet for at least one data pack at $5/month.

fredrik1984 commented 1 year ago

The wallet issue is OK. For how long do you think we will have it and pay for it? Roughly throughout the project? Just so I have a rough idea.

MansMeg commented 1 year ago

Until we move the corpus to a server at the Riksdagen Library.

fredrik1984 commented 1 year ago

Sounds good. Let me know when I should open up the project wallet.

liamtabib commented 1 year ago

Here are two examples of zipped PDFs, created using the zip command with the -9 flag.

mot_1867fk55.pdf.zip mot_1869ak5.pdf.zip

Should we remove the pdf from the name of the .zip file, i.e. should it be motion.zip?

MansMeg commented 1 year ago

Great!

I tried to unzip it, and it seems you have zipped it inside a folder. I don't remember the flag to undo that by heart.
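For the zip command line tool, the flag that drops the folder paths is `-j` ("junk paths"). The same idea can be sketched with Python's standard `zipfile` module by storing each file under its bare name; the function name `zip_flat` is only illustrative.

```python
import zipfile
from pathlib import Path


def zip_flat(src: Path, dest: Path) -> None:
    """Zip a single file so that it unpacks without any parent folder."""
    with zipfile.ZipFile(dest, "w", zipfile.ZIP_DEFLATED, compresslevel=9) as zf:
        # arcname=src.name stores only the file name, not the directories above it
        zf.write(src, arcname=src.name)


def has_folders(zip_path: Path) -> bool:
    """Return True if any archive member is stored under a folder."""
    with zipfile.ZipFile(zip_path) as zf:
        return any("/" in name for name in zf.namelist())
```

Running `has_folders` on a sample archive is a quick check before uploading.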

MansMeg commented 1 year ago

I have now set up two repos. @liamtabib, I think you need to read up on Git LFS (Git Large File Storage) and how to add files; it's a little special the first time. Then you can test uploading two zipped motions in PDF to the pdf repo to see if it works. Let us know when you are done and we can test that it works before you start to upload more.

liamtabib commented 1 year ago

Alright! Here is one file that unzips without the parent folder: mot_1868ak286.zip

I will upload samples to the new repo tomorrow. On Wednesday I will be at KBLab, where all the PDFs reside, and if the samples are approved I will push all the PDFs. I think the OCR engine will be done by Wednesday; should I also upload the ALTO XML files?

MansMeg commented 1 year ago

Great! It worked better.

Let's start with adding the PDFs. Then we can start to work with the ALTO files so we know that there are no mistakes in them. Once we are happy with the ALTO files, we will upload them.

ninpnin commented 1 year ago

Would it make more sense to add them as a release on this repo? https://github.com/swerik-project/riksdagen-motions/releases

EDIT: 20GB is no issue, we have releases of that magnitude on the riksdagen-corpus repo

MansMeg commented 1 year ago

I guess since the ALTO files and the PDF are static, it is good to have them as separate repos?

liamtabib commented 1 year ago

The sample files have been added as lfs files, check them out to see if they are correct

liamtabib commented 1 year ago

Separate repo may be advantageous as we will push batches of new files to the repository as they are delivered from Lars at Riksdagen. And rescans may be performed after the quality analysis.

liamtabib commented 1 year ago

I read the output of the zip command wrong (I read the deflation rate as the new size of the file), so the zipped files actually total close to 100 GB, not 20 GB.

BobBorges commented 1 year ago

I'm able to clone/pull/unzip/view the pdf files.

liamtabib commented 1 year ago

Great! Should we push the pdf files to the repository, even though they will take close to 100GB? The compression saves around 15% of size, so is it worth it to compress?
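To make the arithmetic explicit: zip reports the percentage of the size that was removed ("deflated"), not the size that remains. A minimal sketch, using the 110 GB and 15% figures from this thread; the function name `size_after_deflation` is only illustrative.

```python
def size_after_deflation(original_gb: float, deflated_percent: float) -> float:
    """Size left after zip reports the file as 'deflated N%'."""
    return original_gb * (1 - deflated_percent / 100)


if __name__ == "__main__":
    # 15% removed from 110 GB leaves roughly 93.5 GB, i.e. close to 100 GB.
    print(size_after_deflation(110, 15))
```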

MansMeg commented 1 year ago

Let me check as well.

MansMeg commented 1 year ago

Hi!

I checked and it worked for me as well. I think we can skip using zip if we only gain 15%. It is better to avoid the hassle of zipping the files.

Could you test with the same files, but as PDFs instead, and maybe add a couple more years (say 10 files)? Before you upload all files, Fredrik needs to pay for the data packs, so we should upload them in batches to be sure everything works as expected. Also, you need to change the .gitattributes file to handle all PDFs.

I think this should be sufficient (but I'm not 100% sure):

*.pdf filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
files/1868/*.zip filter=lfs diff=lfs merge=lfs -text
files/1869/*.zip filter=lfs diff=lfs merge=lfs -text
mot_1869__ak__8.zip filter=lfs diff=lfs merge=lfs -text
files/1868/mot_1868__ak__286.zip filter=lfs diff=lfs merge=lfs -text

fredrik1984 commented 12 months ago

I have now registered a monthly payment of $15 (150 GB). I think everything is set to upload the motions.

liamtabib commented 12 months ago

New test files have been uploaded

MansMeg commented 12 months ago

Excellent. I think this is looking great! Let's discuss more tomorrow, but I think we can start to add files now.

I would check that there are no extremely large files when you upload them. Maybe also upload them in batches of randomly sampled files.
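Both checks could be sketched roughly as follows. The 2 GB default is the Git LFS limit mentioned earlier in this thread; the function names, the batch size and the fixed seed are assumptions for illustration only.

```python
import random
from pathlib import Path

TWO_GB = 2 * 1024**3  # per-file limit mentioned earlier in this thread


def oversized(files, limit_bytes=TWO_GB):
    """Return the files whose size on disk exceeds limit_bytes."""
    return [f for f in files if Path(f).stat().st_size > limit_bytes]


def random_batches(files, batch_size, seed=0):
    """Shuffle the files reproducibly and split them into batches of batch_size."""
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]
```

Each batch could then be added, committed and pushed on its own, so a failure only affects one batch.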

liamtabib commented 12 months ago

I am struggling a bit with Git LFS, and I accidentally removed a few objects in my .git directory, which is causing problems. I want to restart the upload of all the files from the beginning. To do that I have to delete the files, which GitHub advises against, as they will still be on the remote server. Instead, GitHub advises deleting the repo and starting a new one:

After you remove files from Git LFS, the Git LFS objects still exist on the remote storage and will continue to count toward your Git LFS storage quota.

To remove Git LFS objects from a repository, delete and recreate the repository. When you delete a repository, any associated issues, stars, and forks are also deleted. For more information, see "Deleting a repository." If you need to purge a removed object and you are unable to delete the repository, please contact support for help.

Will you be able to do that, @MansMeg ?

MansMeg commented 12 months ago

Have you tried to remove your git repository locally and download it again?

MansMeg commented 12 months ago

I.e. restart the process from the latest commit.

liamtabib commented 12 months ago

I think it is possible but it would be easier to start from a clean slate

MansMeg commented 12 months ago

Not for me.

liamtabib commented 12 months ago

I have a new hard drive on its way, which I need for storing my local copy. I will try again then.

MansMeg commented 11 months ago

@liamtabib I have now updated the tasks above for the reOCR process. Please tick the boxes that are done now and comment if something is missing.

liamtabib commented 11 months ago

The next step is to push the ALTO files from the bicameral Riksdag period (tvåkammarriksdagen), along with the initial OCR estimation. This will be done this week.