Closed MansMeg closed 7 months ago
Sounds good to me! Regarding downloading motions, you can also find all motions since 1867 on the Riksdagen web page: https://www.riksdagen.se/sv/sok/?avd=dokument&doktyp=mot
Sounds good
I have retrieved the PDF of each motion from 1867 to 1970. One problem is finding the remaining PDFs on the Riksdagen website: some years have PDFs and some do not, and it is very inconsistent. We should ask someone there for help.
Another issue is that converting from PDF to JPG/PNG before applying Tesseract (which only accepts images) may degrade quality. I would assume the original documents were scanned to JPG/PNG, and getting access to those images would be ideal.
AFAIK most scanned PDF files have the full original image embedded in them, so extracting it from them should work fine. I have used the command-line utility 'pdfimages' to do that.
Here's my old fish script for that:
echo $argv
set folder $argv[1]
for firstname in $folder/*
    #set firstname $argv[1]
    echo "Convert $firstname"
    set plainname (string replace $folder "" $firstname)
    set plainname (string replace .pdf "" $plainname)
    echo "($plainname)"
    #rm -rf images/$plainname
    mkdir -p images/$plainname
    pdfimages -j $firstname images/$plainname/$plainname
    fish ocr.fish $plainname
    rm -rf images/$plainname
end
and ocr.fish
set plainname $argv[1]
for filename in images/$plainname/*.*
    set plain_filename (string replace .jpg "" $filename)
    set plain_filename (string replace .pbm "" $plain_filename)
    set plain_filename (string replace images/$plainname/ "" $plain_filename)
    echo $filename
    echo $plain_filename
    mkdir -p altofiles/$plainname
    # XML output:
    #tesseract -l eng $filename altofiles/$plainname/$plain_filename -c tessedit_create_alto=1
    # TXT output
    tesseract -l eng $filename altofiles/$plainname/$plain_filename
end
The pdfimages line should be usable with bash etc. too.
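Since fish isn't universally installed, here is a rough bash port of the two fish scripts above (a sketch, not the author's script: the folder layout mirrors the original, `-l eng` is kept from the original even though `-l swe` may suit Swedish motions better):

```shell
#!/usr/bin/env bash
# Hypothetical bash port of the fish pdfimages + tesseract loop above.
# Usage: pass the folder holding the PDFs as the first argument.
set -euo pipefail
shopt -s nullglob                 # empty folder: the loop simply does nothing

folder=${1:-motions}              # directory containing the scanned PDFs
for pdf in "$folder"/*.pdf; do
    name=$(basename "$pdf" .pdf)
    echo "Convert $name"
    mkdir -p "images/$name" "altofiles/$name"
    # Pull the embedded scan images out of the PDF (as JPEG where possible)
    pdfimages -j "$pdf" "images/$name/$name"
    for img in "images/$name"/*; do
        page=$(basename "${img%.*}")
        tesseract -l eng "$img" "altofiles/$name/$page"
    done
    rm -rf "images/$name"
done
```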
@liamtabib I guess you can visually check whether 'pdfimages' yields good enough quality.
I will start running the OCR engine on motioner 1867–1970 on a computer at KBLab. Where should we eventually store the files?
I think it would be best if they were in a place where we could all access them remotely (e.g. by rsync). Maybe the 'dolan' computer at Ekonomikum -- I don't know if there's necessarily space on it, but something like that would be a good option until we get the text into some XML schema and onto GitHub.
Yes, I agree with @BobBorges. The easiest is to create a new repo in the swerik project for these files. That's better than having them stored locally somewhere.
I suggest a subfolder called "alto", with each year stored under it.
Similarly, we should probably store the PDFs in a second repo.
Let's discuss this tomorrow.
The PDFs are 110 GB in size. That feels like too much for GitHub? @MansMeg @BobBorges
I don't think it is any problem short term. We just need to use Git LFS: https://docs.github.com/en/billing/managing-billing-for-git-large-file-storage/about-billing-for-git-large-file-storage
It will cost roughly $15 a month, so it is manageable for now. I think it would be good if @ninpnin were involved when adding this to git, since we need to get this right the first time.
We could also zip the PDFs before we upload them. That can save some space.
What's the size range for individual files? I know GitHub limits individual file size (50 MB), but I can't find any info about repository size limits.
They want them to be smaller (1–5 GB). For larger corpora they recommend Git Large File Storage (Git LFS), where the whole version history is stored only in the cloud. Then you pay $5 per 50 GB.
A few individual PDFs exceed 50 MB.
That's no problem with Git LFS: https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-git-large-file-storage
There the limit is 2 GB per file. If we have PDF files that large, we probably want to think twice about them.
Also, could you check how much we gain by compressing the PDF files using zip? I.e. we would store motion1.pdf.zip instead of each PDF file.
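One rough way to check the gain would be to zip a sample and compare total sizes; a sketch, assuming the PDFs sit in a `motions/` folder (the path is illustrative):

```shell
#!/usr/bin/env bash
set -euo pipefail
shopt -s nullglob

# Zip each PDF individually with maximum compression (-9), junking the
# directory path (-j) so each archive holds only the file itself.
files=(motions/*.pdf)
for pdf in "${files[@]}"; do
    zip -9 -j "$pdf.zip" "$pdf"
done

# Compare total sizes before and after compression.
if ((${#files[@]})); then
    du -ch "${files[@]}" | tail -n 1
    du -ch "${files[@]/%/.zip}" | tail -n 1
fi
```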
The OCR'd pages are mostly going to be static themselves, so Git LFS or GitHub releases are better options than storing them in the git history.
The zipped PDF files take approximately 20 GB.
That's great! Then what we should do is zip each PDF separately. I'll try to set up the repos ASAP.
Then I think it would be good if you, @liamtabib, could submit just one or two zipped PDF documents so we see the structure, and me, @ninpnin and @BobBorges could take a look at it before you start to submit all of them. We are also going to need @fredrik1984 to open up the wallet for at least one data package at $5/month.
The wallet issue is OK – for how long do you think we will have and pay for it? Roughly throughout the project? Just so I have a rough idea.
Until we move the corpus to a server at the Riksdagen Library.
Sounds good. Let me know when I should open up the project wallet.
Here are two examples of zipped PDFs, made with the zip command and the -9 flag:
mot_1867fk55.pdf.zip mot_1869ak5.pdf.zip
Should we remove "pdf" from the name of the .zip file? I.e. should it be motion.zip?
Great!
I tried to unzip it and it seems like you have zipped it inside a folder. I don't remember the flag to avoid that by heart.
I have now set up two repos. I think you need to read up on Git LFS (Git Large File Storage), @liamtabib, on how to add files; it's a little special the first time. Then you can test uploading two zipped motions in PDF to the PDF repo to see if it works. Let us know when you are done and we can test that it works before you start to upload more.
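For reference, the first-time LFS setup usually looks roughly like this (a sketch; repo path and file names are illustrative, and `git lfs track "*.zip"` is what would normally append the pattern to .gitattributes, written explicitly here so the format is visible):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical local checkout of the new PDF repo.
mkdir -p /tmp/motions-repo && cd /tmp/motions-repo
git init -q .

# Equivalent of: git lfs track "*.zip"
echo '*.zip filter=lfs diff=lfs merge=lfs -text' >> .gitattributes
git add .gitattributes

# Then, with git-lfs installed:
#   git lfs install
#   git add mot_1867fk55.pdf.zip
#   git commit -m "Add sample zipped motion" && git push
```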
Alright! Here is one file that does not unzip into a parent folder: mot_1868ak286.zip
I will upload samples tomorrow to the new repo. On Wednesday I will be at KBLab, where all the PDFs reside, and if the samples are approved I will push all the PDFs. I think the OCR engine will be done by Wednesday; should I also upload the ALTO XML files?
Great! It worked better.
Let's start by adding the PDFs. Then we can work with the ALTO files to make sure there are no mistakes in them. Once we are happy with the ALTO files, we will upload them.
Would it make more sense to add them as a release on this repo? https://github.com/swerik-project/riksdagen-motions/releases
EDIT: 20GB is no issue, we have releases of that magnitude on the riksdagen-corpus repo
I guess since the ALTO files and the PDF are static, it is good to have them as separate repos?
The sample files have been added as lfs files, check them out to see if they are correct
Separate repo may be advantageous as we will push batches of new files to the repository as they are delivered from Lars at Riksdagen. And rescans may be performed after the quality analysis.
I read the output of the zip command wrong (I read the deflation rate as the new file size); the compressed files therefore actually total close to 100 GB.
I'm able to clone/pull/unzip/view the pdf files.
Great! Should we still push the PDF files to the repository, even though they will take close to 100 GB? The compression only saves around 15% of the size, so is it worth it to compress?
Let me check as well.
Hi!
I checked and it worked for me as well. I think we can skip using zip if we only gain 15%; then it is better to avoid the hassle of zipping the files.
Could you test with the same files but as PDF instead, and maybe add a couple more years (say 10 files)? Before you upload all files, Fredrik needs to pay for the data packs, so we should upload them in batches to be sure everything works as expected. Also, you need to change the git attributes file to handle all PDFs.
I think this should be sufficient (but I'm not 100% sure):
*.pdf filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
files/1868/*.zip filter=lfs diff=lfs merge=lfs -text
files/1869/*.zip filter=lfs diff=lfs merge=lfs -text
mot_1869__ak__8.zip filter=lfs diff=lfs merge=lfs -text
files/1868/mot_1868__ak__286.zip filter=lfs diff=lfs merge=lfs -text
I have now registered a monthly payment of $15 (150 GB). I think everything is set to upload the motions.
New test files have been uploaded
Excellent. I think this is looking great! Let's discuss more tomorrow, but I think we can start to add files now.
I would check that there are no extremely large files when you upload them. Maybe also upload them in batches of random samples of files.
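A quick way to spot oversized files before pushing (2 GB being the per-file LFS limit mentioned earlier in the thread):

```shell
#!/usr/bin/env bash
set -euo pipefail
# List any PDFs larger than 2 GB under the current tree; prints nothing
# if all files are within the limit.
find . -name '*.pdf' -size +2G -exec ls -lh {} +
```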
I struggle a bit with using Git LFS, so I accidentally removed a few objects in my .git directory, which is causing problems. I want to restart the upload of all the files from the beginning, and to do that I have to delete the files, which GitHub advises against, as they will still be on the remote server. Instead, GitHub advises deleting the repo and creating a new one:
After you remove files from Git LFS, the Git LFS objects still exist on the remote storage and will continue to count toward your Git LFS storage quota.
To remove Git LFS objects from a repository, delete and recreate the repository. When you delete a repository, any associated issues, stars, and forks are also deleted. For more information, see "Deleting a repository." If you need to purge a removed object and you are unable to delete the repository, please contact support for help.
Will you be able to do that, @MansMeg ?
Have you tried to remove your git repository locally and download it again?
I.e. restart the process from the latest commit.
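Deleting the broken local checkout and cloning fresh is usually enough to recover; a minimal local demonstration of the idea (the throwaway repo stands in for the real swerik-project remote):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Create a throwaway 'remote' and clone it fresh, standing in for
# deleting the broken checkout and re-cloning from GitHub.
rm -rf /tmp/demo-remote /tmp/demo-clone
git init -q /tmp/demo-remote
git -C /tmp/demo-remote -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "initial"

# With the real repo one would run, e.g.:
#   GIT_LFS_SKIP_SMUDGE=1 git clone <repo-url>
# which skips downloading the large LFS objects during the clone;
# they can be fetched later with 'git lfs pull'.
git clone -q /tmp/demo-remote /tmp/demo-clone
```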
I think it is possible but it would be easier to start from a clean slate
Not for me.
I have a new hard drive on its way that I need to store my local copy on; I will try again then.
@liamtabib I have now updated the tasks above for the reOCR process. Please tick the boxes that are done now and comment if something is missing.
The next step is to push the ALTO files from tvåkammarriksdagen, and the initial OCR quality estimation. This will be done this week.
We need to reOCR the motioner, since they are currently OCRed by the Swedish National Library with relatively poor quality.
What we need to do is: