yigitkonur / swift-ocr-llm-powered-pdf-to-markdown

An open-source OCR API that leverages OpenAI's powerful language models with optimized performance techniques like parallel processing and batching to deliver high-quality text extraction from complex PDF documents. Ideal for businesses seeking efficient document digitization and data extraction solutions.

Extraction succeeded but missing a lot of pages #8

Closed douglasqian closed 1 month ago

douglasqian commented 1 month ago

Context

Setup

I followed the README to set up the server, started it, and successfully called the API locally from another terminal window. Here are the logs; everything appears to work:

INFO:     Started server process [35129]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
2024-09-24 15:51:32,180 - main - INFO - Read uploaded PDF file, size: 1701282 bytes.
2024-09-24 15:51:32,181 - main - INFO - Saved PDF to temporary file /var/folders/0f/5gf_y5cj1_vd_4bjqccn5j3r0000gn/T/tmpiqykyiun.pdf.
2024-09-24 15:51:32,188 - main - INFO - PDF loaded with 164 pages.
2024-09-24 15:51:36,880 - main - INFO - Converted PDF to 164 images using PyMuPDF.
2024-09-24 15:51:37,022 - main - INFO - Encoded 164 images to base64 data URLs.
2024-09-24 15:51:37,023 - main - INFO - Divided images into 17 batches of up to 10 images each.
2024-09-24 15:51:37,023 - main - INFO - Sending OCR request to OpenAI with 10 images.
2024-09-24 15:51:37,047 - main - INFO - Sending OCR request to OpenAI with 10 images.
2024-09-24 15:51:37,068 - main - INFO - Sending OCR request to OpenAI with 10 images.
2024-09-24 15:51:37,089 - main - INFO - Sending OCR request to OpenAI with 10 images.
2024-09-24 15:51:37,111 - main - INFO - Sending OCR request to OpenAI with 10 images.
2024-09-24 15:51:37,132 - main - INFO - Sending OCR request to OpenAI with 10 images.
2024-09-24 15:51:37,152 - main - INFO - Sending OCR request to OpenAI with 10 images.
2024-09-24 15:51:37,174 - main - INFO - Sending OCR request to OpenAI with 10 images.
2024-09-24 15:51:37,195 - main - INFO - Sending OCR request to OpenAI with 10 images.
2024-09-24 15:51:37,215 - main - INFO - Sending OCR request to OpenAI with 10 images.
2024-09-24 15:51:37,236 - main - INFO - Sending OCR request to OpenAI with 10 images.
2024-09-24 15:51:37,257 - main - INFO - Sending OCR request to OpenAI with 10 images.
2024-09-24 15:51:37,278 - main - INFO - Sending OCR request to OpenAI with 10 images.
2024-09-24 15:51:37,299 - main - INFO - Sending OCR request to OpenAI with 10 images.
2024-09-24 15:51:37,319 - main - INFO - Sending OCR request to OpenAI with 10 images.
2024-09-24 15:51:37,340 - main - INFO - Sending OCR request to OpenAI with 10 images.
2024-09-24 15:51:37,361 - main - INFO - Sending OCR request to OpenAI with 4 images.
2024-09-24 15:51:52,066 - httpx - INFO - HTTP Request: POST https://streamline-openai.openai.azure.com//openai/deployments/gpt4o/chat/completions?api-version=2023-06-01-preview "HTTP/1.1 200 OK"
2024-09-24 15:51:52,072 - main - INFO - Extracted text length: 3200 characters.
2024-09-24 15:51:53,070 - httpx - INFO - HTTP Request: POST https://streamline-openai.openai.azure.com//openai/deployments/gpt4o/chat/completions?api-version=2023-06-01-preview "HTTP/1.1 200 OK"
2024-09-24 15:51:53,072 - main - INFO - Extracted text length: 3999 characters.
2024-09-24 15:51:53,806 - httpx - INFO - HTTP Request: POST https://streamline-openai.openai.azure.com//openai/deployments/gpt4o/chat/completions?api-version=2023-06-01-preview "HTTP/1.1 200 OK"
2024-09-24 15:51:53,808 - main - INFO - Extracted text length: 3856 characters.
2024-09-24 15:51:54,007 - httpx - INFO - HTTP Request: POST https://streamline-openai.openai.azure.com//openai/deployments/gpt4o/chat/completions?api-version=2023-06-01-preview "HTTP/1.1 200 OK"
2024-09-24 15:51:54,008 - httpx - INFO - HTTP Request: POST https://streamline-openai.openai.azure.com//openai/deployments/gpt4o/chat/completions?api-version=2023-06-01-preview "HTTP/1.1 200 OK"
2024-09-24 15:51:54,009 - main - INFO - Extracted text length: 4463 characters.
2024-09-24 15:51:54,010 - main - INFO - Extracted text length: 3870 characters.
2024-09-24 15:51:54,151 - httpx - INFO - HTTP Request: POST https://streamline-openai.openai.azure.com//openai/deployments/gpt4o/chat/completions?api-version=2023-06-01-preview "HTTP/1.1 200 OK"
2024-09-24 15:51:54,153 - main - INFO - Extracted text length: 4070 characters.
2024-09-24 15:51:54,317 - httpx - INFO - HTTP Request: POST https://streamline-openai.openai.azure.com//openai/deployments/gpt4o/chat/completions?api-version=2023-06-01-preview "HTTP/1.1 200 OK"
2024-09-24 15:51:54,319 - main - INFO - Extracted text length: 3904 characters.
2024-09-24 15:51:54,628 - httpx - INFO - HTTP Request: POST https://streamline-openai.openai.azure.com//openai/deployments/gpt4o/chat/completions?api-version=2023-06-01-preview "HTTP/1.1 200 OK"
2024-09-24 15:51:54,630 - main - INFO - Extracted text length: 8172 characters.
2024-09-24 15:51:55,262 - httpx - INFO - HTTP Request: POST https://streamline-openai.openai.azure.com//openai/deployments/gpt4o/chat/completions?api-version=2023-06-01-preview "HTTP/1.1 200 OK"
2024-09-24 15:51:55,264 - main - INFO - Extracted text length: 4273 characters.
2024-09-24 15:51:55,465 - httpx - INFO - HTTP Request: POST https://streamline-openai.openai.azure.com//openai/deployments/gpt4o/chat/completions?api-version=2023-06-01-preview "HTTP/1.1 200 OK"
2024-09-24 15:51:55,467 - main - INFO - Extracted text length: 4538 characters.
2024-09-24 15:51:56,571 - httpx - INFO - HTTP Request: POST https://streamline-openai.openai.azure.com//openai/deployments/gpt4o/chat/completions?api-version=2023-06-01-preview "HTTP/1.1 200 OK"
2024-09-24 15:51:56,573 - main - INFO - Extracted text length: 4125 characters.
2024-09-24 15:51:56,680 - httpx - INFO - HTTP Request: POST https://streamline-openai.openai.azure.com//openai/deployments/gpt4o/chat/completions?api-version=2023-06-01-preview "HTTP/1.1 200 OK"
2024-09-24 15:51:56,682 - main - INFO - Extracted text length: 3804 characters.
2024-09-24 15:51:57,388 - httpx - INFO - HTTP Request: POST https://streamline-openai.openai.azure.com//openai/deployments/gpt4o/chat/completions?api-version=2023-06-01-preview "HTTP/1.1 200 OK"
2024-09-24 15:51:57,391 - main - INFO - Extracted text length: 4442 characters.
2024-09-24 15:51:57,449 - httpx - INFO - HTTP Request: POST https://streamline-openai.openai.azure.com//openai/deployments/gpt4o/chat/completions?api-version=2023-06-01-preview "HTTP/1.1 200 OK"
2024-09-24 15:51:57,451 - main - INFO - Extracted text length: 3915 characters.
2024-09-24 15:51:57,799 - httpx - INFO - HTTP Request: POST https://streamline-openai.openai.azure.com//openai/deployments/gpt4o/chat/completions?api-version=2023-06-01-preview "HTTP/1.1 200 OK"
2024-09-24 15:51:57,801 - main - INFO - Extracted text length: 3769 characters.
2024-09-24 15:51:57,801 - httpx - INFO - HTTP Request: POST https://streamline-openai.openai.azure.com//openai/deployments/gpt4o/chat/completions?api-version=2023-06-01-preview "HTTP/1.1 200 OK"
2024-09-24 15:51:57,803 - main - INFO - Extracted text length: 3924 characters.
2024-09-24 15:52:09,012 - httpx - INFO - HTTP Request: POST https://streamline-openai.openai.azure.com//openai/deployments/gpt4o/chat/completions?api-version=2023-06-01-preview "HTTP/1.1 200 OK"
2024-09-24 15:52:09,014 - main - INFO - Extracted text length: 3869 characters.
2024-09-24 15:52:09,015 - main - INFO - Total extracted text length: 72225 characters.
2024-09-24 15:52:09,015 - main - INFO - Deleted temporary PDF file /var/folders/0f/5gf_y5cj1_vd_4bjqccn5j3r0000gn/T/tmpiqykyiun.pdf.
INFO:     127.0.0.1:52750 - "POST /ocr HTTP/1.1" 200 OK

Results

But the actual output from SwiftOCR is missing much of the document. For reference, here is the source PDF: FundOpp_DE-FOA-0003294_Amd_000003.pdf

Here is the result from PyMuPDF4LLM: pymupdf4llm.md

Here is the result from SwiftOCR: swift_ocr.md

The last merged snippet comes from page 163 of 164, so the output does reach the end of the document. However, a significant number of pages in the middle appear to have been dropped.
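One way to check for dropped pages from the logs alone is to compare how many pages were sent with how many characters came back. The sketch below is a hypothetical helper, not part of SwiftOCR; it assumes the log line wording shown above ("Sending OCR request to OpenAI with N images" and "Extracted text length: N characters").

```python
import re

# Patterns copied from the log format above. The second one is case-sensitive
# on purpose so it skips the "Total extracted text length" summary line.
SENT = re.compile(r"Sending OCR request to OpenAI with (\d+) images")
EXTRACTED = re.compile(r"INFO - Extracted text length: (\d+) characters")


def summarize_ocr_log(log_text: str) -> dict:
    """Summarize pages sent versus characters returned in a SwiftOCR-style log."""
    pages = sum(int(n) for n in SENT.findall(log_text))
    lengths = [int(n) for n in EXTRACTED.findall(log_text)]
    chars = sum(lengths)
    return {
        "pages_sent": pages,
        "batches_returned": len(lengths),
        "chars_extracted": chars,
        "chars_per_page": chars / pages if pages else 0.0,
    }


# Two batches taken from the log above, used here as a small sample.
SAMPLE_LOG = """\
2024-09-24 15:51:37,340 - main - INFO - Sending OCR request to OpenAI with 10 images.
2024-09-24 15:51:37,361 - main - INFO - Sending OCR request to OpenAI with 4 images.
2024-09-24 15:51:52,072 - main - INFO - Extracted text length: 3200 characters.
2024-09-24 15:52:09,014 - main - INFO - Extracted text length: 3869 characters.
2024-09-24 15:52:09,015 - main - INFO - Total extracted text length: 72225 characters.
"""

if __name__ == "__main__":
    print(summarize_ocr_log(SAMPLE_LOG))
```

The full log above reports 72225 characters for 164 pages, about 440 characters per page. Most 10-page batches also return a similar amount of text (roughly 3800 to 4500 characters), which would be consistent with each response being cut off at a fixed output limit.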

Am I missing something?

douglasqian commented 1 month ago

Update: I figured this out. The problem was the configuration in the env file.

The key is to size the batch according to the max_output_tokens setting used in the LLM call, because every page in a batch shares that single output budget. The repo defaults to 800, which was too low for my document, so I raised it to 4000 and used these settings:

BATCH_SIZE=3  # Optional: Default is 1
MAX_CONCURRENT_OCR_REQUESTS=5  # Optional: Default is 5
MAX_CONCURRENT_PDF_CONVERSION=4  # Optional: Default is 4

It would probably help to mention this in the README.
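The relationship between batch size and output budget can be sketched as simple arithmetic. This is an illustration only: the function names are hypothetical, the figure of about 4 characters per token is a rough rule of thumb rather than a measured value, and the 3000 characters per page used in the example is an assumed page density, not a number from this issue.

```python
# Rough heuristic, an assumption rather than a measured value:
# about 4 characters of English text per output token.
CHARS_PER_TOKEN = 4


def tokens_per_page(max_output_tokens: int, batch_size: int) -> float:
    """Output tokens available to each page when a batch shares one response."""
    return max_output_tokens / batch_size


def largest_safe_batch(max_output_tokens: int, est_chars_per_page: int) -> int:
    """Largest batch whose estimated output still fits in the token budget."""
    est_tokens_per_page = max(1, est_chars_per_page // CHARS_PER_TOKEN)
    # Never return less than 1: a single page is the smallest possible batch.
    return max(1, max_output_tokens // est_tokens_per_page)


if __name__ == "__main__":
    # 800 tokens shared by a 10-page batch, as in the logs above.
    print(tokens_per_page(800, 10))        # 80.0
    # 4000 tokens shared by a 3-page batch, as in the settings above.
    print(tokens_per_page(4000, 3))
    # Assumed density of 3000 characters per page (hypothetical).
    print(largest_safe_batch(4000, 3000))  # 5
```

With 800 tokens spread over 10 pages, each page gets about 80 tokens, which is far less than a dense page of text needs. Raising the limit to 4000 and shrinking the batch to 3 gives each page over 1300 tokens.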

yigitkonur commented 1 month ago

Great observation, I will fix the max_output_tokens limit!
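Truncation of this kind can also be detected directly instead of inferred from the output. The Chat Completions API sets finish_reason to "length" when a response stops at the token limit. The sketch below is a minimal illustration, assuming a response object that exposes choices[i].finish_reason the way the openai Python SDK does; the helper name is hypothetical and this is not the repo's code.

```python
from types import SimpleNamespace


def truncated_choices(response) -> list[int]:
    """Return the indexes of choices that stopped because the token limit was hit."""
    return [
        i
        for i, choice in enumerate(response.choices)
        if choice.finish_reason == "length"
    ]


if __name__ == "__main__":
    # Stand-in for a real API response, built only for this demonstration.
    fake_response = SimpleNamespace(
        choices=[
            SimpleNamespace(finish_reason="stop"),
            SimpleNamespace(finish_reason="length"),
        ]
    )
    for index in truncated_choices(fake_response):
        print(f"choice {index} was cut off at the output token limit")
```

Logging a warning whenever this list is non-empty would make a too-small limit visible immediately, even though the HTTP request itself still returns 200 OK.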

yigitkonur commented 1 month ago

Thank you for letting me know about it! https://github.com/yigitkonur/swift-ocr-llm-powered-pdf-to-markdown/commit/98566d91f68cf7de42eb8710e35253a374376f5b