Open grahama1970 opened 2 months ago
Hello,
As mentioned in the documentation, when processing PDFs, all pages are converted to images at a DPI of 200. The table coordinates returned by the library correspond to this image.
When using PyMuPDF, the coordinates returned are the ones corresponding to the PDF page mediabox, which is expressed in points (72 per inch).
Here is an example of how I am handling the conversion between those two sets of coordinates.
Hope it helps.
It does. Thank you :)
from img2table.document import PDF
from img2table.ocr import TesseractOCR

tesseract_ocr = TesseractOCR(n_threads=1, lang="eng")
pdf_path = '/path/to/pdf'
pdf = PDF(src=pdf_path)
extracted_tables = pdf.extract_tables(ocr=tesseract_ocr,
                                      implicit_rows=False,
                                      borderless_tables=False,
                                      min_confidence=50)

target_dpi = 72     # PDF points per inch (PyMuPDF coordinates)
original_dpi = 200  # DPI at which img2table renders PDF pages

for page, tables in extracted_tables.items():
    for idx, table in enumerate(tables):
        print(page, idx)
        # Bounding box in pixels of the 200 DPI page image
        original_bbox_dict = {"x1": table.bbox.x1, "y1": table.bbox.y1, "x2": table.bbox.x2, "y2": table.bbox.y2}
        # Same box rescaled to PDF points
        pymupdf_bbox_dict = {
            "x1": (table.bbox.x1 * target_dpi) / original_dpi,
            "y1": (table.bbox.y1 * target_dpi) / original_dpi,
            "x2": (table.bbox.x2 * target_dpi) / original_dpi,
            "y2": (table.bbox.y2 * target_dpi) / original_dpi
        }
        print(f'original_bbox_dict: {original_bbox_dict}')
        print(f'pymupdf_bbox_dict: {pymupdf_bbox_dict}')
Result (Accurate):
pymupdf_bbox_dict: {'x1': 72.36, 'y1': 72.36, 'x2': 541.08, 'y2': 478.08}
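For reuse, the same scaling can be wrapped in a small helper that works in both directions. This is only a sketch: the function names are hypothetical (they are not part of img2table or PyMuPDF), and it assumes an unrotated page with a top-left origin, rendered at the 200 DPI mentioned in the documentation.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]

IMAGE_DPI = 200  # assumed render DPI of the page image
PDF_DPI = 72     # PDF points per inch


def image_bbox_to_pdf_points(bbox: BBox,
                             image_dpi: float = IMAGE_DPI,
                             pdf_dpi: float = PDF_DPI) -> BBox:
    """Scale an (x1, y1, x2, y2) box from image pixels to PDF points."""
    scale = pdf_dpi / image_dpi
    x1, y1, x2, y2 = bbox
    return (x1 * scale, y1 * scale, x2 * scale, y2 * scale)


def pdf_points_to_image_bbox(bbox: BBox,
                             image_dpi: float = IMAGE_DPI,
                             pdf_dpi: float = PDF_DPI) -> BBox:
    """Inverse conversion: scale a box from PDF points to image pixels."""
    scale = image_dpi / pdf_dpi
    x1, y1, x2, y2 = bbox
    return (x1 * scale, y1 * scale, x2 * scale, y2 * scale)


points = image_bbox_to_pdf_points((201, 201, 1503, 1328))
print(tuple(round(v, 2) for v in points))  # (72.36, 72.36, 541.08, 478.08)
```

The printed values match the pymupdf_bbox_dict result shown above.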
Hi. I'm trying to get some kind of bounding box alignment between the PDF (text extraction) method below and PyMuPDF's bounding boxes. The Img2TableImage module's bounding box is reasonably accurate and can be correlated to PyMuPDF's bounding box. The PDF bounding box is off. Is this a known issue, or is there a work-around?
PyMuPDF bounding box: (72.0375, 72.0625, 540.4875, 561.0)
Image2Table bounding box (PDF module): (201, 201, 1503, 1328)
Much appreciation in advance
Extra for debugging:
Image2Table using the PDF (text extraction) module.
Extracted Image2Table table is:
bbox = (201, 201, 1503, 1328)
PyMuPDF:
Extracted table with PyMuPDF is:
bbox = (72.0375, 72.0625, 540.4875, 561.0)
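To see where the two boxes disagree, one option is to scale the img2table box to points and compare it edge by edge with the PyMuPDF box. The sketch below uses the two bounding boxes posted above; the helper name is hypothetical and the 72/200 scale assumes the 200 DPI rendering.

```python
def edge_differences(image_bbox, pdf_bbox, image_dpi=200, pdf_dpi=72):
    """Return (scaled image edge - PDF edge) in points for x1, y1, x2, y2."""
    scale = pdf_dpi / image_dpi
    names = ("x1", "y1", "x2", "y2")
    return {name: img * scale - pdf
            for name, img, pdf in zip(names, image_bbox, pdf_bbox)}


img2table_bbox = (201, 201, 1503, 1328)             # pixels, from img2table
pymupdf_bbox = (72.0375, 72.0625, 540.4875, 561.0)  # points, from PyMuPDF

diffs = edge_differences(img2table_bbox, pymupdf_bbox)
for name, delta in diffs.items():
    print(f"{name}: {delta:+.2f} pt")
```

With these numbers, the left, top, and right edges agree to within about a point, while the bottom edge (y2) differs by roughly 83 points. That suggests the mismatch is in the detected table height rather than in the DPI scaling.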