xavctn / img2table

img2table is a table identification and extraction Python Library for PDF and images, based on OpenCV image processing
MIT License

PDF table.box is inaccurate? #218

Open grahama1970 opened 2 days ago

grahama1970 commented 2 days ago

Hi. I'm trying to align the bounding boxes returned by the PDF (text extraction) method below with PyMuPDF's bounding boxes. The bounding box from img2table's Image module is reasonably accurate and can be correlated with PyMuPDF's bounding box, but the one from the PDF module is off. Is this a known issue, or is there a workaround?

PyMuPDF bounding box: (72.0375, 72.0625, 540.4875, 561.0)
img2table bounding box (PDF module): (201, 201, 1503, 1328)

Much appreciation in advance

Extra for debugging:

Image2Table using the PDF (text extraction) module.

# Extract tables
extracted_tables = pdf.extract_tables(ocr=tesseract_ocr,
                                      implicit_rows=False,
                                      borderless_tables=False,
                                      min_confidence=50)

extracted_tables

Extracted Image2Table table is: bbox = (201, 201, 1503, 1328)

PyMuPDF:

doc = fitz.open(pdf_path)
for page_num in range(1, len(doc)):
    tabs = doc[page_num].find_tables()  # detect the tables

    # print(page_num, tabs)
    print(doc[page_num].rect.height)
    for i, tab in enumerate(tabs):  # iterate over all tables
        for cell in tab.header.cells:
            doc[page_num].draw_rect(cell,color=fitz.pdfcolor["red"],width=0.3)
        print(f"  Table bbox: {tab.bbox}")
        doc[page_num].draw_rect(tab.bbox,color=fitz.pdfcolor["green"])
        print(f"Table {i} column names: {tab.header.names}, external: {tab.header.external}")

Extracted table with PyMuPDF is: bbox = (72.0375, 72.0625, 540.4875, 561.0)

xavctn commented 2 days ago

Hello,

As mentioned in the documentation, when processing PDFs, all pages are converted to images using a DPI of 200. The table coordinates returned by the library are expressed in pixels of that image.

When using PyMuPDF, the coordinates returned are the ones corresponding to the PDF page mediabox, which is measured in points (72 per inch).
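In other words, the two coordinate sets should differ only by a scale factor of 72 / 200 = 0.36. A minimal sketch of that arithmetic follows. The helper name `pixels_to_points` is hypothetical and not part of img2table, and the sketch assumes an unrotated page whose origin is at (0, 0):

```python
# Hypothetical helper, not part of img2table or PyMuPDF.
# Assumes an unrotated page whose origin is at (0, 0).
IMAGE_DPI = 200  # DPI used to render PDF pages, as stated above
PDF_DPI = 72     # PDF points per inch


def pixels_to_points(bbox, image_dpi=IMAGE_DPI):
    """Scale an (x1, y1, x2, y2) bbox from image pixels to PDF points."""
    scale = PDF_DPI / image_dpi  # 72 / 200 = 0.36
    return tuple(round(value * scale, 4) for value in bbox)


# The bbox reported in the question, in 200 DPI image pixels.
print(pixels_to_points((201, 201, 1503, 1328)))
```

With the bbox from the question, the left edge 201 px becomes 201 × 0.36 = 72.36 pt, which is close to the 72.0375 reported by PyMuPDF.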

Here is an example of how I am handling the relationship/conversion between those 2 sets of coordinates.

Hope it helps.

grahama1970 commented 2 days ago

It does. Thank you :)

from img2table.document import PDF
from img2table.ocr import TesseractOCR
tesseract_ocr = TesseractOCR(n_threads=1, lang="eng")
pdf_path = '/path/to/pdf'

pdf = PDF(src=pdf_path)

extracted_tables = pdf.extract_tables(ocr=tesseract_ocr,
                                      implicit_rows=False,
                                      borderless_tables=False,
                                      min_confidence=50)

target_dpi = 72
original_dpi = 200
for page, tables in extracted_tables.items():
    for idx, table in enumerate(tables):
        print(page, idx)

        original_bbox_dict = {"x1": table.bbox.x1, "y1": table.bbox.y1, "x2": table.bbox.x2, "y2": table.bbox.y2}

        pymupdf_bbox_dict = {
            "x1": (table.bbox.x1 * target_dpi) / original_dpi,
            "y1": (table.bbox.y1 * target_dpi) / original_dpi,
            "x2": (table.bbox.x2 * target_dpi) / original_dpi,
            "y2": (table.bbox.y2 * target_dpi) / original_dpi
        }

        print(f'original_bbox_dict: {original_bbox_dict}')
        print(f'pymupdf_bbox_dict: {pymupdf_bbox_dict}')

Result (Accurate):

pymupdf_bbox_dict: {'x1': 72.36, 'y1': 72.36, 'x2': 541.08, 'y2': 478.08}
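
For the opposite direction (for example, to draw a PyMuPDF rectangle on the 200 DPI page image), the same ratio can be inverted. This is only a sketch: `points_to_pixels` is a hypothetical name, and it assumes an unrotated page with its origin at (0, 0).

```python
# Hypothetical helper, not part of img2table or PyMuPDF.
# Assumes an unrotated page whose origin is at (0, 0).
def points_to_pixels(bbox, image_dpi=200, pdf_dpi=72):
    """Scale an (x1, y1, x2, y2) bbox from PDF points to image pixels."""
    scale = image_dpi / pdf_dpi  # 200 / 72, about 2.78
    # Pixel coordinates are integers, so round to the nearest pixel.
    return tuple(round(value * scale) for value in bbox)


# Round trip of the converted bbox shown above.
print(points_to_pixels((72.36, 72.36, 541.08, 478.08)))
```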