xavctn / img2table

img2table is a table identification and extraction Python Library for PDF and images, based on OpenCV image processing
MIT License
571 stars, 76 forks

Performance warnings by `polars` when extracting tables #211

Closed huyfififi closed 2 months ago

huyfififi commented 3 months ago

Polars reports performance warnings when extracting tables:

from img2table.document import Image

# Load the image with rotation detection disabled, then extract its tables
image = Image("test.png", detect_rotation=False)
result = image.extract_tables()
print(f"{result=}")
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/img2table/tables/processing/bordered_tables/cells/identification.py:17: PerformanceWarning: Determining the column names of a LazyFrame requires resolving its schema, which is a potentially expensive operation. Use `LazyFrame.collect_schema().names()` to get the column names without this warning.
  .rename({col: f"{col}_" for col in df_h_lines.columns})
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/img2table/tables/processing/bordered_tables/cells/deduplication.py:21: PerformanceWarning: Determining the column names of a LazyFrame requires resolving its schema, which is a potentially expensive operation. Use `LazyFrame.collect_schema().names()` to get the column names without this warning.
  .rename({col: f"{col}_" for col in df_cells.columns})
result=[ExtractedTable(title=None, bbox=(36, 21, 770, 327),shape=(6, 3)), ExtractedTable(title=None, bbox=(962, 21, 1154, 123),shape=(2, 2))]

I confirmed that collecting the column names beforehand removes the warning:

+    # Collect the schema and get column names
+    column_names = df_cells.collect_schema().names()
+
     # Create copy of df_cells
     df_cells_cp = (df_cells.clone()
-                   .rename({col: f"{col}_" for col in df_cells.columns})
+                   .rename({col: f"{col}_" for col in column_names})
                    )

The warning messages make it hard to see other stdout/stderr output. I doubt they point to a real performance problem here, but it seems worth following the suggestion from polars to silence them and to be safe.
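For anyone who cannot change the library code, a possible stopgap is to filter out only this warning with the standard `warnings` module, matching on the start of the message text. The sketch below is a minimal illustration, not part of img2table or polars: `PerformanceWarning` and `noisy_extract` are hypothetical stand-ins defined here so that the example runs without either library installed.

```python
import warnings


class PerformanceWarning(Warning):
    """Hypothetical stand-in for the polars warning class (assumption)."""


def noisy_extract():
    # Hypothetical stand-in for image.extract_tables(); it emits a warning
    # whose text starts like the one shown in the report above.
    warnings.warn(
        "Determining the column names of a LazyFrame requires resolving "
        "its schema, which is a potentially expensive operation.",
        PerformanceWarning,
    )
    return ["table"]


with warnings.catch_warnings(record=True) as caught:
    # Record every warning that is not explicitly ignored below.
    warnings.simplefilter("always")
    # Added last, so it is checked first: ignore warnings whose message
    # starts with this text (the pattern is matched from the beginning).
    warnings.filterwarnings(
        "ignore",
        message="Determining the column names of a LazyFrame",
    )
    result = noisy_extract()
    # An unrelated warning is still reported.
    warnings.warn("an unrelated warning", UserWarning)

print(result)
print([str(w.message) for w in caught])
```

In a real script, the single `warnings.filterwarnings("ignore", message=...)` call placed before `image.extract_tables()` would presumably be enough. Matching on the message rather than ignoring all warnings keeps other diagnostics visible.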

I'm new to the OSS world and would appreciate your guidance.

huyfififi commented 2 months ago

I overlooked that there are already pull requests for this: #203 and #205.

huyfififi commented 2 months ago

Fixed in #213 and released with https://github.com/xavctn/img2table/releases/tag/1.3.0