xavctn / img2table

img2table is a table identification and extraction Python Library for PDF and images, based on OpenCV image processing
MIT License

tables>processing>bordered_tables>cells>identification.py: identify_cells raises ZeroDivisionError: division by zero #220

Open hbh112233abc opened 2 months ago

hbh112233abc commented 2 months ago

Test data:

    import numpy as np

    h_lines_arr = np.array(
        [
            [250, 707, 2302, 707],
            [250, 825, 2302, 825],
            [250, 954, 2302, 954],
            [250, 1066, 1977, 1066],
            [1977, 1066, 2302, 1066],
            [250, 1192, 1703, 1192],
            [1977, 1192, 2302, 1192],
            [250, 1268, 1703, 1268],
            [1977, 1268, 2302, 1268],
            [250, 1346, 1703, 1346],
            [1977, 1346, 2302, 1346],
            [250, 1423, 1703, 1423],
            [1977, 1423, 2302, 1423],
            [250, 1500, 1703, 1500],
            [1977, 1500, 2302, 1500],
            [250, 1770, 1703, 1770],
            [1977, 1770, 2302, 1770],
            [250, 2118, 1703, 2118],
            [1977, 2118, 2302, 2118],
            [250, 2301, 1703, 2301],
            [1977, 2301, 2302, 2301],
            [250, 2401, 1703, 2401],
            [1977, 2401, 2302, 2401],
            [250, 2498, 1703, 2498],
            [1703, 2498, 1703, 2498],
            [1977, 2498, 2302, 2498],
            [250, 2601, 981, 2601],
            [1977, 2601, 2302, 2601],
            [366, 2736, 981, 2736],
            [1977, 2736, 2302, 2736],
            [366, 2872, 981, 2872],
            [1977, 2872, 2302, 2872],
            [366, 3007, 2302, 3007],
            [366, 3143, 2302, 3143],
            [250, 3278, 2302, 3278],
            [250, 2040, 2302, 2040],
            [250, 2194, 2302, 2194],
        ],
        np.int64,
    )
    v_lines_array = np.array(
        [
            [250, 707, 250, 3278],
            [366, 1066, 366, 2118],
            [366, 2601, 366, 3278],
            [523, 707, 523, 1066],
            [981, 1066, 981, 3278],
            [1222, 825, 1222, 1066],
            [1300, 1066, 1300, 2118],
            [1434, 707, 1434, 1066],
            [1703, 707, 1703, 2498],
            [1977, 825, 1977, 3007],
            [2302, 707, 2302, 3278],
        ],
        np.int64,
    )

If I remove its wrapper @njit("int64[:,:](int64[:,:],int64[:,:])", cache=True, fastmath=True), the exception is not raised.

MathieuSeraphim commented 1 month ago

Can confirm.

The problem occurs due to lines 30 and 31:

l_corresponds = -0.02 <= (x1i - x1j) / (x2i - x1i) <= 0.02
r_corresponds = -0.02 <= (x2i - x2j) / (x2i - x1i) <= 0.02

with, at line 22:

x1i, y1i, x2i, y2i = h_lines_arr[i][:]

Basically, if the x coordinates of a horizontal line match (i.e. the line is 1 pixel wide), then x2i - x1i is 0 and both expressions divide by 0. In the example above, for instance, h_lines_arr contains the following line: [1703, 2498, 1703, 2498]. Commenting out the @njit(...) wrapper at line 11 just turns the ZeroDivisionError into a RuntimeWarning on my end.
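The failing expression can be reproduced in isolation with plain Python integers. This is only a minimal sketch: relative_offsets is a hypothetical helper (not part of img2table) that mirrors the two ratios quoted from lines 30 and 31, and the two input lines are taken from the test data.

```python
def relative_offsets(line_i, line_j):
    """Return the two ratios that are compared against the +/-0.02 tolerance.

    Hypothetical helper for illustration only; it mirrors the quoted
    expressions and is not the library's actual code.
    """
    x1i, y1i, x2i, y2i = line_i
    x1j, y1j, x2j, y2j = line_j
    width = x2i - x1i  # zero when both endpoints share the same x coordinate
    return (x1i - x1j) / width, (x2i - x2j) / width


degenerate = [1703, 2498, 1703, 2498]  # the zero-length line from the test data
neighbour = [250, 2498, 1703, 2498]    # another line from the test data

try:
    relative_offsets(degenerate, neighbour)
    raised = False
except ZeroDivisionError:
    raised = True

print(raised)
```

With plain Python ints the division raises ZeroDivisionError. With NumPy int64 scalars and no @njit wrapper, the behaviour reported above is a RuntimeWarning instead.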

Ideally, one-pixel-wide horizontal lines (i.e. points) shouldn't have been identified as horizontal lines in the first place. A quick fix would be to add this before line 30:

if x1i == x2i:
    continue
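Until the function itself is patched, the same check could probably be applied on the caller's side by dropping zero-length segments before they reach identify_cells. A minimal sketch, assuming lines are stored as [x1, y1, x2, y2] rows as in the test data; drop_degenerate_h_lines is a hypothetical name, not an img2table function:

```python
def drop_degenerate_h_lines(h_lines):
    """Keep only horizontal lines whose endpoints have different x coordinates.

    Hypothetical helper for illustration; assumes [x1, y1, x2, y2] rows.
    """
    return [line for line in h_lines if line[0] != line[2]]


h_lines = [
    [250, 2498, 1703, 2498],
    [1703, 2498, 1703, 2498],  # zero-length segment from the test data
    [1977, 2498, 2302, 2498],
]

filtered = drop_degenerate_h_lines(h_lines)
print(filtered)
```

For a NumPy array such as h_lines_arr, the equivalent filter would be a boolean mask: h_lines_arr[h_lines_arr[:, 0] != h_lines_arr[:, 2]].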