xavctn / img2table

img2table is a Python library for table identification and extraction from PDF files and images, based on OpenCV image processing
MIT License

Expose image processing parameters #74

Closed: thokari closed this issue 1 year ago

thokari commented 1 year ago

In some files, notably in the tables/processing package, there are magic numbers. Maybe we could work on naming them and exposing them as parameters?

xavctn commented 1 year ago

Basically, as most OpenCV functions take arguments in pixels, it is almost impossible to have a "generic" processing pipeline that works with hard-coded parameters.

That is why, if you look at the tables/metrics.py file, a connected-component analysis is run on every image in order to compute two reference metrics.

After that, almost all parameters used in OpenCV functions are expressed as a ratio of one of these two metrics, so that the processing steps adapt to each image's characteristics.
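As a rough illustration of this idea (the function name and the 0.5 ratio below are hypothetical, not the library's actual code), a pixel-valued parameter can be derived from a measured metric such as the median character length:

```python
# Hypothetical sketch: deriving a pixel parameter from an image metric
# so that the same ratio works across resolutions. Names are illustrative,
# not img2table's actual API.

def derived_kernel_size(char_length: float, ratio: float = 0.5) -> int:
    """Scale an OpenCV kernel size from the median character length,
    so one ratio adapts to images of different resolutions."""
    return max(1, round(ratio * char_length))

# A high-resolution scan with ~20 px characters and a low-resolution scan
# with ~5 px characters get proportionally sized kernels:
print(derived_kernel_size(20))  # 10
print(derived_kernel_size(5))   # 2
```

The point is that the user never supplies a pixel value; only the ratios are fixed, and the pixel values follow from the measured metrics.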

This lets me get rid of user-facing parameters and have something generic that works decently on any image out of the box. The aim of the library is to keep it super simple to use, and I would rather not add complexity by introducing CV parameters because:

  1. In order to properly comprehend what each parameter does, you need at least a decent understanding of what the algorithm does, and I have not written any documentation about what is implemented.
  2. As I derive all parameters from the 2 metrics, I do not have the possibility to directly expose multiple parameters to the user without a major code base overhaul.

That is why I do not really want to add the option to tweak processing parameters and I would rather improve the accuracy of the current algorithm.

IMO, most people will never tweak any parameter anyway...

TLDR: The library is designed to "adapt" itself to any image and I do not want to add complexity for users.

thokari commented 1 year ago

I agree. But consider a table where columns are detected where there are none, possibly because of handwritten content or letters that happen to be well aligned.

So it would be nice to have a single parameter between 0 and 1 representing a "line confidence cutoff": setting it to a low value would include all lines, while setting it to a high value would only include "important" lines. This could be calculated from how much of the table a line spans. If a line does not cross the whole table, it should get lower confidence, be ignored according to the cutoff, and its neighboring cells merged.
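A minimal sketch of the cutoff idea, assuming each detected line already carries a precomputed span fraction (both the field name and the function are hypothetical, not part of img2table):

```python
# Hypothetical sketch of a "line confidence cutoff": keep only lines whose
# span covers at least `cutoff` of the table width. The `span_fraction`
# field is an assumed precomputed value, not img2table's data model.

def filter_lines(lines: list[dict], cutoff: float) -> list[dict]:
    """Drop lines whose span fraction falls below the cutoff."""
    return [line for line in lines if line["span_fraction"] >= cutoff]

lines = [
    {"id": 1, "span_fraction": 1.0},  # spans the whole table
    {"id": 2, "span_fraction": 0.4},  # partial line, likely spurious
]
print(filter_lines(lines, cutoff=0.8))  # only the full-span line survives
```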

Alternatively, if image processing parameters were exposed, specifically the "rho" and "theta" parameters of this function, this might also be helpful for cases like this. https://github.com/xavctn/img2table/blob/053c05aa9c3d7d3309527aef92aef68470631ca0/src/img2table/tables/processing/bordered_tables/lines.py#L253

The first option would also be possible to do after extracting tables, but it requires recomputing a lot of information. Consider this example document: [image attached]

xavctn commented 1 year ago

So it would be nice to have a single parameter, between 0 and 1, which represents "line confidence cutoff", where setting it to a low value would include all lines, and setting it to a high value would only include "important" lines.

That might be an option. I could possibly create a kind of "empiric" score taking into account several factors: line span, whether the line crosses a word, etc.
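As a rough sketch of what such an empiric score could combine (every name and the 0.5 penalty are illustrative assumptions, not the library's implementation):

```python
# Hypothetical "empiric" line confidence score: the span fraction of the
# table the line covers, penalized when the line crosses detected text.
# All names and the penalty factor are assumptions for illustration.

def line_confidence(line_x1: float, line_x2: float,
                    table_x1: float, table_x2: float,
                    crosses_word: bool) -> float:
    """Score in [0, 1]: fraction of the table width spanned by the line,
    halved if the line crosses a detected word."""
    span = max(0.0, min(line_x2, table_x2) - max(line_x1, table_x1))
    width = table_x2 - table_x1
    score = span / width if width > 0 else 0.0
    if crosses_word:
        score *= 0.5  # arbitrary penalty for crossing text
    return score

# A full-width line not crossing text scores 1.0;
# a half-width line crossing a word scores 0.25:
print(line_confidence(0, 100, 0, 100, crosses_word=False))
print(line_confidence(0, 50, 0, 100, crosses_word=True))
```

A user-facing cutoff in [0, 1] would then simply discard lines scoring below it.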

Alternatively, if image processing parameters were exposed, specifically the "rho" and "theta" parameters of this function, this might also be helpful for cases like this.

Those parameters do not have a great impact. The threshold value is much more important for line confidence / line detection in the Hough transform.
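To illustrate why the vote threshold dominates: in a Hough transform, every edge pixel votes for the candidate lines passing through it, and only lines collecting at least `threshold` votes are kept. A toy version restricted to horizontal lines (pure Python, deliberately not OpenCV's implementation):

```python
# Toy Hough transform restricted to horizontal lines (theta fixed at 90°):
# each edge point votes for its row, and rows with at least `threshold`
# votes are reported as lines. This mirrors the role of the `threshold`
# argument in OpenCV's Hough functions: it decides which accumulated
# candidates survive, whereas rho/theta only set accumulator resolution.

def hough_horizontal(points: list[tuple[int, int]], threshold: int) -> list[int]:
    """Return the y-coordinates of horizontal lines with enough votes."""
    votes: dict[int, int] = {}
    for _x, y in points:
        votes[y] = votes.get(y, 0) + 1
    return sorted(y for y, count in votes.items() if count >= threshold)

# A 50 px segment at y=10 and a 5 px fragment at y=20:
pts = [(x, 10) for x in range(50)] + [(x, 20) for x in range(5)]
print(hough_horizontal(pts, threshold=25))  # [10] — the short fragment is dropped
```

In OpenCV's `cv2.HoughLinesP(image, rho, theta, threshold, ...)`, `rho` and `theta` set the resolution of the accumulator grid, while `threshold` is the minimum vote count a line needs to be detected, which is why it has far more impact on which lines are kept.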

I will take a look at implementing a first version of a "line confidence score" and check whether it is relevant when I have time.

thokari commented 1 year ago

Removing lines that cross words is a good idea. By the way, is there an option in your library to provide the OCR text externally, as an alternative to ocr=None?

We already have OCR in our pipeline, so if possible we would like to re-use the result.

xavctn commented 1 year ago

Natively, no, but it can be done.

Basically, you will have to map your OCR results into a polars dataframe and create an OCRDataframe instance. You will find examples of this kind of implementation in the files for the supported OCR solutions in the ocr section of the library.

After that, a "pseudo" implementation would look like this:

# Import paths as of recent versions of the library (check your installed version)
from img2table.document import Image
from img2table.ocr.data import OCRDataframe

# Mapping of your raw OCR results to a polars dataframe
polars_ocr_df = map_ocr(raw_ocr)

# Create an OCRDataframe instance
ocr_df = OCRDataframe(df=polars_ocr_df)

# Create a document instance: Image or PDF
doc = Image(src="..")
doc.ocr_df = ocr_df

tables = doc.extract_tables(ocr=None, ...)

With this, it will skip the OCR step and use your provided OCR results to build the extracted tables.
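For completeness, a hypothetical `map_ocr` could look like the sketch below. The column names are an assumption on my part, not a confirmed schema; the authoritative mapping is in the OCR wrappers bundled in img2table's ocr package, so check those before relying on this.

```python
# Hypothetical sketch of the `map_ocr` step from the snippet above.
# It builds plain row dicts that could then be turned into a polars
# dataframe with `pl.DataFrame(rows)`. The column names below are
# assumptions, not img2table's confirmed OCRDataframe schema.

def map_ocr_rows(raw_ocr: list[dict]) -> list[dict]:
    """Convert OCR words, each {"text", "conf", "bbox": (x1, y1, x2, y2)},
    into flat row dicts with one row per recognized word."""
    rows = []
    for idx, word in enumerate(raw_ocr):
        x1, y1, x2, y2 = word["bbox"]
        rows.append({
            "page": 0,                 # single-page example
            "id": f"word_{idx}",       # assumed identifier column
            "value": word["text"],
            "confidence": word["conf"],
            "x1": x1, "y1": y1, "x2": x2, "y2": y2,
        })
    return rows

rows = map_ocr_rows([{"text": "Total", "conf": 95, "bbox": (10, 5, 60, 20)}])
print(rows[0]["value"])  # Total
```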