Basically, as most OpenCV functions require some arguments in pixels, it is almost impossible to have "generic" processing that works with hard-coded parameters.
That is why, if you look at the tables/metrics.py file, a connected component analysis is run on every image in order to compute 2 metrics:
char_length: the average character length in the image
median_line_sep: the median distance between each text line
After that, almost all parameters used in OpenCV functions are expressed as a ratio of one of these two metrics in order to adapt the processing steps to each image's characteristics.
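To illustrate the idea, here is a simplified sketch (the metric computation and the names/values below are purely illustrative, not the library's actual code):

import cv2
import numpy as np

# Estimate the average character width with a connected component analysis
img = cv2.imread("table.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
n_labels, _, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
widths = [stats[i, cv2.CC_STAT_WIDTH] for i in range(1, n_labels)
          if 2 < stats[i, cv2.CC_STAT_WIDTH] < 100]  # crude character-size filter
char_length = float(np.mean(widths)) if widths else 10.0

# Pixel parameters are then expressed as ratios of the metric instead of
# hard-coded values, so they scale with the resolution of each image
kernel_width = max(int(round(2 * char_length)), 1)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_width, 1))
horizontal_mask = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)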
This enables me to get rid of parameters and have something generic that works decently on any image out of the box. The aim of the library is to keep it super simple to use, and I would rather not add complexity by introducing CV parameters.
That is why I do not really want to add the option to tweak processing parameters and I would rather improve the accuracy of the current algorithm.
IMO, most people will never tweak any parameter anyway...
TLDR: The library is designed to "adapt" itself to any image and I do not want to add complexity for users.
I agree. But consider a table that has columns detected where there are none, possibly because of handwritten content or well-aligned letters.
So it would be nice to have a single parameter, between 0 and 1, representing a "line confidence cutoff": setting it to a low value would include all lines, while setting it to a high value would only include "important" lines. This could be calculated from how much of the table a line spans. If a line does not cross the whole table, it should have lower confidence, be ignored according to the cutoff, and its neighboring cells should be joined.
Alternatively, if image processing parameters were exposed, specifically the "rho" and "theta" parameters of this function, this might also be helpful for cases like this. https://github.com/xavctn/img2table/blob/053c05aa9c3d7d3309527aef92aef68470631ca0/src/img2table/tables/processing/bordered_tables/lines.py#L253
The first option could also be done after extracting tables, but it requires recomputing a lot of information. Consider this example document:
So it would be nice to have a single parameter, between 0 and 1, which represents "line confidence cutoff", where setting it to a low value would include all lines, and setting it to a high value would only include "important" lines.
That might be an option. I could possibly create a kind of "empirical" scoring taking into account several factors: line span, whether the line crosses a word or not...
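As a very rough sketch of what such a score could look like (the names, weights and cutoff handling below are just illustrative, nothing is decided):

from dataclasses import dataclass

@dataclass
class Line:
    x1: float
    y1: float
    x2: float
    y2: float

def line_confidence(line: Line, table_width: float, crossed_words: int) -> float:
    # Empirical score: fraction of the table width spanned by the line,
    # penalized when the line crosses words (weights are arbitrary here)
    span_ratio = min(abs(line.x2 - line.x1) / table_width, 1.0)
    return max(span_ratio - 0.2 * crossed_words, 0.0)

def filter_lines(lines, table_width, crossed_words_per_line, cutoff=0.5):
    # Keep only lines whose confidence reaches the user-provided cutoff;
    # cells separated by discarded lines would then be merged
    return [line for line, crossed in zip(lines, crossed_words_per_line)
            if line_confidence(line, table_width, crossed) >= cutoff]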
Alternatively, if image processing parameters were exposed, specifically the "rho" and "theta" parameters of this function, this might also be helpful for cases like this.
Those parameters do not have a great impact. The threshold value is much more important for line confidence / line detection in the Hough transform.
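For reference, these are the knobs in question in OpenCV's probabilistic Hough transform (the values below are only illustrative, not the ones used in the linked function):

import cv2
import numpy as np

img = cv2.imread("table.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, 50, 150)

# rho/theta control the resolution of the accumulator grid; threshold is the
# minimum number of votes a candidate needs, so raising it keeps only the
# most prominent lines
lines = cv2.HoughLinesP(
    edges,
    rho=1,                # distance resolution in pixels
    theta=np.pi / 180,    # angle resolution in radians
    threshold=100,        # vote threshold: the main lever for line detection
    minLineLength=50,
    maxLineGap=10,
)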
I will take a look at implementing a first version of a "line confidence score" and check whether it is relevant when I have time.
Removing lines that are crossing words is a good idea.
By the way, is there an option in your library to provide the OCR text externally, as an alternative to ocr=None?
We already have OCR in our pipeline, so if possible we would like to re-use the result.
Natively, no, but it can be done.
Basically, you will have to map your OCR results into a polars dataframe and create an OCRDataframe instance. You will find examples of this implementation if you check files related to the supported OCR solutions in the ocr section of the library.
After that, a "pseudo" implementation would look like this:
from img2table.document import Image
from img2table.ocr.data import OCRDataframe  # import path may vary between versions

# Mapping of your raw OCR output to a polars DataFrame
polars_ocr_df = map_ocr(raw_ocr)

# Create an OCRDataframe instance
ocr_df = OCRDataframe(df=polars_ocr_df)

# Create a document instance: Image or PDF
doc = Image(src="..")
doc.ocr_df = ocr_df

# Extract tables; ocr=None skips the built-in OCR step
tables = doc.extract_tables(ocr=None)  # plus any other extract_tables arguments (...)
With this, the OCR part will be skipped and your provided OCR results will be used to build the resulting tables.
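For the mapping itself, a rough sketch could look like this (the column layout below is only an assumption modelled on the built-in OCR wrappers; check the ocr module for the exact schema expected by OCRDataframe):

import polars as pl

def map_ocr(raw_ocr):
    # Assumption: raw_ocr is a list of word-level results such as
    # {"text": "foo", "conf": 93, "x1": 10, "y1": 20, "x2": 50, "y2": 35}
    # The columns below mimic an hOCR-style layout and may need to be
    # adjusted to the schema actually used by the library
    rows = [
        {
            "page": 0,
            "class": "ocrx_word",
            "id": f"word_{i}",
            "parent": None,
            "value": word["text"],
            "confidence": word["conf"],
            "x1": word["x1"],
            "y1": word["y1"],
            "x2": word["x2"],
            "y2": word["y2"],
        }
        for i, word in enumerate(raw_ocr)
    ]
    return pl.DataFrame(rows)

The resulting DataFrame is then wrapped in OCRDataframe and attached to the document as shown above.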
In some files, notably in the tables/processing package, there are magic numbers. Maybe we could work on naming them and exposing them as parameters?
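For example, something along these lines (a generic illustration, not referring to a specific value in the code):

# Before: unexplained ratio buried in a processing function
# kernel_width = int(2 * char_length)

# After: the ratio gets a name and a default that could be exposed as a parameter
DEFAULT_KERNEL_CHAR_RATIO = 2.0

def kernel_width(char_length: float, ratio: float = DEFAULT_KERNEL_CHAR_RATIO) -> int:
    # Callers that need to tune the processing for unusual documents can
    # override the ratio instead of relying on a hidden magic number
    return max(int(round(ratio * char_length)), 1)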