img2table is a simple, easy-to-use Python library for table identification and extraction, based on OpenCV image processing. It supports most common image file formats as well as PDF files.
Thanks to its design, it provides a practical and lighter alternative to neural-network-based solutions, especially for usage on CPU.
The library can be installed via pip:
- pip install img2table : Standard installation, supporting Tesseract
- pip install img2table[paddle] : For usage with Paddle OCR
- pip install img2table[easyocr] : For usage with EasyOCR
- pip install img2table[gcp] : For usage with Google Vision OCR
- pip install img2table[aws] : For usage with AWS Textract OCR
- pip install img2table[azure] : For usage with Azure Cognitive Services OCR
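The choice of extra can be scripted. The sketch below is a hypothetical helper (it is not part of img2table) that maps an OCR backend to the matching pip command from the list above. It quotes the requirement because some shells, such as zsh, interpret unquoted square brackets as a glob pattern.

```python
import shlex

# Extras named in the installation list; "tesseract" is the standard install.
EXTRAS = {
    "tesseract": None,
    "paddle": "paddle",
    "easyocr": "easyocr",
    "gcp": "gcp",
    "aws": "aws",
    "azure": "azure",
}


def install_command(backend: str) -> str:
    """Return the pip command for a backend (hypothetical helper)."""
    extra = EXTRAS[backend]
    requirement = "img2table" if extra is None else f"img2table[{extra}]"
    # shlex.quote only adds quotes when the string contains shell-sensitive characters.
    return f"pip install {shlex.quote(requirement)}"


print(install_command("tesseract"))  # pip install img2table
print(install_command("paddle"))     # pip install 'img2table[paddle]'
```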
Images are loaded using the opencv-python library. Supported formats are listed below.
- Windows bitmaps - *.bmp, *.dib
- JPEG files - *.jpeg, *.jpg, *.jpe
- JPEG 2000 files - *.jp2
- Portable Network Graphics - *.png
- WebP - *.webp
- Portable image format - *.pbm, *.pgm, *.ppm, *.pxm, *.pnm
- PFM files - *.pfm
- Sun rasters - *.sr, *.ras
- TIFF files - *.tiff, *.tif
- OpenEXR Image files - *.exr
- Radiance HDR - *.hdr, *.pic
- Raster and Vector geospatial data supported by GDAL
Source: OpenCV documentation, Image file reading and writing.
Multi-page images are not supported.
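As a quick pre-check before loading a file, the extensions listed above can be collected into a set. This is a minimal sketch with a hypothetical helper name. It only inspects the file name, so it says nothing about whether a given OpenCV build can actually decode the file, and it leaves out the GDAL-backed formats because the list does not enumerate them.

```python
from pathlib import Path

# Extensions copied from the list of formats above.
SUPPORTED_EXTENSIONS = {
    ".bmp", ".dib", ".jpeg", ".jpg", ".jpe", ".jp2", ".png", ".webp",
    ".pbm", ".pgm", ".ppm", ".pxm", ".pnm", ".pfm", ".sr", ".ras",
    ".tiff", ".tif", ".exr", ".hdr", ".pic",
}


def has_supported_extension(path) -> bool:
    """Check the file suffix, case-insensitively, against the listed formats."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


print(has_supported_extension("scan.PNG"))    # True
print(has_supported_extension("report.pdf"))  # False, PDFs go through the PDF class
```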
Both native and scanned PDF files are supported.
Images are instantiated as follows:
from img2table.document import Image

image = Image(src,
              detect_rotation=False)
Parameters
- src : str, pathlib.Path, bytes or io.BytesIO, required - Image source
- detect_rotation : bool, optional, default False - Detect and correct skew/rotation of the image
The method implemented to handle skewed/rotated images supports skew angles up to 45° and is based on the publication by Huang, 2020.
When the detect_rotation parameter is set to True, image coordinates and bounding boxes returned by other methods might not correspond to the original image.
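To see why coordinates can stop matching the original image, consider what a rotation does to a single point. The sketch below is an illustration only: it assumes a pure rotation about the image center with an unchanged canvas size, which is not necessarily how img2table applies its correction, and rotate_point is a hypothetical helper.

```python
import math


def rotate_point(x, y, width, height, angle_deg):
    """Rotate (x, y) about the image center by angle_deg (illustrative assumption)."""
    cx, cy = width / 2, height / 2
    theta = math.radians(angle_deg)
    dx, dy = x - cx, y - cy
    new_x = cx + dx * math.cos(theta) - dy * math.sin(theta)
    new_y = cy + dx * math.sin(theta) + dy * math.cos(theta)
    return new_x, new_y


# A corner of a bounding box in a 400x200 image, before and after a 5 degree correction.
original = (350, 40)
corrected = rotate_point(*original, width=400, height=200, angle_deg=5)
print(original, "->", tuple(round(v, 1) for v in corrected))
```

Even a small angle moves points far from the center by several pixels, so boxes computed on the corrected image should not be drawn directly on the original one.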
PDF files are instantiated as follows:
from img2table.document import PDF

pdf = PDF(src,
          pages=[0, 2],
          detect_rotation=False,
          pdf_text_extraction=True)
Parameters
- src : str, pathlib.Path, bytes or io.BytesIO, required - PDF source
- pages : list, optional, default None - List of PDF page indexes to be processed. If None, all pages are processed
- detect_rotation : bool, optional, default False - Detect and correct skew/rotation of extracted images from the PDF
- pdf_text_extraction : bool, optional, default True - Extract text from the PDF file for native PDFs
PDF pages are converted to images at 200 DPI for table identification.
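For a sense of the resulting image sizes: the default PDF unit is 1/72 inch, so a length in points scales by 200/72 when rendered at 200 DPI. The sketch below works this out for two common page sizes. The rounding is an assumption, and the exact pixel dimensions produced by the library may differ slightly.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit is 1/72 inch
DPI = 200             # rendering resolution stated above


def points_to_pixels(points: float, dpi: int = DPI) -> int:
    """Convert a length in PDF points to pixels (rounded, as an assumption)."""
    return round(points * dpi / POINTS_PER_INCH)


a4 = (595, 842)      # A4 page size in points
letter = (612, 792)  # US Letter page size in points
print(tuple(points_to_pixels(v) for v in a4))      # (1653, 2339)
print(tuple(points_to_pixels(v) for v in letter))  # (1700, 2200)
```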
img2table provides an interface for several OCR services and tools in order to parse table content.
If possible (i.e. for native PDFs), PDF text will be extracted directly from the file and the OCR service/tool will not be called.
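OCR parameters and their default values:
- Tesseract : n_threads (default 1), lang (default "eng"), psm (default 11, run tesseract --help-psm for details), tessdata_dir (default None, in which case the TESSDATA_PREFIX env variable is used)
- Paddle OCR : lang (default "en"), kw (default None)
- EasyOCR : lang (default ["en"]), kw (default None, keyword arguments passed to the Reader constructor)
- docTR : detect_language (default False), kw (default None, keyword arguments passed to the ocr_predictor method)
- Google Vision OCR : api_key (default None), timeout (default 15)
- AWS Textract OCR : aws_access_key_id, aws_secret_access_key, aws_session_token, region (all default None)
- Azure Cognitive Services OCR : endpoint (default None, in which case it is inferred from the COMPUTER_VISION_ENDPOINT environment variable), subscription_key (default None, in which case it is inferred from the COMPUTER_VISION_SUBSCRIPTION_KEY environment variable)
Multiple tables can be extracted at once from a PDF page or an image using the extract_tables method of a document.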
from img2table.ocr import TesseractOCR
from img2table.document import Image

# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")

# Instantiation of document, either an image or a PDF
doc = Image(src)

# Table extraction
extracted_tables = doc.extract_tables(ocr=ocr,
                                      implicit_rows=False,
                                      borderless_tables=False,
                                      min_confidence=50)
Parameters
- ocr : OCRInstance, optional, default None - OCR instance used to parse document text. If None, cell content will not be extracted
- implicit_rows : bool, optional, default False - Boolean indicating if implicit rows should be identified - check related example
- borderless_tables : bool, optional, default False - Boolean indicating if borderless tables are extracted in addition to bordered tables
- min_confidence : int, optional, default 50 - Minimum confidence level from OCR in order to process text, from 0 (worst) to 99 (best)
NB: Borderless table extraction can, by design, only extract tables with 3 or more columns.
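The effect of min_confidence can be pictured as a simple filter over OCR results. The sketch below uses a hypothetical Word class as a stand-in for one OCR result and assumes that a word is kept when its confidence is greater than or equal to the threshold. The comparison actually used by the library may differ.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical stand-in for a single OCR result."""
    text: str
    confidence: int  # from 0 (worst) to 99 (best)


def keep_confident(words, min_confidence=50):
    """Keep words at or above the threshold (assumed comparison)."""
    return [w for w in words if w.confidence >= min_confidence]


words = [Word("Total", 96), Word("l|", 12), Word("42", 50)]
print([w.text for w in keep_confident(words)])  # ['Total', '42']
```

Raising the threshold removes more noise at the risk of dropping faint but valid text.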
The ExtractedTable class is used to model extracted tables from documents.
Attributes
- bbox : BBox - Table bounding box
- title : str - Extracted title of the table
- content : OrderedDict - Dict with row indexes as keys and list of TableCell objects as values
- df : pd.DataFrame - Pandas DataFrame representation of the table
- html : str - HTML representation of the table
In order to access bounding boxes at the cell level, you can use the following code snippet:
for id_row, row in enumerate(table.content.values()):
    for id_col, cell in enumerate(row):
        x1 = cell.bbox.x1
        y1 = cell.bbox.y1
        x2 = cell.bbox.x2
        y2 = cell.bbox.y2
        value = cell.value
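The same traversal can be tried without running an extraction. The sketch below defines small stand-in classes that only mirror the attributes used in the snippet (bbox.x1, bbox.y1, bbox.x2, bbox.y2 and value). They are hypothetical and are not the library's own classes. The loop then computes the width and height of each cell.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class BBox:
    """Stand-in exposing only the coordinates used above."""
    x1: int
    y1: int
    x2: int
    y2: int


@dataclass
class TableCell:
    """Stand-in exposing only bbox and value."""
    bbox: BBox
    value: str


# Row index -> list of cells, as described for the content attribute.
content = OrderedDict([
    (0, [TableCell(BBox(0, 0, 50, 20), "Name"), TableCell(BBox(50, 0, 120, 20), "Age")]),
    (1, [TableCell(BBox(0, 20, 50, 40), "Ada"), TableCell(BBox(50, 20, 120, 40), "36")]),
])


def cell_sizes(content):
    """Map (row, column) to the (width, height) of each cell."""
    sizes = {}
    for id_row, row in enumerate(content.values()):
        for id_col, cell in enumerate(row):
            sizes[(id_row, id_col)] = (cell.bbox.x2 - cell.bbox.x1,
                                       cell.bbox.y2 - cell.bbox.y1)
    return sizes


print(cell_sizes(content)[(0, 1)])  # (70, 20)
```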
The extract_tables method of the Image class returns a list of ExtractedTable objects.
output = [ExtractedTable(...), ExtractedTable(...), ...]
The extract_tables method of the PDF class returns an OrderedDict object with page indexes as keys and lists of ExtractedTable objects as values.
output = {
    0: [ExtractedTable(...), ...],
    1: [],
    ...
    last_page: [ExtractedTable(...), ...]
}
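Since the PDF result is keyed by page index, a common first step is to count or flatten the tables. The sketch below uses plain strings as placeholders for ExtractedTable objects, and the helper names are hypothetical.

```python
from collections import OrderedDict


def tables_per_page(output):
    """Count the tables found on each page."""
    return {page: len(tables) for page, tables in output.items()}


def flatten(output):
    """List (page index, table) pairs in page order."""
    return [(page, table) for page, tables in output.items() for table in tables]


# Strings stand in for ExtractedTable objects.
output = OrderedDict([(0, ["table_a", "table_b"]), (1, []), (2, ["table_c"])])
print(tables_per_page(output))  # {0: 2, 1: 0, 2: 1}
print(flatten(output))          # [(0, 'table_a'), (0, 'table_b'), (2, 'table_c')]
```

Pages without tables are still present in the result with an empty list, so a missing key and an empty page are different situations.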
Tables extracted from a document can be exported to an xlsx file. The resulting file is composed of one worksheet per extracted table.
Its arguments are mostly shared with the extract_tables method.
from img2table.ocr import TesseractOCR
from img2table.document import Image

# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")

# Instantiation of document, either an image or a PDF
doc = Image(src)

# Extraction of tables and creation of an xlsx file containing tables
doc.to_xlsx(dest=dest,
            ocr=ocr,
            implicit_rows=False,
            borderless_tables=False,
            min_confidence=50)
Parameters
- dest : str, pathlib.Path or io.BytesIO, required - Destination for xlsx file
- ocr : OCRInstance, optional, default None - OCR instance used to parse document text. If None, cell content will not be extracted
- implicit_rows : bool, optional, default False - Boolean indicating if implicit rows should be identified - check related example
- borderless_tables : bool, optional, default False - Boolean indicating if borderless tables are extracted. An OCR instance must be provided to the method for this to be performed (feature in alpha version)
- min_confidence : int, optional, default 50 - Minimum confidence level from OCR in order to process text, from 0 (worst) to 99 (best)
Returns
If an io.BytesIO buffer is passed as the dest argument, it is returned containing the xlsx data.
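The in-memory variant is convenient when the file should be sent elsewhere instead of being written to disk. The sketch below shows the pattern with FakeDocument, a hypothetical stand-in that writes placeholder bytes instead of real xlsx data so that the example runs without the library. Returning None for path destinations is an assumption.

```python
import io
import tempfile
from pathlib import Path


class FakeDocument:
    """Hypothetical stand-in for a document; writes placeholder bytes only."""

    def to_xlsx(self, dest, **kwargs):
        payload = b"placeholder bytes, not a real xlsx file"
        if isinstance(dest, io.BytesIO):
            dest.write(payload)
            return dest  # the buffer is handed back, as described above
        Path(dest).write_bytes(payload)
        return None  # assumption: nothing is returned for path destinations


def export_to_bytes(doc, **kwargs) -> bytes:
    """Export into memory and return the raw bytes."""
    buffer = doc.to_xlsx(dest=io.BytesIO(), **kwargs)
    return buffer.getvalue()


data = export_to_bytes(FakeDocument(), implicit_rows=False)
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "tables.xlsx"
    target.write_bytes(data)  # persist the in-memory export
    print(target.stat().st_size == len(data))  # True
```

With a real document, the same export_to_bytes call would receive the OCR instance and the other arguments listed above through kwargs.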
Several Jupyter notebooks with examples are available, including one that illustrates the implicit_rows parameter of the extract_tables method.