winkelement / ocrstream

ResourceSpace plugin to integrate Optical Character Recognition through tesseract.
MIT License

Compare text recognition for different settings #14

Open winkelement opened 9 years ago

winkelement commented 9 years ago

Test file: scanned PDF, 300 dpi, Calibri 11 pt. Attachment: ocr_typo_test_300dpi_jpeg (JPEG preview)

winkelement commented 9 years ago

The match percentage is calculated with PHP's similar_text, which returns the percentage of similarity between the original text and the OCR result.
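For readers without a PHP setup, here is a rough Python sketch of how a similar_text-style percentage can be computed. It follows the commonly described behaviour of PHP's function (find the longest common run, recurse on the pieces to its left and right, then report matched * 2 / combined length as a percentage). It is not the plugin's code, and the function names are made up for illustration. Strings are compared as UTF-8 bytes on the assumption that this mirrors PHP's byte-wise comparison, so German umlauts count as two bytes each.

```python
def _longest_common_run(a: bytes, b: bytes):
    """Return (length, start_in_a, start_in_b) of the first longest common run."""
    best_len = best_a = best_b = 0
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best_len:
                best_len, best_a, best_b = k, i, j
    return best_len, best_a, best_b


def similar_chars(a: bytes, b: bytes) -> int:
    """Count matching bytes: longest common run plus matches left and right of it."""
    if not a or not b:
        return 0
    n, i, j = _longest_common_run(a, b)
    if n == 0:
        return 0
    return (
        n
        + similar_chars(a[:i], b[:j])
        + similar_chars(a[i + n:], b[j + n:])
    )


def similarity_percent(original: str, ocr_result: str) -> float:
    """Percentage in the style of similar_text: matched * 2 / (len1 + len2) * 100."""
    a = original.encode("utf-8")
    b = ocr_result.encode("utf-8")
    if not a and not b:
        return 0.0  # assumption: two empty inputs are reported as 0 %
    return similar_chars(a, b) * 2 * 100 / (len(a) + len(b))


if __name__ == "__main__":
    print(round(similarity_percent("World", "Word"), 2))
```

This brute-force search is slow on full pages of text; it is only meant to show what the match figures in this thread measure. Note that the score depends on character runs, so a single misread letter in the middle of a line costs very little.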

winkelement commented 9 years ago

Convert options (png): -colorspace gray -type grayscale -density 300 -geometry 1024 -crop 0x0+0+0 -quality 90 -trim -deskew 40% -normalize -adaptive-sharpen 0x1
tesseract options: -l deu -psm 3
Match: 95.69 %

winkelement commented 9 years ago

Convert options (jpg): -colorspace gray -type grayscale -density 300 -geometry 1024 -crop 0x0+0+0 -quality 90 -trim -deskew 40% -normalize -adaptive-sharpen 0x1
tesseract options: -l deu -psm 3
Match: 95.98 %

winkelement commented 9 years ago

Convert options (tif): -colorspace gray -type grayscale -density 300 -geometry 1024 -crop 0x0+0+0 -quality 90 -trim -deskew 40% -normalize -adaptive-sharpen 0x1 -depth 8
tesseract options: -l deu -psm 3
Match: 95.83 %

winkelement commented 9 years ago

Convert options (jpg): -colorspace gray -type grayscale -density 300 -geometry 1024 -crop 834x535+97+93 -quality 90 -trim -deskew 40% -normalize -sharpen 0x1
tesseract options: -l deu -psm 3
Match: 96.81 %
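To reproduce one of the runs above, the option strings can be turned into command lines roughly as sketched below. This is only an illustration: the file names (scan.pdf, ocr_input, ocr_output) are hypothetical, and the placement of the options between input and output file is an assumption, not necessarily how the plugin calls ImageMagick. The -psm spelling is copied from the settings above; newer Tesseract releases may expect --psm instead.

```python
import shlex


def build_commands(source, image_format, convert_options,
                   tesseract_options="-l deu -psm 3"):
    """Build argv lists for ImageMagick convert and tesseract (nothing is executed)."""
    image = f"ocr_input.{image_format}"  # hypothetical intermediate file name
    convert_cmd = ["convert", source, *shlex.split(convert_options), image]
    # "ocr_output" is a hypothetical output base; tesseract appends .txt itself
    tesseract_cmd = ["tesseract", image, "ocr_output",
                     *shlex.split(tesseract_options)]
    return convert_cmd, tesseract_cmd


# Option string taken from the last jpg run in this thread
JPG_CROPPED = (
    "-colorspace gray -type grayscale -density 300 -geometry 1024 "
    "-crop 834x535+97+93 -quality 90 -trim -deskew 40% -normalize -sharpen 0x1"
)

convert_cmd, tesseract_cmd = build_commands("scan.pdf", "jpg", JPG_CROPPED)

if __name__ == "__main__":
    print(shlex.join(convert_cmd))
    print(shlex.join(tesseract_cmd))
```

The lists can be passed to subprocess.run once the paths point at real files; keeping them as lists avoids shell quoting problems with values such as 40%.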