xulihang / ImageTrans-docs

Documentation of ImageTrans, a computer-aided image/comic translation tool.
https://imagetrans.readthedocs.io/

support for a good Japanese ocr #83

Closed Kamikadashi closed 2 years ago

Kamikadashi commented 3 years ago

Hi! Could you please consider adding support for the Japanese 読取革命 OCR? I've found that, on average, it gives results equal to or even better than those of Google OCR and others, especially when dealing with poor-quality, low-resolution scans. It's significantly better (and cheaper) than ABBYY and doesn't require an internet connection either. 読取革命 comes with a folder watcher that monitors a user-specified folder and OCRs any image as soon as it appears there, as well as a program that OCRs any image copied to the clipboard, so I think there are ways to make it work.
The free trial is available here: https://download4.www.sourcenext.com/yomikaku16/YOMIKAKU16.exe And apparently, both the folder watcher and the clipboard watcher keep working even after the trial expires, retaining all their functionality (you can run them from the installation folder without registering anything). The quality of the OCR can be further improved by converting the image to an indexed one with a palette of only two colors, so it would be fantastic if you added that as a pre-processing option. Still, it's by no means a requirement. Here are some examples: https://imgur.com/a/ZibcALR
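By the two-color conversion I just mean binarizing the scan to a two-entry palette before OCR. A minimal Pillow sketch of that idea (the threshold of 128 and the file names are placeholders, not a recommendation):

from PIL import Image

def to_two_color(src_path, dst_path, threshold=128):
    # Grayscale first, then map every pixel to pure black or pure white
    img = Image.open(src_path).convert("L")
    bw = img.point(lambda p: 255 if p > threshold else 0, mode="1")
    # Store as an indexed (palette) image with only two colors
    bw.convert("P").save(dst_path)

to_two_color("page_042.png", "page_042_bw.png")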

xulihang commented 3 years ago

I can't run it under Windows 10 and its interface is all Japanese.

I can run it under Windows 7, but the result is not good. Maybe it is because the system's language is Chinese.

(screenshot)

Kamikadashi commented 3 years ago

Yes, it seems to be the system language that is at fault here. You can fix it by running the application through https://pooi.moe/Locale-Emulator/ or something similar. However, it is very strange that it does not run on Windows 10, as the main difference between version 16 and version 15 is support for Windows 10 and higher. But anyway, you don't need to run the main program at all.

You can find the two applications I mentioned in the installation folder. The first one is the folder watcher, and the second one is the clipboard watcher. ClipOCR requires the Japanese system language as well, but FWatch doesn't. I translated a few menus to make it easier to understand what does what: ClipOCR: (screenshot)

FWatch: (screenshots)

xulihang commented 3 years ago

Okay, I've made it work on Windows 10.

I think the folder watcher is the best way to integrate it into ImageTrans. I've made a plugin for it.

https://github.com/xulihang/ImageTrans_plugins/commit/2a5fc70b6f0a0de2f36949f5d5fd8dc1093dd53a

Unzip the plugin files to the plugins folder and set up the watch folder path and timeout in ImageTrans's preferences.

yomikaku_plugin.zip
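Roughly speaking, the plugin drops the cropped text area into the watch folder and waits for FWatch's recognition result until the timeout is reached. A simplified Python sketch of that loop (the .txt result name and the polling interval are just for illustration, not the plugin's exact behaviour):

import os
import shutil
import time

def ocr_via_watch_folder(image_path, watch_folder, timeout=10.0, poll=0.2):
    # Copy the cropped text area into the folder that FWatch monitors
    name = os.path.basename(image_path)
    target = os.path.join(watch_folder, name)
    result = os.path.splitext(target)[0] + ".txt"  # assumed result location
    shutil.copy(image_path, target)
    # Poll for the result file until the configured timeout expires
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(result):
            with open(result, encoding="utf-8") as f:
                return f.read()
        time.sleep(poll)
    return ""  # give up after the timeout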

I checked the "strip furigana" option to improve the results.

I think it does not perform well on text with complex backgrounds or at low resolution. Sometimes Tesseract and Windows 10's OCR give a better result.

Kamikadashi commented 3 years ago

Thank you a lot, I just tested it, and it works! It's true that it doesn't deal well with complex backgrounds, but thankfully not all manga have a lot of them. It's fast, though, and quite accurate. The accuracy can be further improved by batch upscaling images with nearest neighbor before importing them into ImageTrans; the process takes less than a minute or so for a whole volume. Or by using waifu2x-caffe if you have a capable GPU, but that's admittedly more time-consuming. Strangely enough, if some kanji get recognized incorrectly, moving the box a bit to the side and clicking OCR again usually fixes it.
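For reference, the batch upscaling step is just a nearest-neighbor resize over a whole folder; a quick Pillow sketch (the 2x factor and the folder names are placeholders):

from pathlib import Path
from PIL import Image

def upscale_folder(src_dir, dst_dir, factor=2):
    # Upscale every image in src_dir with nearest neighbor before importing into ImageTrans
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).glob("*.png"):
        img = Image.open(path)
        big = img.resize((img.width * factor, img.height * factor), Image.NEAREST)
        big.save(Path(dst_dir) / path.name)

upscale_folder("volume_raw", "volume_2x")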

I encountered a problem when dealing with low-quality scans related to the furigana remover (?), though. The problem appears in some particular bubbles and persists whether or not the image was upscaled, so I don't think that's the reason. Some examples (in short, the algorithm sometimes removes text instead of furigana, or overlooks furigana entirely): https://imgur.com/a/67BGd3r

xulihang commented 3 years ago

About upscaling: in the current version, the program only scales up the cropped area if its width or height is smaller than 50px. There is an ncnn version of waifu2x, https://github.com/nihui/waifu2x-ncnn-vulkan, which is very fast. Maybe I can integrate it into the program.
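For illustration, the 50px rule plus a call to waifu2x-ncnn-vulkan would look roughly like the sketch below (simplified, not the actual ImageTrans code; -s and -n are the scale and denoise options of the waifu2x-ncnn-vulkan CLI):

import subprocess
from PIL import Image

def upscale_if_small(crop_path, out_path, min_size=50):
    # Only upscale the cropped text area if its width or height is under 50px
    img = Image.open(crop_path)
    if img.width >= min_size and img.height >= min_size:
        img.save(out_path)
        return out_path
    # waifu2x-ncnn-vulkan: -s = scale factor, -n = denoise level
    subprocess.run(["waifu2x-ncnn-vulkan", "-i", crop_path, "-o", out_path,
                    "-s", "2", "-n", "1"], check=True)
    return out_path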

About moving a box a bit to improve the result: it's true, but I don't know the reason.

About the furigana remover: if the furigana is connected to the kanji, the current method cannot remove it perfectly. Could you send a batch of the original text area images so that I can run a test?

xulihang commented 3 years ago

I've made some improvements to Japanese OCR in recent versions.

In 1.4.5, a new furigana stripping method was added. In 1.4.6, a new line mode for tesseract: https://github.com/xulihang/ImageTrans-docs/issues/87

Now, tesseract should be able to outperform Yomikaku.

Kamikadashi commented 3 years ago

I haven't tested the new tesseract mode extensively yet. However, from what I've seen, it unfortunately still doesn't outperform yomikaku on the scans I tried it on, getting kanji wrong more often than its counterpart.

The new furigana stripping method looks promising, though. Did it replace the previous one, or is it available as an option somewhere in the settings? I may be wrong as I haven't tested anything extensively yet, but at first glance, I didn't see any problems with the vanishing text I encountered earlier, so that's something.

Regarding upscaling, not all models are equally suitable for improving OCR, and I've found that only the cunet and upresnet10 models give consistently better results. For extreme cases, ESRGAN with the 8x_NMKD-Typescale_175k model can be used with Yomikaku to get the OCR rate even for scans of this quality to a decent enough level of around 98% accuracy. Still, this model can't be used with AI-based OCR services, as their results get worse instead. Besides, it's too slow and resource-heavy, so I don't think it's feasible to integrate it into ImageTrans. Cunet and upresnet10 are good enough most of the time, though, and AI OCR handles them well.

But Yomikaku doesn't handle text on complex backgrounds or light text on dark backgrounds, so it would be preferable to find something better for such cases.

xulihang commented 3 years ago

I just tested one bubble. There should be a better way to test this, like using a dataset, but I haven't tried it yet.

Yomikaku: (screenshot)

Tesseract (line mode): (screenshot)

As the line mode precisely cuts the text block into lines, it should have high accuracy. I found that this mode also works for Yomikaku. Maybe I will enable line mode for all offline OCR engines.
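For anyone curious, a simple version of this kind of line cutting for vertical text is a column projection profile: count the dark pixels in every pixel column and split at the empty gaps. A rough Python sketch (arbitrary threshold, simplified compared to the real implementation):

import numpy as np
from PIL import Image

def split_vertical_lines(block_path):
    # Split a vertical Japanese text block into single columns (lines)
    img = np.array(Image.open(block_path).convert("L"))
    ink = (img < 128).sum(axis=0)  # dark pixels per image column
    lines, start = [], None
    for x, v in enumerate(ink):
        if v > 0 and start is None:
            start = x                      # a text column begins
        elif v == 0 and start is not None:
            lines.append(img[:, start:x])  # a text column ends
            start = None
    if start is not None:
        lines.append(img[:, start:])
    return lines  # feed each column to the OCR engine separately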

xulihang commented 3 years ago

Commercial OCRs like baidu_accurate can recognize all the characters:

(screenshot)

xulihang commented 3 years ago

OCRing the same bubble after using waifu2x:

Yomikaku: (screenshot)

Tesseract: (screenshot)

Tesseract has a problem in that it may repeat a recognized character. This may be because it uses a CTC algorithm.
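As a generic illustration of why a CTC-style decoder can produce duplicates (this is not Tesseract's actual decoder): the decoder merges consecutive identical labels and drops blanks, so if a blank frame falls inside a run of the same character, both halves survive the merge.

def ctc_collapse(frames, blank="-"):
    # Generic greedy CTC post-processing: merge repeats, then drop blanks
    out, prev = [], None
    for f in frames:
        if f != prev and f != blank:
            out.append(f)
        prev = f
    return "".join(out)

print(ctc_collapse(["あ", "あ", "あ", "り"]))       # -> あり (repeats merged)
print(ctc_collapse(["あ", "-", "あ", "あ", "り"]))  # -> ああり (blank splits the run, so the character repeats)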

Kamikadashi commented 3 years ago

waifu2x is not enough for this quality, unfortunately. Yomikaku with the 8x_NMKD-Typescale_175k model: (screenshot)

xulihang commented 3 years ago

I see, and Yomikaku does have better accuracy recognizing kanji.

Kamikadashi commented 3 years ago

It consistently can't recognize ハ and mistakes it for ( though.

Kamikadashi commented 3 years ago

Also, you may know this already, but apparently there is a way to get Google OCR to work via the Google Drive API without resorting to the Vision API, like here: https://github.com/ttop32/JMTrans It would be cool to see this implemented.
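As I understand it, the trick is to upload the image while asking Drive to convert it to a Google Doc (which triggers OCR) and then export that Doc as plain text. A rough sketch with google-api-python-client (it assumes an already authorized drive service object; error handling is omitted):

from googleapiclient.http import MediaFileUpload

def drive_ocr(drive, image_path, language="ja"):
    # Upload the image and ask Drive to convert it to a Google Doc, which runs OCR
    meta = {"name": "ocr_tmp", "mimeType": "application/vnd.google-apps.document"}
    media = MediaFileUpload(image_path, mimetype="image/png")
    doc = drive.files().create(body=meta, media_body=media,
                               ocrLanguage=language, fields="id").execute()
    # Export the converted Doc as plain text, then delete the temporary file
    text = drive.files().export(fileId=doc["id"], mimeType="text/plain").execute()
    drive.files().delete(fileId=doc["id"]).execute()
    return text.decode("utf-8") if isinstance(text, bytes) else text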

Kamikadashi commented 3 years ago

I think moving a box a bit to improve the result happens for the same reason some characters get OCRed incorrectly when the "strip furigana" or "vertical to horizontal" options are enabled. ImageTrans handles all textboxes as lossy JPEG instead of a lossless format, so the pixel representation of the symbols changes slightly after every subsequent read/write operation.
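A quick way to check how much a single JPEG round trip changes the pixels, with Pillow (file names are placeholders): getbbox() returns a bounding box if any pixel differs after one save/reload cycle, and None if the image survived unchanged.

from PIL import Image, ImageChops

original = Image.open("textbox.png").convert("RGB")
original.save("roundtrip.jpg", "JPEG", quality=100)  # one lossy round trip at quality 100
reloaded = Image.open("roundtrip.jpg").convert("RGB")
diff = ImageChops.difference(original, reloaded)
print(diff.getbbox())  # a box means at least one pixel changed; None means no change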

xulihang commented 3 years ago

Yes, I do use JPEG with 100 quality. I am not sure if it has much influence.

' Save the text area image as a JPEG with quality 100
Dim out As OutputStream = File.OpenOutput(imgPath, "", False)
img.WriteToStream(out, 100, "JPEG")
out.Close