zyddnys / manga-image-translator

Translate manga/image (one-click translation of text in all kinds of images) https://cotrans.touhou.ai/
GNU General Public License v3.0

Finding the area where to put the translated English text #432

Open PottedRosePetal opened 11 months ago

PottedRosePetal commented 11 months ago

What would your feature do?

Text rendering area is determined by detected text lines, not speech bubbles. This works for images without speech bubbles, but makes it impossible to decide where to put translated English text. I have no idea how to solve this.

I am not entirely sure about the architecture so far, as I have only had a rough look at the code. I could offer to write some code that takes in an image with the text removed, a text translation, and the coordinates of the original text, and returns the image with the translated text placed into it - or at least regions where it would be okay to put the text.

Given the coordinates of the text, it would be easy to detect edges around those coordinates in the text-removed image: basically, walk outward in every direction until you hit a big fat black line (a bit more sophisticated than that, but you get the idea). Once I have a potential speech bubble, I can return a binary image with certain regions on and the rest off. Each region could of course be numbered and correspond to one piece of text. A rough sketch of this idea is below.
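A minimal sketch of that flood-fill approach, assuming OpenCV and a grayscale, text-removed page; the function name, the threshold, and the leak check are all illustrative, not part of the existing codebase:

```python
import cv2
import numpy as np

def bubble_masks(cleaned: np.ndarray, text_centers: list[tuple[int, int]]) -> np.ndarray:
    """Label candidate speech-bubble regions in a grayscale, text-removed page.

    cleaned:      grayscale image with the original text already inpainted away
    text_centers: (x, y) centroids of the removed text lines, used as fill seeds

    Returns an int32 label map: 0 = background, k = bubble for text block k.
    """
    # Dark bubble outlines become 0 and act as flood-fill barriers
    # (the "big fat black lines" above).
    _, binary = cv2.threshold(cleaned, 200, 255, cv2.THRESH_BINARY)

    labels = np.zeros(cleaned.shape[:2], dtype=np.int32)
    for k, (x, y) in enumerate(text_centers, start=1):
        # floodFill requires a mask 2 px larger than the image on each side.
        mask = np.zeros((binary.shape[0] + 2, binary.shape[1] + 2), np.uint8)
        cv2.floodFill(binary.copy(), mask, (x, y), 128)
        region = mask[1:-1, 1:-1].astype(bool)
        # Reject fills that leak across most of the page (no enclosing outline).
        if region.mean() < 0.5:
            labels[region] = k
    return labels
```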

I would be up to work on that over the next weeks, but before I start I would love to know if there is anything that would prevent me from doing it that way or that would be unwanted. Also, are you aware of any similar existing work? And lastly, where should such a feature go in the existing folder structure: OCR, detection, textline_merge, ...?

An alternative way to do this could be the following: https://stackoverflow.com/questions/34356635/detecting-comic-strip-dialogue-bubble-regions-in-images

zyddnys commented 11 months ago

What you are describing seems aimed at images with speech bubbles; however, I am talking about images without speech bubbles, where the vertical text is too slim for horizontal English text to be put in its place. Some examples: (two example images attached)

Both images have three paragraphs. Of course there's plenty of empty white area to put text in, but the problem is when the vertical Japanese text is slim and squeezed in between visual elements, like the leftmost text in this example: (example image attached)

If you have solutions, I'd like to hear them.

PottedRosePetal commented 11 months ago

Oh I see what you mean.

I think there are two solutions. One is very simple and you probably already thought of it, but for completeness' sake: very short text like "aaah" or "huh", or maybe even short sentences, can simply be placed in the same space vertically, as in the sketch below.
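A minimal sketch of vertical placement, assuming Pillow; draw_vertical and the spacing constant are hypothetical, and real code would also need to handle longer runs of text:

```python
from PIL import Image, ImageDraw, ImageFont

def draw_vertical(draw: ImageDraw.ImageDraw, text: str, xy, font) -> None:
    """Stack characters top-to-bottom in the slim column left by vertical Japanese text."""
    x, y = xy
    for ch in text:
        draw.text((x, y), ch, font=font, fill="black")
        # textbbox returns absolute coordinates, so `bottom` is where the next
        # character should start.
        _, _, _, bottom = draw.textbbox((x, y), ch, font=font)
        y = bottom + 2  # small fixed gap between stacked characters

img = Image.new("RGB", (100, 400), "white")
draw_vertical(ImageDraw.Draw(img), "huh?!", (40, 20), ImageFont.load_default())
```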

Another option is to do it a bit like it's often done in manhwas: (example image attached). https://asura.gg/everyone-else-is-a-returnee-chapter-45/ also has a few other examples, though whether a given panel is translated seems random... but it does show the options.

To realize this automatically and without covering important content, you would need some sort of segmentation algorithm. It would need to segment the page into entities such as bubbles, humans/monsters, and buildings, plus open areas like forests, sky, sea, etc. Training this would be no small task, but I believe I have seen something similar online already? Not sure. You could then set a per-page parameter that decides how liberal the placement is: whether the text is allowed to overlap bubbles, monsters, forests, and so on. Another option would be to introduce some transparency, maybe? A sketch of the segmentation step follows.
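A minimal sketch of that segmentation step, assuming torchvision's VOC-pretrained DeepLabV3 (which would transfer poorly to black-and-white manga without fine-tuning on comic data); the function name and class choice are illustrative:

```python
import numpy as np
import torch
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50

model = deeplabv3_resnet50(weights="DEFAULT").eval()
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def keep_out_mask(page_rgb: np.ndarray, forbidden_classes=(15,)) -> np.ndarray:
    """Boolean mask of pixels text must not cover (VOC class 15 = 'person')."""
    x = preprocess(page_rgb).unsqueeze(0)
    with torch.no_grad():
        pred = model(x)["out"].argmax(1).squeeze(0).numpy()
    return np.isin(pred, forbidden_classes)
```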

Secondly, some algorithm would be needed to angle the text. In your third example, the text would probably need to be angled to fit properly into the moon; it would stay strictly inside the moon, though. This step would also need to scale the text accordingly and place it within a certain radius/segment of the original text, i.e. in your first image the lower left corner would probably be fine, but in the last one it would be confusing. The scaling part might look like the sketch below.
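The scaling half of that is straightforward; a minimal sketch with Pillow, where fit_font, the font path, and the size bounds are all hypothetical (angling would add a rotation of the rendered text layer on top of this):

```python
from PIL import ImageDraw, ImageFont

def fit_font(draw: ImageDraw.ImageDraw, text: str, box_w: int, box_h: int,
             font_path: str = "fonts/example.ttf", max_pt: int = 64):
    """Return the largest font at which `text` fits inside a box_w x box_h region."""
    for pt in range(max_pt, 7, -1):  # shrink until the rendered bbox fits
        font = ImageFont.truetype(font_path, pt)
        left, top, right, bottom = draw.textbbox((0, 0), text, font=font)
        if right - left <= box_w and bottom - top <= box_h:
            return font
    return ImageFont.truetype(font_path, 8)  # give up at a minimum legible size
```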

Those are all hyperparameters of a sort that would also need to be set very carefully by the user, or simply learned once the other parts are done.

All of those things can be done with machine learning, but they don't have to be. Segmentation can be done with classical image processing, though I can imagine that being very hard for black-and-white images. Another option would be to combine both: segment coarsely with ML, just going for humans, and then use something like a watershed algorithm that avoids the humans to determine where to write the text. A sketch of that combination is below.
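A minimal sketch of that combination, substituting a distance transform for a full watershed: it finds the free pixel with the most clearance near the original text. keep_out would come from the segmentation sketch above; the function name and the radius are illustrative.

```python
import cv2
import numpy as np

def best_anchor(keep_out: np.ndarray, near: tuple[int, int], radius: int = 300):
    """Pick the point with the most clearance from forbidden pixels, close to `near`.

    keep_out: boolean mask from segmentation (True = don't write here)
    near:     (x, y) of the original Japanese text, to stay contextually close
    """
    free = (~keep_out).astype(np.uint8)
    # Distance to the nearest forbidden pixel = available clearance at each point.
    clearance = cv2.distanceTransform(free, cv2.DIST_L2, 5)
    ys, xs = np.mgrid[0:free.shape[0], 0:free.shape[1]]
    # Zero out candidates outside the allowed radius around the source text.
    clearance[(xs - near[0]) ** 2 + (ys - near[1]) ** 2 > radius ** 2] = 0
    y, x = np.unravel_index(np.argmax(clearance), clearance.shape)
    return (int(x), int(y)), float(clearance[y, x])  # anchor point and usable half-width
```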

All of that could work. But for it to work well, quite a lot of compute and effort would be needed, and I think that within a few months to years, models like ChatGPT will be able to process images as well, which would make this problem drastically less difficult. Also, authors would probably find some weird edge case that completely messes things up. (wtf is this: (example image attached))

TL;DR: segmentation using ML to determine which areas are valid to write text into would probably work.

zyddnys commented 11 months ago

I can't predict the future, but personally I think this problem will be solved before long by some even larger multimodal LLM. Until then, any non-end-to-end solution like segmentation is going to take a lot of effort. I do believe the best solution for now is collecting paired Japanese manga and translated English releases and training a model on those pairs.