yuantailing / ctw-baseline

Baseline methods for [CTW dataset](https://ctwdataset.github.io/)
MIT License
330 stars 88 forks

word level annotations? #1

Closed vsooda closed 6 years ago

vsooda commented 6 years ago

Thanks for the great dataset.

I looked into the dataset and see that it is character-based: you recognize text by running detection with a separate category for each character. My approach is different: I detect word bounding boxes first and then recognize the text inside them.

I could write code to convert the annotations to a word-level format, but that is time-consuming. Could you also provide word-level annotations? They would be much easier to use for people like me.

yuantailing commented 6 years ago

See each_char in pythonapi/anno_tools.py. For each block, compute the bounding box that encloses every char['polygon'] (or char['adjusted_bbox']) in the block, and concatenate char['text'].

yuantailing commented 6 years ago

It may look like this.

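```python
from __future__ import print_function
import io
import json


def each_word(anno):
    """Yield ((x, y, w, h), text) for every block in one image annotation."""
    for block in anno['annotations']:
        xx, yy = [], []
        s = ''
        for char in block:
            # Collect every polygon vertex of every character in the block.
            for xy in char['polygon']:
                xx.append(xy[0])
                yy.append(xy[1])
            # Only Chinese characters contribute to the text.
            if char['is_chinese']:
                s += char['text']
        if not xx:
            continue  # empty block: no vertices to bound
        yield (min(xx), min(yy), max(xx) - min(xx), max(yy) - min(yy)), s


if __name__ == '__main__':
    # Each line of the .jsonl file is the annotation of one image; read the first.
    with io.open('../data/annotations/train.jsonl', encoding='utf-8') as f:
        anno = json.loads(f.readline())
    for bbox, s in each_word(anno):
        print(bbox, s)
```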
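
The same grouping can also be done with char['adjusted_bbox'], the alternative mentioned earlier in the thread. Below is a minimal sketch of that variant, not code from the repository. It assumes that adjusted_bbox is laid out as [xmin, ymin, width, height]; the function name each_word_bbox is hypothetical, and the sample annotation is made up for illustration.

```python
def each_word_bbox(anno):
    """Yield ((x, y, w, h), text) per block, built from char['adjusted_bbox'].

    Assumption: adjusted_bbox is [xmin, ymin, width, height].
    """
    for block in anno['annotations']:
        if not block:
            continue  # nothing to bound in an empty block
        # Union of the per-character boxes: smallest top-left corner,
        # largest bottom-right corner.
        x0 = min(c['adjusted_bbox'][0] for c in block)
        y0 = min(c['adjusted_bbox'][1] for c in block)
        x1 = max(c['adjusted_bbox'][0] + c['adjusted_bbox'][2] for c in block)
        y1 = max(c['adjusted_bbox'][1] + c['adjusted_bbox'][3] for c in block)
        # As in the polygon version, only Chinese characters go into the text.
        text = ''.join(c['text'] for c in block if c['is_chinese'])
        yield (x0, y0, x1 - x0, y1 - y0), text


# Made-up annotation for illustration only, not taken from the dataset.
sample = {'annotations': [[
    {'adjusted_bbox': [10, 20, 30, 40], 'is_chinese': True, 'text': u'中'},
    {'adjusted_bbox': [45, 22, 30, 38], 'is_chinese': True, 'text': u'文'},
]]}
words = list(each_word_bbox(sample))
print(words)
```

The axis-aligned box from adjusted_bbox may differ slightly from the one derived from the polygon vertices, so it is worth checking which one suits the downstream word detector.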

vsooda commented 6 years ago

Awesome! Thank you very much!