vsooda closed this issue 6 years ago
See each_char in pythonapi/anno_tools.py. For each block, just compute the bounding box of char['polygon'] (or char['adjusted_bbox']) and concatenate char['text'].
It may look like this.
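from __future__ import print_function
import json


def each_word(anno):
    # Each block in anno['annotations'] is a list of character annotations.
    for block in anno['annotations']:
        xx, yy = [], []
        s = ''
        for char in block:
            # Collect every polygon vertex of every character in the block.
            for xy in char['polygon']:
                xx.append(xy[0])
                yy.append(xy[1])
            if char['is_chinese']:
                s += char['text']
        # Yield (xmin, ymin, width, height) of the block and its text.
        yield (min(xx), min(yy), max(xx) - min(xx), max(yy) - min(yy)), s


if __name__ == '__main__':
    with open('../data/annotations/train.jsonl') as f:
        # Only the first record (one image) is read here.
        anno = json.loads(f.readline())
    for bbox, s in each_word(anno):
        print(bbox, s)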
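If you prefer char['adjusted_bbox'] over char['polygon'], a variant could look roughly like the sketch below. It assumes adjusted_bbox is stored as [xmin, ymin, width, height]; please check that against the annotation format before relying on it. The names union_bbox and each_word_adjusted, and the sample record, are made up for illustration and are not part of pythonapi/anno_tools.py.

```python
def union_bbox(boxes):
    """Smallest axis-aligned (xmin, ymin, w, h) box covering all input boxes.

    Assumes each input box is [xmin, ymin, width, height].
    """
    x0 = min(b[0] for b in boxes)
    y0 = min(b[1] for b in boxes)
    x1 = max(b[0] + b[2] for b in boxes)
    y1 = max(b[1] + b[3] for b in boxes)
    return (x0, y0, x1 - x0, y1 - y0)


def each_word_adjusted(anno):
    """Yield (bbox, text) per block, using adjusted_bbox instead of polygon."""
    for block in anno['annotations']:
        if not block:
            # Nothing to merge; skipping avoids min() on an empty sequence.
            continue
        bbox = union_bbox([char['adjusted_bbox'] for char in block])
        text = ''.join(char['text'] for char in block if char['is_chinese'])
        yield bbox, text


# Hypothetical in-memory record with one block of two characters.
sample = {'annotations': [[
    {'adjusted_bbox': [10, 20, 30, 40], 'text': '中', 'is_chinese': True},
    {'adjusted_bbox': [45, 18, 30, 44], 'text': '文', 'is_chinese': True},
]]}
words = list(each_word_adjusted(sample))
print(words)
```

To convert a whole annotation file rather than a single image, the same generator can be called once per line of the .jsonl file.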
Awesome! Thank you very much!
Thanks for the great dataset.
I looked into the dataset. It is character-based, and you recognize text by running detection with a separate category for each character. My approach is different: I detect the word bounding box first and then recognize the text inside it.
Maybe I could write code to convert the annotations to a word-level format, but that is time-consuming. Could you also offer word-level annotations? They might be much easier to use for people like me.