yuantailing / ctw-baseline

Baseline methods for [CTW dataset](https://ctwdataset.github.io/)
MIT License

Hello, could you please share the Chinese character set you built? Your paper says the dataset contains 3850 distinct Chinese characters #5

Closed: Banyueqin closed this issue 6 years ago

yuantailing commented 6 years ago

Please take a look at the annotation format and extract the characters from the annotations yourself.

yuantailing commented 6 years ago

Example code:

import json

from pythonapi import anno_tools

if __name__ == '__main__':
    # Collect every distinct character label found in the train and val annotations.
    chars = set()
    for path in ('../data/annotations/train.jsonl',
                 '../data/annotations/val.jsonl'):
        with open(path) as f:
            # Each line of a .jsonl file is one JSON-encoded image annotation.
            for line in f:
                anno = json.loads(line)
                for char in anno_tools.each_char(anno):
                    chars.add(char['text'])
    print(chars)

Banyueqin commented 6 years ago

谢谢

pycoco commented 4 years ago

Why do I get only 3768 characters?

yuantailing commented 4 years ago

> Why do I get only 3768 characters?

Some character categories appear only in the test set.

pycoco commented 4 years ago

Thank you. I used the code above to generate the dictionary, but I still get a KeyError when training, and I don't know why.
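
The thread does not confirm the cause of this KeyError. One possible explanation is that a lookup hits a character missing from a dictionary built only from the train and val annotations. The sketch below shows one way to make such a lookup tolerant of unseen characters. The names `UNKNOWN`, `build_char_index`, and `lookup` are hypothetical and are not part of the ctw-baseline code.

```python
UNKNOWN = '<unk>'  # hypothetical placeholder label, not part of ctw-baseline


def build_char_index(chars):
    """Map each character to an integer id, reserving id 0 for unknown characters."""
    index = {UNKNOWN: 0}
    # Sort so the ids are reproducible across runs; skip the placeholder itself.
    for ch in sorted(set(chars) - {UNKNOWN}):
        index[ch] = len(index)
    return index


def lookup(index, ch):
    """Return the id for ch, or the unknown id instead of raising KeyError."""
    return index.get(ch, index[UNKNOWN])


if __name__ == '__main__':
    index = build_char_index({'中', '文', '字'})
    print(lookup(index, '中'))
    print(lookup(index, '库'))  # absent from the set, so the unknown id is returned
```

Whether an unknown class is appropriate depends on the training code. If the model expects a fixed list of categories, the dictionary would need to be built from that same list instead.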