Hello, could you please provide the Chinese character set you built? Your paper says the dataset contains 3850 distinct Chinese characters #5

yuantailing / ctw-baseline

Baseline methods for [CTW dataset](https://ctwdataset.github.io/)

MIT License



Closed. Banyueqin closed this issue 6 years ago

yuantailing commented 6 years ago

Please take a look at the annotation format and extract the characters from the annotations yourself.

yuantailing commented 6 years ago

Example code:

import json

from pythonapi import anno_tools

if __name__ == '__main__':
    # Collect every distinct character found in the train and val annotations.
    s = set()
    for path in ('../data/annotations/train.jsonl',
                 '../data/annotations/val.jsonl'):
        with open(path) as f:
            for line in f:  # one JSON-encoded image annotation per line
                anno = json.loads(line)
                for char in anno_tools.each_char(anno):
                    s.add(char['text'])
    print(s)
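For readers without the repository checked out, the same extraction can be sketched without `pythonapi`. This is a minimal sketch and not the repository's own code. It assumes that each JSONL line holds an `annotations` list of sentences and that each sentence is a list of character dicts with a `text` field. The local `each_char` helper, the `count_chars` function, and the two sample lines are made up for illustration.

```python
import io
import json
from collections import Counter


def each_char(anno):
    # Assumed layout: anno['annotations'] is a list of sentences,
    # and each sentence is a list of character dicts with a 'text' field.
    for sentence in anno['annotations']:
        for char in sentence:
            yield char


def count_chars(lines):
    # Count how often each character category occurs across all lines.
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        for char in each_char(json.loads(line)):
            counts[char['text']] += 1
    return counts


# Two made-up annotation lines standing in for train.jsonl / val.jsonl.
sample = io.StringIO(
    json.dumps({'annotations': [[{'text': '中'}, {'text': '文'}]]}) + '\n' +
    json.dumps({'annotations': [[{'text': '中'}], [{'text': '字'}]]}) + '\n'
)

counts = count_chars(sample)
print(len(counts))            # number of distinct characters: 3
print(counts.most_common(1))  # [('中', 2)]
```

Using a `Counter` instead of a plain set also gives per-character frequencies, which helps when checking how rare a category is.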

Banyueqin commented 6 years ago

谢谢

pycoco commented 4 years ago

Why do I get only 3768 characters?

yuantailing commented 4 years ago

> Why do I get only 3768 characters?

Some character categories appear only in the test set.

pycoco commented 4 years ago

Thank you. I used the code above to generate the dictionary, but I still get a KeyError when training, and I don't know why.
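The thread does not state the cause of the KeyError. One possibility, consistent with the reply above, is that a character is looked up that never appeared in the train and val annotations. The sketch below shows a lookup that tolerates such characters. The names `build_vocab`, `lookup`, and `UNKNOWN`, as well as the sample characters, are hypothetical and not part of the ctw-baseline code.

```python
UNKNOWN = -1  # hypothetical sentinel index for characters missing from the dict


def build_vocab(chars):
    # Sort so that the index assignment is reproducible between runs.
    return {ch: i for i, ch in enumerate(sorted(chars))}


def lookup(vocab, ch):
    # dict.get returns the fallback instead of raising KeyError.
    return vocab.get(ch, UNKNOWN)


# Made-up stand-in for the set produced from train.jsonl and val.jsonl.
train_val_chars = {'中', '文', '字'}
vocab = build_vocab(train_val_chars)

# '库' is not in the set, so it maps to UNKNOWN rather than raising.
labels = [lookup(vocab, ch) for ch in '中文库']
print(labels)  # [0, 2, -1]
```

Whether unseen characters should map to a sentinel, be skipped, or be treated as an error depends on the training code, so this only illustrates the lookup pattern.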