Question: excessive memory usage when loading the dataset

yongzhuo / Pytorch-NLU

Pytorch-NLU is a Chinese text classification and sequence labeling toolkit. It supports multi-class and multi-label classification of long and short Chinese text, as well as sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, word segmentation, and extractive text summarization.

https://blog.csdn.net/rensihui

Apache License 2.0

328 stars 52 forks

Question: excessive memory usage when loading the dataset #9

Open ykallan opened 1 year ago

ykallan commented 1 year ago

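I'm doing multi-label text classification on a dataset of more than 900,000 samples. The txt file is under 200 MB, but loading the dataset uses far too much memory. I'm not sure whether this is a bug or expected behavior: a machine with 32 GB of RAM cannot even load a quarter of the data.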

yongzhuo commented 1 year ago

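How long are the texts? The loader reads the entire dataset into memory at once. If that does not fit, it has to be changed to a yield-based (generator) form.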
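A minimal sketch of the yield-based approach mentioned above, not the project's actual loader. It assumes the data file holds one JSON object per line; the function names `iter_samples` and `iter_batches` are hypothetical and do not come from Pytorch-NLU.

```python
import json
from typing import Dict, Iterator, List


def iter_samples(path: str, encoding: str = "utf-8") -> Iterator[Dict]:
    """Yield one parsed sample at a time instead of building a full list."""
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Assumption: each non-empty line is a standalone JSON object.
            yield json.loads(line)


def iter_batches(path: str, batch_size: int) -> Iterator[List[Dict]]:
    """Group the lazily read samples into batches of at most batch_size."""
    batch: List[Dict] = []
    for sample in iter_samples(path):
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch
```

With this shape only one batch of parsed samples is held in memory at a time, at the cost of re-reading the file on every epoch and losing global shuffling unless the file is shuffled beforehand.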

ykallan commented 1 year ago

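> How long are the texts? The loader reads the entire dataset into memory at once. If that does not fit, it has to be changed to a yield-based (generator) form.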

The texts are roughly 64 characters long, and I set max_len = 64.

I'll try the yield approach later.

yongzhuo commented 1 year ago

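A text length of 64 should not cause this. Could the number of labels be the reason? The project's preprocessing converts labels to one-hot vectors by default, so every sample stores a dense vector as long as the label set. You could move the one-hot conversion into the data_collator, or use a sparse loss function.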
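A rough sketch of deferring the one-hot conversion to batch time, under the assumption that each sample carries a list of label names. The helpers `encode_label_ids` and `multi_hot_batch` and the `label2id` mapping are hypothetical names for illustration, not part of Pytorch-NLU. The idea is to keep only the indices of active labels per sample and to build the dense rows for one batch at a time, so dense storage scales with batch size × number of labels rather than dataset size × number of labels.

```python
from typing import Dict, List, Sequence


def encode_label_ids(labels: Sequence[str], label2id: Dict[str, int]) -> List[int]:
    """Store only the indices of the active labels for one sample."""
    return sorted(label2id[name] for name in labels)


def multi_hot_batch(
    batch_label_ids: Sequence[Sequence[int]], num_labels: int
) -> List[List[float]]:
    """Expand per-sample index lists into dense multi-hot rows for one batch."""
    rows = []
    for ids in batch_label_ids:
        row = [0.0] * num_labels
        for i in ids:
            row[i] = 1.0  # mark each active label
        rows.append(row)
    return rows
```

In a PyTorch `DataLoader`, a function like `multi_hot_batch` would be called from the `collate_fn`, with the result wrapped as `torch.tensor(rows, dtype=torch.float)` before it is passed to the loss.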