Hi, I ran into some problems during training and would appreciate some pointers. Thanks!

zjz5250 commented 4 years ago

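@zhang0jhon Hello, and thank you very much for your work. The model you open-sourced performs really well. I am trying to reproduce the training process, but I ran into the following three problems:

1) Training is very slow. Using only the LSVT data, one epoch takes about 6 hours.
2) I tried multi-GPU training, but it is about as fast as a single GPU. I am using 2080 Ti cards.
3) I tested the model after 30 epochs, and the recognition accuracy is very poor.

I would like to ask:

1) How many epochs are appropriate for training, and what initial lr and batch size should be used?
2) Is training this slow for you with multiple GPUs as well? Is there any way to speed it up?
3) How should the weakly annotated data in LSVT be used? It has no text-region coordinates, so how is the mask processing done?

Thanks a lot!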

etatbak commented 4 years ago

Did you select your GPU in the config file? It should take around 10-15 minutes per epoch. @zjz5250

zjz5250 commented 4 years ago

@etatbak Thank you! Yes, I use the GPU; I set gpus = [0] in config.py. How many steps are there in one epoch when you train the model? I found that reading images takes a lot of time in every step. I set the batch size to 16, and it takes about 2 seconds to read 16 images.
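
To check whether image reading is the bottleneck, one option is to time the data pipeline on its own, without the model. The following is a minimal sketch, assuming the loader can be iterated as a plain Python iterable of batches. The names time_batches and fake_loader are hypothetical and are not part of the repository.

```python
import time
from typing import Iterable, Iterator


def time_batches(batches: Iterable, max_batches: int = 50) -> float:
    """Return the average number of seconds spent fetching one batch."""
    it = iter(batches)
    fetched = 0
    start = time.perf_counter()
    for _ in range(max_batches):
        try:
            next(it)  # only the fetch is timed; no model step is run
        except StopIteration:
            break
        fetched += 1
    elapsed = time.perf_counter() - start
    if fetched == 0:
        raise ValueError("the iterable produced no batches")
    return elapsed / fetched


def fake_loader(num_batches: int, delay: float) -> Iterator[list]:
    """Hypothetical stand-in for a real loader: sleeps, then yields a batch of 16 items."""
    for _ in range(num_batches):
        time.sleep(delay)
        yield [0] * 16


if __name__ == "__main__":
    avg = time_batches(fake_loader(5, 0.01))
    print(f"average fetch time: {avg:.3f} s per batch")
```

If the measured average is close to the 2 seconds per batch reported above, the input pipeline rather than the GPU is probably what limits throughput.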

zjz5250 commented 4 years ago

@etatbak Did you change the value of steps_per_epoch? The default value is 1500, but actually it should be a large number. For example, if the training set has 16000 images and the batch size is 16, then steps_per_epoch should be 1000. Am I right?

zjz5250 commented 4 years ago

I use the LSVT dataset, which has 238790 images in total. I set the batch size to 16, so steps_per_epoch is 14924. When I train the model, one epoch takes about 6 hours. What is worse, after 11 epochs the model does not work at all.
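
The arithmetic above can be checked directly. This is a minimal sketch, assuming one epoch is a single pass over the dataset. The helpers steps_per_epoch and seconds_per_step are hypothetical names for illustration, and the drop_last flag is an assumption about whether the final partial batch is skipped.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int, drop_last: bool = True) -> int:
    """Number of steps needed for one full pass over the dataset."""
    if num_samples <= 0 or batch_size <= 0:
        raise ValueError("num_samples and batch_size must be positive")
    if drop_last:
        # Skip the final partial batch.
        return num_samples // batch_size
    # Keep the final partial batch.
    return math.ceil(num_samples / batch_size)


def seconds_per_step(epoch_hours: float, steps: int) -> float:
    """Average wall-clock seconds per step implied by the epoch duration."""
    return epoch_hours * 3600.0 / steps


if __name__ == "__main__":
    steps = steps_per_epoch(238790, 16)
    print(steps, round(seconds_per_step(6, steps), 2))
```

With the figures reported in this thread, that gives 14924 steps and roughly 1.45 seconds per step, the same order of magnitude as the 2 seconds reported for reading 16 images. This suggests, but does not prove, that data loading accounts for most of each step.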

etatbak commented 4 years ago

@zjz5250 I didn't change many parameters. But I only used the ReCTS dataset, so I think it would also take longer if I used LSVT. I think steps_per_epoch is 500. My batch_size is 10. I trained for 1000 epochs, but the model doesn't work well, not even at an average level, and I am not sure how to improve the performance.

zjz5250 commented 4 years ago

@etatbak
Did you use a pb file converted from your newly trained model when you tested the accuracy?

"You must feed a value for placeholder tensor 'label' with dtype int32 and shape [?,33]" Did you run into this problem, and how did you fix it?

JianYang93 commented 4 years ago

@zjz5250 @etatbak @zhang0jhon Hi, I used all ReCTS, ArT, LSVT and IC2017MLT data and trained for 5 epochs on a single GPU (takes a day). I got training loss around 2 but very high validation loss. Do you have any idea on this?

JianYang93 commented 4 years ago

@zhang0jhon Could you please share what level of training and validation loss you got with the final model? Thanks!

ustczhouyu commented 4 years ago

@zjz5250 Hello, I get an error during training because icdar_datasets.npy is missing. Could you send this file to my email, zhou19920226@126.com? I would be very grateful.

ustczhouyu commented 4 years ago

@zhang0jhon Hello, thank you for sharing the code. I failed to train the model. Could you send icdar_datasets.npy to my email, zhou19920226@126.com? Thank you very much.

JianYang93 commented 4 years ago

@ustczhouyu Hi, you will need to run dataset.py first to generate the npy file

JianYang93 commented 4 years ago

I got a validation loss of around 1.3. The model can recognize some parts of the text, but the overall accuracy is relatively poor. I checked that the pretrained recognition model has a loss of around 0.5, so that should be the goal.

whereitogo commented 3 years ago

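I think the way the data is read should be changed. As far as I can see, the author's data reader loads the entire image, which is too slow. I plan to change it to load the cropped images instead.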
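
One way to do this is to crop the text regions once, offline, and have the training loader read only the small crops. The following is a minimal sketch of the box computation, assuming each annotation is a polygon given as a list of (x, y) points. polygon_to_crop_box is a hypothetical helper and not the repository's code. The returned tuple uses the (left, upper, right, lower) order that Pillow's Image.crop accepts.

```python
import math
from typing import Iterable, Tuple

Point = Tuple[float, float]
Box = Tuple[int, int, int, int]


def polygon_to_crop_box(points: Iterable[Point], img_w: int, img_h: int, pad: int = 0) -> Box:
    """Axis-aligned box around a text polygon, padded and clamped to the image bounds."""
    pts = list(points)
    if not pts:
        raise ValueError("polygon has no points")
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    # Round outward so the whole polygon stays inside the box, then clamp.
    left = max(0, math.floor(min(xs)) - pad)
    top = max(0, math.floor(min(ys)) - pad)
    right = min(img_w, math.ceil(max(xs)) + pad)
    bottom = min(img_h, math.ceil(max(ys)) + pad)
    if right <= left or bottom <= top:
        raise ValueError("polygon lies outside the image")
    return left, top, right, bottom


if __name__ == "__main__":
    polygon = [(10.2, 20.7), (50, 18), (52.5, 40), (9, 42)]
    print(polygon_to_crop_box(polygon, img_w=100, img_h=60, pad=2))
```

Saving each crop to disk once means every training step decodes a small image instead of the full scene image. This approach needs region coordinates, so it does not cover weakly annotated samples that lack them.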

xianzhe-741 commented 3 years ago

Hello, I ran into two problems while using the code and would like to ask about them:

1. When running test.py with the model text_recognition5435.pb from the author's docker image, an error is raised at = tf.import_graph_def(graph_def, name=''):

InvalidArgumentError (see above for traceback): The second input must be a scalar, but it has shape [1,33]

2. When running train.py, the following error is raised:

File "/usr/local/lib/python3.5/dist-packages/tensorpack/train/config.py", line 119, in __init__
    assert_type(model, ModelDescBase, 'model')
File "/usr/local/lib/python3.5/dist-packages/tensorpack/train/config.py", line 107, in assert_type
    name, tp.__name__, v.__class__.__name__)
AssertionError: model has to be type 'ModelDescBase', but an object of type 'AttentionOCR' found.

zhang0jhon / AttentionOCR

Hi, I ran into some problems during training and would appreciate some pointers. Thanks! #71