Closed by zdx1012 5 years ago
You ran out of memory. Try reducing batch_size.
You mean GPU memory, right? Where is batch_size set?
In model.py, at line 280 or line 285, depending on which model you are using:
self.train_data = BertDataUtils(tokenizer, batch_size=5)
self.dev_data = BertDataUtils(tokenizer, batch_size=10)
I'll change it to 3 and try.
2019-06-24 14:57:37.585939: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5706 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
WARNING:tensorflow:From model.py:468: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
[->] restore model
2019-06-24 14:57:38.644984: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Data loss: Checksum does not match: stored 4265858936 vs. calculated on the restored bytes 771062205
Traceback (most recent call last):
File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.DataLossError: Checksum does not match: stored 4265858936 vs. calculated on the restored bytes 771062205
[[{{node save/RestoreV2}}]]
[[{{node save/RestoreV2}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "model.py", line 505, in <module>
model.predict()
File "model.py", line 470, in predict
self.saver.restore(sess, ckpt.model_checkpoint_path)
File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1276, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: Checksum does not match: stored 4265858936 vs. calculated on the restored bytes 771062205
[[node save/RestoreV2 (defined at model.py:230) ]]
[[node save/RestoreV2 (defined at model.py:230) ]]
Caused by op 'save/RestoreV2', defined at:
File "model.py", line 505, in <module>
model.predict()
File "model.py", line 465, in predict
self.__creat_model()
File "model.py", line 58, in __creat_model
self.bert_optimizer_layer()
File "model.py", line 230, in bert_optimizer_layer
self.saver = tf.train.Saver(tf.global_variables(), max_to_keep=5)
File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 832, in __init__
self.build()
File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 844, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 881, in _build
build_save=build_save, build_restore=build_restore)
File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 513, in _build_internal
restore_sequentially, reshape)
File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 332, in _AddRestoreOps
restore_sequentially)
File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 580, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1572, in restore_v2
name=name)
File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
DataLossError (see above for traceback): Checksum does not match: stored 4265858936 vs. calculated on the restored bytes 771062205
[[node save/RestoreV2 (defined at model.py:230) ]]
[[node save/RestoreV2 (defined at model.py:230) ]]
self.saver = tf.train.Saver(tf.global_variables(), max_to_keep=5)
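I get the exception above when running prediction. How should I handle it?
I'm using a GTX 1060 with 6 GB of GPU memory. I just tested batch_size=4 and it also ran out of memory (OOM). Your batch_size is set to 5; which GPU were you using?
BERT is quite memory-hungry. I was using a 2080 at the time.
Hmm, I ran into another problem when training on my own annotated text. Could you take a look at what is causing this when you have time? Thanks.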
Traceback (most recent call last):
File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\client\session.py", line 1334, in _do_call
return fn(*args)
File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\client\session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\client\session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [] [Condition x <= y did not hold element-wise:x (bert/embeddings/strided_slice_3:0) = ] [531] [y (bert/embeddings/assert_less_equal/y:0) = ] [512]
[[{{node bert/embeddings/assert_less_equal/Assert/Assert}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "model.py", line 504, in <module>
model.train()
File "model.py", line 347, in train
sess, batch
File "model.py", line 273, in bert_step
[self.embedded, self.global_steps, self.loss, self.train_op, self.logits, self.accuracy, self.length], feed_dict=feed)
File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
run_metadata_ptr)
File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\client\session.py", line 1328, in _do_run
run_metadata)
File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\client\session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [] [Condition x <= y did not hold element-wise:x (bert/embeddings/strided_slice_3:0) = ] [531] [y (bert/embeddings/assert_less_equal/y:0) = ] [512]
[[node bert/embeddings/assert_less_equal/Assert/Assert (defined at D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\bert_base\bert\modeling.py:492) ]]
Caused by op 'bert/embeddings/assert_less_equal/Assert/Assert', defined at:
File "model.py", line 504, in <module>
model.train()
File "model.py", line 316, in train
self.__creat_model()
File "model.py", line 42, in __creat_model
self.bert_layer()
File "model.py", line 130, in bert_layer
use_one_hot_embeddings=False
File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\bert_base\bert\modeling.py", line 195, in __init__
dropout_prob=config.hidden_dropout_prob)
File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\bert_base\bert\modeling.py", line 492, in embedding_postprocessor
assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\ops\check_ops.py", line 868, in assert_less_equal
return control_flow_ops.Assert(condition, data, summarize=summarize)
File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\util\tf_should_use.py", line 193, in wrapped
return _add_should_use_warning(fn(*args, **kwargs))
File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\ops\control_flow_ops.py", line 160, in Assert
return gen_logging_ops._assert(condition, data, summarize, name="Assert")
File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\ops\gen_logging_ops.py", line 72, in _assert
name=name)
File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\framework\ops.py", line 3300, in create_op
op_def=op_def)
File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\framework\ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): assertion failed: [] [Condition x <= y did not hold element-wise:x (bert/embeddings/strided_slice_3:0) = ] [531] [y (bert/embeddings/assert_less_equal/y:0) = ] [512]
[[node bert/embeddings/assert_less_equal/Assert/Assert (defined at D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\bert_base\bert\modeling.py:492) ]]
BERT supports a maximum sequence length of 512. Your input length is 531, which exceeds that limit and triggers the error.
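To find the offending samples before training starts, a quick length check along these lines may help. This is only a sketch: `find_too_long` and `num_special_tokens` are hypothetical names that are not part of model.py, and it assumes roughly one token per character plus two positions for `[CLS]` and `[SEP]`, which may not match how the data loader in this repository builds its inputs.

```python
MAX_POSITION_EMBEDDINGS = 512  # limit reported in the assertion above


def find_too_long(samples, max_len=MAX_POSITION_EMBEDDINGS, num_special_tokens=2):
    """Return (index, length) pairs for samples that would exceed max_len.

    Assumption: one token per character, plus num_special_tokens
    positions for [CLS] and [SEP].
    """
    too_long = []
    for index, text in enumerate(samples):
        length = len(text) + num_special_tokens
        if length > max_len:
            too_long.append((index, length))
    return too_long
```

Running this over the training texts should list every sample that needs to be shortened or split.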
For long texts, is truncating the only option?
Yes. You can split the text into segments or into sentences.
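One possible way to do the splitting is sketched below. It packs whole sentences into chunks of at most 510 characters and hard-splits any single sentence that is still too long. The names `split_long_text` and `MAX_CHARS` are hypothetical, and the 510 budget assumes one token per character with two positions reserved for `[CLS]` and `[SEP]`; adjust it to whatever the tokenizer actually produces.

```python
import re

MAX_BERT_LEN = 512            # maximum sequence length mentioned above
MAX_CHARS = MAX_BERT_LEN - 2  # assumption: reserve 2 positions for [CLS] and [SEP]

# A "sentence" is a run of non-terminator characters followed by an
# optional terminator (Chinese or ASCII punctuation, or a newline).
_SENTENCE_RE = re.compile(r"[^。!?!?;;\n]*[。!?!?;;\n]?")


def split_long_text(text, max_chars=MAX_CHARS):
    """Split text into chunks of at most max_chars characters,
    breaking on sentence boundaries where possible."""
    sentences = [s for s in _SENTENCE_RE.findall(text) if s]
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the budget is cut at fixed offsets.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        # Start a new chunk when this sentence would overflow the current one.
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks
```

Joining the returned chunks reproduces the original text, so labels aligned to characters can be split at the same offsets.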
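So can it predict on long texts at all?
You can modify the model so that the segments are fed into BERT separately and the outputs are concatenated afterwards, or simply split the text into segments beforehand.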
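A minimal sketch of the simpler variant, splitting outside the model and concatenating the predictions, could look like this. It assumes a sequence-labeling setup in which each character gets one label, and `predict_chunk` is a hypothetical callable standing in for whatever prediction function you wrap around model.py; neither name exists in the repository.

```python
def predict_long_text(text, predict_chunk, max_chars=510):
    """Label a long text piece by piece and concatenate the results.

    predict_chunk is a hypothetical callable that takes a string of at
    most max_chars characters and returns one label per character.
    """
    labels = []
    for start in range(0, len(text), max_chars):
        piece = text[start:start + max_chars]
        piece_labels = list(predict_chunk(piece))
        if len(piece_labels) != len(piece):
            raise ValueError(
                "expected %d labels, got %d" % (len(piece), len(piece_labels))
            )
        labels.extend(piece_labels)
    return labels
```

Because the pieces are cut at fixed offsets, an entity that straddles a boundary may be labeled inconsistently. Cutting at sentence boundaries instead, or overlapping neighboring pieces, are possible ways to reduce that effect.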
OK, thanks. I'll look into it further.