Error when training on GPU, please help

zdx1012 commented 5 years ago

Caused by op 'bert/encoder/layer_8/attention/self/Mul', defined at:
  File "model.py", line 503, in <module>
    model.train()
  File "model.py", line 316, in train
    self.__creat_model()
  File "model.py", line 42, in __creat_model
    self.bert_layer()
  File "model.py", line 130, in bert_layer
    use_one_hot_embeddings=False
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/bert_base/bert/modeling.py", line 217, in __init__
    do_return_all_layers=True)
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/bert_base/bert/modeling.py", line 846, in transformer_model
    to_seq_length=seq_length)
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/bert_base/bert/modeling.py", line 705, in attention_layer
    1.0 / math.sqrt(float(size_per_head)))
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 248, in multiply
    return gen_math_ops.mul(x, y, name)
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 5860, in mul
    "Mul", x=x, y=y, name=name)
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[5,12,484,484] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node bert/encoder/layer_8/attention/self/Mul (defined at /home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/bert_base/bert/modeling.py:705) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[node logits/Reshape (defined at model.py:199) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


yanwii commented 5 years ago

You ran out of memory. Try a smaller batch_size.
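For a rough sense of why batch size matters here, the tensor named in the OOM message has shape [5, 12, 484, 484] and type float, so its size grows linearly with batch size and quadratically with sequence length. The sketch below only does that arithmetic. The helper name `attention_scores_bytes` is made up for illustration, and the figure covers this single tensor, not the total training footprint.

```python
def attention_scores_bytes(batch_size, num_heads, seq_length, bytes_per_element=4):
    """Bytes taken by one [batch, heads, seq, seq] tensor (4 bytes per float32 element)."""
    return batch_size * num_heads * seq_length * seq_length * bytes_per_element


if __name__ == "__main__":
    # Shape taken from the OOM message: [5, 12, 484, 484], float32.
    for batch_size in (5, 3, 1):
        size = attention_scores_bytes(batch_size, num_heads=12, seq_length=484)
        print(f"batch_size={batch_size}: {size / 2**20:.1f} MiB for one attention-score tensor")
```

Because the sequence length enters squared, halving it shrinks this tensor to a quarter, so shortening the inputs can help more than lowering batch_size alone.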

zdx1012 commented 5 years ago

It's the GPU memory that ran out, right? Where do I set batch_size?

yanwii commented 5 years ago

In model.py, line 280 or line 285, depending on which model you are using.

zdx1012 commented 5 years ago

        self.train_data = BertDataUtils(tokenizer, batch_size=5)
        self.dev_data = BertDataUtils(tokenizer, batch_size=10)

zdx1012 commented 5 years ago

I'll change it to 3 and try.

zdx1012 commented 5 years ago

2019-06-24 14:57:37.585939: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5706 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
WARNING:tensorflow:From model.py:468: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
[->] restore model
2019-06-24 14:57:38.644984: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Data loss: Checksum does not match: stored 4265858936 vs. calculated on the restored bytes 771062205
Traceback (most recent call last):
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.DataLossError: Checksum does not match: stored 4265858936 vs. calculated on the restored bytes 771062205
     [[{{node save/RestoreV2}}]]
     [[{{node save/RestoreV2}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "model.py", line 505, in <module>
    model.predict()
  File "model.py", line 470, in predict
    self.saver.restore(sess, ckpt.model_checkpoint_path)
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1276, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: Checksum does not match: stored 4265858936 vs. calculated on the restored bytes 771062205
     [[node save/RestoreV2 (defined at model.py:230) ]]
     [[node save/RestoreV2 (defined at model.py:230) ]]

Caused by op 'save/RestoreV2', defined at:
  File "model.py", line 505, in <module>
    model.predict()
  File "model.py", line 465, in predict
    self.__creat_model()
  File "model.py", line 58, in __creat_model
    self.bert_optimizer_layer()
  File "model.py", line 230, in bert_optimizer_layer
    self.saver = tf.train.Saver(tf.global_variables(), max_to_keep=5)
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 832, in __init__
    self.build()
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 844, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 881, in _build
    build_save=build_save, build_restore=build_restore)
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 513, in _build_internal
    restore_sequentially, reshape)
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 332, in _AddRestoreOps
    restore_sequentially)
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 580, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1572, in restore_v2
    name=name)
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/home/.virtualenvs/ten_gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

DataLossError (see above for traceback): Checksum does not match: stored 4265858936 vs. calculated on the restored bytes 771062205
     [[node save/RestoreV2 (defined at model.py:230) ]]
     [[node save/RestoreV2 (defined at model.py:230) ]]

zdx1012 commented 5 years ago

self.saver = tf.train.Saver(tf.global_variables(), max_to_keep=5)

I get the exception above when running prediction. How should I handle it?

zdx1012 commented 5 years ago

I'm using the GTX 1060 with 6 GB of GPU memory. I just tried batch_size=4 and it also hit OOM. Your batch_size is set to 5; which GPU were you using?

yanwii commented 5 years ago

BERT is quite memory-hungry. I was using a 2080 at the time.

zdx1012 commented 5 years ago

Hmm, I ran into another problem when using my own annotated text. Could you take a look at what causes this when you have time? Thanks.

Traceback (most recent call last):
  File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\client\session.py", line 1334, in _do_call
    return fn(*args)
  File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\client\session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\client\session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [] [Condition x <= y did not hold element-wise:x (bert/embeddings/strided_slice_3:0) = ] [531] [y (bert/embeddings/assert_less_equal/y:0) = ] [512]
         [[{{node bert/embeddings/assert_less_equal/Assert/Assert}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "model.py", line 504, in <module>
    model.train()
  File "model.py", line 347, in train
    sess, batch
  File "model.py", line 273, in bert_step
    [self.embedded, self.global_steps, self.loss, self.train_op, self.logits, self.accuracy, self.length], feed_dict=feed)
  File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
    run_metadata_ptr)
  File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\client\session.py", line 1328, in _do_run
    run_metadata)
  File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\client\session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [] [Condition x <= y did not hold element-wise:x (bert/embeddings/strided_slice_3:0) = ] [531] [y (bert/embeddings/assert_less_equal/y:0) = ] [512]
         [[node bert/embeddings/assert_less_equal/Assert/Assert (defined at D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\bert_base\bert\modeling.py:492) ]]

Caused by op 'bert/embeddings/assert_less_equal/Assert/Assert', defined at:
  File "model.py", line 504, in <module>
    model.train()
  File "model.py", line 316, in train
    self.__creat_model()
  File "model.py", line 42, in __creat_model
    self.bert_layer()
  File "model.py", line 130, in bert_layer
    use_one_hot_embeddings=False
  File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\bert_base\bert\modeling.py", line 195, in __init__
    dropout_prob=config.hidden_dropout_prob)
  File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\bert_base\bert\modeling.py", line 492, in embedding_postprocessor
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
  File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\ops\check_ops.py", line 868, in assert_less_equal
    return control_flow_ops.Assert(condition, data, summarize=summarize)
  File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\util\tf_should_use.py", line 193, in wrapped
    return _add_should_use_warning(fn(*args, **kwargs))
  File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\ops\control_flow_ops.py", line 160, in Assert
    return gen_logging_ops._assert(condition, data, summarize, name="Assert")
  File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\ops\gen_logging_ops.py", line 72, in _assert
    name=name)
  File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\framework\ops.py", line 3300, in create_op
    op_def=op_def)
  File "D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\tensorflow\python\framework\ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): assertion failed: [] [Condition x <= y did not hold element-wise:x (bert/embeddings/strided_slice_3:0) = ] [531] [y (bert/embeddings/assert_less_equal/y:0) = ] [512]
         [[node bert/embeddings/assert_less_equal/Assert/Assert (defined at D:\ProgramFiles\Anaconda3\envs\tf\lib\site-packages\bert_base\bert\modeling.py:492) ]]

yanwii commented 5 years ago

BERT supports a maximum sequence length of 512. Your input is 531 characters long, which exceeds that limit and triggers the error.

zdx1012 commented 5 years ago

For long texts, is truncating my only option?

yanwii commented 5 years ago

Yes. You can split the text into paragraphs or sentences.
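A minimal sketch of sentence-level splitting, assuming character-level input. The function name `split_text`, the punctuation set, and the default of 510 characters (512 minus two slots, in case the data pipeline adds two special tokens) are all assumptions for illustration, not something taken from this repository.

```python
import re

# A "sentence" is a run of characters ending in one of these marks, or a trailing
# run without one. The punctuation set is an assumption; adjust it for your data.
_SENTENCE = re.compile(r"[^。！？!?；;\n]*[。！？!?；;\n]|[^。！？!?；;\n]+")


def split_text(text, max_chars=510):
    """Split text into chunks of at most max_chars characters, preferring sentence boundaries."""
    chunks, current = [], ""
    for sentence in _SENTENCE.findall(text):
        # A single sentence longer than the limit is cut into fixed-size slices.
        parts = [sentence[i:i + max_chars] for i in range(0, len(sentence), max_chars)]
        for part in parts:
            if len(current) + len(part) > max_chars:
                chunks.append(current)  # current chunk is full, start a new one
                current = ""
            current += part
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "今天天气很好。" * 100  # 700 characters
    for chunk in split_text(sample):
        print(len(chunk))
```

Joining the chunks reproduces the original text, so character offsets can still be mapped back after labeling each chunk separately.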

zdx1012 commented 5 years ago

So can it run prediction on long texts?

yanwii commented 5 years ago

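You can modify the model so that the text is fed into BERT in segments and the outputs are concatenated, or simply split the text into segments yourself.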
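A rough sketch of the simpler of the two options: cut the text into fixed-size windows, predict on each window, and concatenate the labels. It does not change the model itself. `predict_fn` is a hypothetical stand-in for whatever function returns one label per character, and `predict_long_text` and the 510-character default are likewise assumptions.

```python
from typing import Callable, List


def predict_long_text(text: str,
                      predict_fn: Callable[[str], List[str]],
                      max_chars: int = 510) -> List[str]:
    """Label a long text by predicting on fixed-size windows and concatenating the labels."""
    labels: List[str] = []
    for start in range(0, len(text), max_chars):
        window = text[start:start + max_chars]
        window_labels = predict_fn(window)
        if len(window_labels) != len(window):
            raise ValueError("predict_fn must return one label per character")
        labels.extend(window_labels)
    return labels


if __name__ == "__main__":
    # Dummy predictor for demonstration only: tags every character as "O".
    def dummy(segment: str) -> List[str]:
        return ["O"] * len(segment)

    print(len(predict_long_text("字" * 531, dummy)))
```

Fixed windows can cut an entity in half at a boundary, so cutting at sentence ends or using overlapping windows is usually the safer choice.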

zdx1012 commented 5 years ago

OK, thanks. I'll look into it further.

yanwii / ChineseNER

Error when training on GPU, please help #7