zihangdai / xlnet

XLNet: Generalized Autoregressive Pretraining for Language Understanding
Apache License 2.0

AssertionError when pretraining XLNet with train_gpu.py #196

Open MissMcFly opened 5 years ago

MissMcFly commented 5 years ago

python train_gpu.py --corpus_info_path=G:/XLNetData/tftest/corpus_info.json --record_info_dir="G:/XLNetData/tftest/tfrecords" --model_dir="" --train_batch_size=8 --seq_len=128 --reuse_len=64 --mem_len=96 --perm_size=32 --n_layer=6 --d_model=1024 --d_embed=1024 --n_head=16 --d_head=64 --d_inner=4096 --untie_r=True --mask_alpha=6 --mask_beta=1 --num_predict=21 --uncased=true --num_hosts=1 --num_core_per_host=1

I got this error immediately:

I0730 15:41:08.028456 8780 tf_logging.py:115] n_token 32000
I0730 15:41:08.029418 8780 tf_logging.py:115] Use the following tfrecord dirs: ['G:/XLNetData/tftest/tfrecords']
I0730 15:41:08.030418 8780 tf_logging.py:115] [0] Record glob: G:/XLNetData/tftest/tfrecords\record_info-train-*.bsz-8.seqlen-128.reuse-64.uncased.bi.alpha-6.beta-1.fnp-21.json
I0730 15:41:08.033409 8780 tf_logging.py:115] [0] Num of record info path: 1
I0730 15:41:08.034406 8780 tf_logging.py:115] [Dir 0] Number of chosen batches: 97922
I0730 15:41:08.034406 8780 tf_logging.py:115] [Dir 0] Number of chosen files: 1
I0730 15:41:08.034406 8780 tf_logging.py:115] ['G:/XLNetData/tftest/tfrecords\train-0-0.bsz-8.seqlen-128.reuse-64.uncased.bi.alpha-6.beta-1.fnp-21.tfrecords']
I0730 15:41:08.034406 8780 tf_logging.py:115] Total number of batches: 97922
I0730 15:41:08.035402 8780 tf_logging.py:115] Total number of files: 1
I0730 15:41:08.035402 8780 tf_logging.py:115] ['G:/XLNetData/tftest/tfrecords\train-0-0.bsz-8.seqlen-128.reuse-64.uncased.bi.alpha-6.beta-1.fnp-21.tfrecords']
I0730 15:41:08.035402 8780 tf_logging.py:115] num of batches 97922
I0730 15:41:08.035402 8780 tf_logging.py:115] Host 0 handles 1 files
I0730 15:41:08.245867 8780 tf_logging.py:115] label: Tensor("Cast_6:0", shape=(1,), dtype=int32)
I0730 15:41:08.245867 8780 tf_logging.py:115] seg_id: Tensor("Cast_7:0", shape=(128,), dtype=int32)
I0730 15:41:08.247835 8780 tf_logging.py:115] target_mapping: Tensor("Reshape_4:0", shape=(21, 128), dtype=float32)
I0730 15:41:08.248839 8780 tf_logging.py:115] target: Tensor("Cast_8:0", shape=(21,), dtype=int32)
I0730 15:41:08.257808 8780 tf_logging.py:115] target_mask: Tensor("Reshape_6:0", shape=(21,), dtype=float32)
I0730 15:41:08.259802 8780 tf_logging.py:115] perm_mask: Tensor("Reshape_7:0", shape=(128, 128), dtype=float32)
I0730 15:41:08.266821 8780 tf_logging.py:115] input_k: Tensor("Cast_9:0", shape=(128,), dtype=int32)
I0730 15:41:08.270774 8780 tf_logging.py:115] input_q: Tensor("Reshape_9:0", shape=(128,), dtype=float32)
I0730 15:41:08.414390 8780 tf_logging.py:115] memory input [<tf.Tensor 'Placeholder:0' shape=(96, 8, 1024) dtype=float32>, <tf.Tensor 'Placeholder_1:0' shape=(96, 8, 1024) dtype=float32>, <tf.Tensor 'Placeholder_2:0' shape=(96, 8, 1024) dtype=float32>, <tf.Tensor 'Placeholder_3:0' shape=(96, 8, 1024) dtype=float32>, <tf.Tensor 'Placeholder_4:0' shape=(96, 8, 1024) dtype=float32>, <tf.Tensor 'Placeholder_5:0' shape=(96, 8, 1024) dtype=float32>]
I0730 15:41:08.414390 8780 tf_logging.py:115] Use float type <dtype: 'float32'>

Traceback (most recent call last):
  File "train_gpu.py", line 328, in <module>
    tf.app.run()
  File "D:\anconda\install\envs\TensorflowGpu36\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
    _sys.exit(main(argv))
  File "train_gpu.py", line 324, in main
    train("/gpu:0")
  File "train_gpu.py", line 241, in train
    mems=mems_i)
  File "train_gpu.py", line 162, in single_core_graph
    is_training=is_training)
  File "train_gpu.py", line 138, in model_fn
    FLAGS, features, labels, mems, is_training)
  File "E:\PyProject\xlnet-master0719\function_builder.py", line 130, in get_loss
    return two_stream_loss(FLAGS, features, labels, mems, is_training)
  File "E:\PyProject\xlnet-master0719\function_builder.py", line 90, in two_stream_loss
    inp_q=inp_q)
  File "E:\PyProject\xlnet-master0719\xlnet.py", line 222, in __init__
    ) = modeling.transformer_xl(**tfm_args)
  File "E:\PyProject\xlnet-master0719\modeling.py", line 567, in transformer_xl
    bsz=bsz, dtype=tf_float)
  File "E:\PyProject\xlnet-master0719\modeling.py", line 236, in relative_positional_encoding
    assert bsz%2 == 0
AssertionError

What should I do? Thank you!

langfield commented 5 years ago

Can you print out inp_k (and its shape) somewhere and post the output? Maybe in the two_stream_loss function on line 44 of function_builder.py?
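For reference, a minimal sketch (assuming TF 1.x graph mode, not code from the repo) of how inp_k could be inspected inside two_stream_loss. The features placeholder below is a stand-in for the dict that function_builder.py actually receives from the input pipeline:

    import tensorflow as tf

    # stand-in for the features dict the input pipeline provides (bsz x seq_len)
    features = {"input_k": tf.placeholder(tf.int32, shape=[8, 128])}
    inp_k = tf.transpose(features["input_k"], [1, 0])  # roughly what two_stream_loss builds

    # The static shape is known while the graph is being built, so this prints immediately:
    print("static shape of inp_k:", inp_k.get_shape().as_list())  # -> [128, 8]

    # The values only exist at session-run time; a bare print(inp_k) just shows the
    # symbolic Tensor object. tf.Print attaches a runtime print op to the graph:
    inp_k = tf.Print(inp_k, [tf.shape(inp_k)], message="runtime shape of inp_k: ")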

MissMcFly commented 5 years ago

@brendanxwhitaker

I added “print(inp_k)” and “print(inp_k.shape)” in the two_stream_loss function, but nothing was printed. I got this information:

I0730 15:41:08.266821 8780 tf_logging.py:115] input_k: Tensor("Cast_9:0", shape=(128,), dtype=int32)
I0730 15:41:08.270774 8780 tf_logging.py:115] input_q: Tensor("Reshape_9:0", shape=(128,), dtype=float32)

langfield commented 5 years ago

That’s interesting. Sorry, I probably should have specified: you might want to print the evaluated tensor, or cast it to a numpy array so it prints nicely.

Is that function being called at all? Does a generic print statement execute? I was following your stack trace and I thought I saw a call to that function.

I’m trying to figure out whether something funky is happening when the inputs are reshaped after being pulled from the feature dict. I think they’re flattened before being written to the tfrecords, and then unflattened again for training. The unflattened shape of inp_k is used to set bsz, which somehow ends up odd as far as that assert statement is concerned (if it’s an integer at all).

But it doesn’t throw an error? So the variable exists. Perhaps it’s None.
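A small sketch (assuming TF 1.x graph mode) of the scenario being described here: if bsz is set from the dynamic shape, it is a Tensor, and the divisibility check is never actually evaluated, so the assert fails no matter what the real batch size is:

    import tensorflow as tf

    inp_k = tf.placeholder(tf.int32, shape=[128, None])  # seq_len x bsz, batch dim left dynamic

    bsz = tf.shape(inp_k)[1]   # a scalar Tensor; its value is unknown until session run time
    print(type(bsz % 2))       # a Tensor, not an int
    print((bsz % 2) == 0)      # False in TF 1.x graph mode, even when the real batch size is even
    # assert bsz % 2 == 0      # raises AssertionError, matching the traceback above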

ft3020997 commented 5 years ago

@brendanxwhitaker

I added “print(inp_k)” and “print(inp_k.shape)” in the two_stream_loss function, but nothing was printed. I got this information:

I0730 15:41:08.266821 8780 tf_logging.py:115] input_k: Tensor("Cast_9:0", shape=(128,), dtype=int32)
I0730 15:41:08.270774 8780 tf_logging.py:115] input_q: Tensor("Reshape_9:0", shape=(128,), dtype=float32)

When I used TF1.4 and PY3.7, I ran into the same issue and fixed it as follows: in modeling.py, line 470, I changed “bsz = tf.shape(inp_k)[1]” to “bsz = inp_k.get_shape()[1]”.

And it works.
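For anyone following along, a sketch of why that one-line change helps (the exact line number may differ between checkouts). It assumes the batch dimension of inp_k is statically known, e.g. fixed by --train_batch_size; if that dimension were None, get_shape() would not have a concrete value to offer:

    import tensorflow as tf

    inp_k = tf.placeholder(tf.int32, shape=[128, 8])  # stand-in input, seq_len x bsz

    # before: bsz is a scalar Tensor whose value only exists at run time
    bsz = tf.shape(inp_k)[1]

    # after: bsz is a static tf.Dimension carrying the concrete value 8
    bsz = inp_k.get_shape()[1]
    assert bsz % 2 == 0   # passes: Dimension arithmetic compares like an ordinary int here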

MissMcFly commented 5 years ago

@ft3020997 Following your method, it works! Thank you very much.

illuminascent commented 5 years ago

You are getting this error because the assertion is not implemented properly. bsz in relative_positional_encoding is inferred from the dynamic shape of the input, which makes it a tensor rather than a Python integer. Comparing a tensor expression like bsz % 2 against 0 is not evaluated at graph-construction time in TF 1.x, so the assertion fails every single time, regardless of the actual batch size. You can either delete the assert entirely, or rewrite it using tf.assert_equal and tf.control_dependencies. If you do delete it, make sure the per-core batch size is divisible by 2, or strange errors will start to show up.
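A sketch of the tf.assert_equal / tf.control_dependencies rewrite mentioned above, assuming bsz may be a tensor at that point; check_even_bsz and its arguments are illustrative names, not code from the repo:

    import tensorflow as tf

    def check_even_bsz(bsz, pos_emb):
        # build a runtime check instead of a Python-time assert
        assert_op = tf.assert_equal(
            bsz % 2, 0, message="per-core batch size must be divisible by 2")
        # tying the check in via control_dependencies makes it run before pos_emb is consumed
        with tf.control_dependencies([assert_op]):
            return tf.identity(pos_emb)

With the check expressed as an op, the graph fails at run time with the given message when the batch size is odd, instead of failing unconditionally while the graph is being built.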