zihangdai / xlnet

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Cache problem during pretraining #163

Open rkcalnode opened 5 years ago

rkcalnode commented 5 years ago

During pretraining, the following error occurs right after a checkpoint is saved.

I0712 06:47:22.892611 140596004366080 tf_logging.py:115] [99000] | gnorm 0.71 lr 0.000001 | loss 7.25 | pplx 1408.25, bpc 10.4597
I0712 07:13:05.624328 140596004366080 tf_logging.py:115] [100000] | gnorm 1.03 lr 0.000000 | loss 7.25 | pplx 1406.88, bpc 10.4583
I0712 07:13:34.885596 140596004366080 tf_logging.py:115] Model saved in path: /home/xlnet_exam/models_wiki_ja/model.ckpt
2019-07-12 07:13:34.961923: W tensorflow/core/kernels/data/cache_dataset_ops.cc:770] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
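For reference, a toy sketch of the two orderings the warning text refers to (the Dataset.range pipeline below is made up for illustration only, it is not the actual input_fn from this repo):

import tensorflow as tf  # TF 1.x, matching my environment below

# Toy pipeline, used only to illustrate the warning message.
ds = tf.data.Dataset.range(100)

# Pattern the warning flags: the cache wraps more elements than are ever
# read, so the partially filled cache is discarded whenever the iterator is
# destroyed before a full pass.
flagged = ds.cache().take(10).repeat()

# Ordering the warning suggests: only the elements that are actually used
# get cached, so the cache can be completely filled on the first pass.
suggested = ds.take(10).cache().repeat()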

In data_utils.py, the dataset cache order already seems to match what the warning suggests, though a note in the code explains the choice:

...
  # (zihang): since we are doing online preprocessing, the parsed result of
  # the same input will be different each time. Thus, caching the processed
  # data is not helpful: it would use a lot of memory and lead to container
  # OOM. So, change to cache the non-parsed raw data instead.
  dataset = dataset.cache().map(parser).repeat()
  dataset = dataset.batch(bsz_per_core, drop_remainder=True)
  dataset = dataset.prefetch(num_core_per_host * bsz_per_core)
...
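In case it helps anyone experimenting, here is a hypothetical variant of that block with the raw-data cache made optional; use_cache is a made-up flag, not an option that exists in the repo. With the cache disabled the warning should not appear, at the cost of re-reading the TFRecords every epoch:

  # Hypothetical tweak (not from the repo): guard the raw-data cache behind a
  # made-up `use_cache` flag so it can be switched off when the partial-cache
  # warning or memory usage is a concern.
  if use_cache:
    dataset = dataset.cache()
  dataset = dataset.map(parser).repeat()
  dataset = dataset.batch(bsz_per_core, drop_remainder=True)
  dataset = dataset.prefetch(num_core_per_host * bsz_per_core)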

My environment:

GPU: Tesla V100 32GB *4
CUDA_VERSION: 9.0.176
TENSORFLOW_VERSION: 1.11.0

My pretraining command:

python train_gpu.py \
      --record_info_dir=${TFRECORD_DIR} \
      --num_core_per_host=1 \
      --train_batch_size=4 \
      --save_steps=10000 \
      --model_dir=${MODEL_DIR} \
      --seq_len=512 \
      --reuse_len=256 \
      --mem_len=384 \
      --perm_size=256 \
      --n_layer=24 \
      --d_model=1024 \
      --d_embed=1024 \
      --n_head=16 \
      --d_head=64 \
      --d_inner=4096 \
      --untie_r=True \
      --mask_alpha=6 \
      --mask_beta=1 \
      --num_predict=85 \
      --uncased=True

Is this okay, or is it a real problem?

ymcui commented 5 years ago

Maybe you should try a newer TF. The official implementation used TensorFlow 1.13.1.

rkcalnode commented 5 years ago

@ymcui Thanks a lot! I'll try with TensorFlow 1.13.1.

rkcalnode commented 5 years ago

I tried with TensorFlow 1.13.1 and 1.4.0, but it still occurs, and the output files are broken.

freefuiiismyname commented 5 years ago

Same problem. :(

xavinatalia commented 4 months ago

Same problem... did you solve it?