zihangdai / xlnet

XLNet: Generalized Autoregressive Pretraining for Language Understanding

OOM even with batch size 2 in train_gpu.py #268

Closed eddatt closed 4 years ago

eddatt commented 4 years ago

My environment: Python 3 + TF 1.15 + 4x RTX 2080 Ti (11 GB each).

I want to train XLNet from scratch, but I get OOM with batch sizes 32, 16, 8, and 2 at seq_len=512.

Here are the commands I run:

python data_utils.py --bsz_per_host=16 --num_core_per_host=8 --seq_len=512 --reuse_len=256 --input_glob=data/pubmed_300w_line.txt --save_dir=untfrec16 --num_passes=20 --bi_data=True --sp_path=data/pub3m_cased.model --mask_alpha=6 --mask_beta=1 --num_predict=85 --uncased=False --use_tpu=False

python train_gpu.py --record_info_dir=/data/untfrec16/tfrecords/ --model_dir=/data/model/ --train_batch_size=16 --seq_len=512 --reuse_len=256 --mem_len=384 --perm_size=256 --n_layer=24 --d_model=1024 --d_embed=1024 --n_head=16 --d_head=64 --d_inner=4096 --untie_r=True --mask_alpha=6 --mask_beta=1 --num_predict=85 --save_steps=10000
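
For context, a rough back-of-the-envelope estimate of the fixed memory cost of this configuration (a sketch only: the vocabulary size is my assumption, since the SentencePiece model's size is not shown, and layer norms and small biases are ignored; the hyperparameters are essentially the Large configuration):

# Rough parameter / optimizer-state estimate for the configuration above.
# vocab_size is an assumption: the SentencePiece model's size is not shown here.
vocab_size = 32000                    # assumed
n_layer, d_model, n_head, d_head, d_inner = 24, 1024, 16, 64, 4096

embedding = vocab_size * d_model
per_layer = (
    4 * d_model * n_head * d_head     # q, k, v, o projections
    + d_model * n_head * d_head       # relative positional (r) projection
    + 2 * d_model * d_inner           # two feed-forward matrices
    + 3 * n_head * d_head             # r_w / r_r / r_s biases (untie_r=True makes them per layer)
)                                     # layer norms and small biases ignored
params = embedding + n_layer * per_layer

weights_gb = params * 4 / 2**30       # float32 weights
adam_gb = 2 * weights_gb              # two Adam-style moment slots per weight
print(f"~{params / 1e6:.0f}M params, ~{weights_gb:.1f} GB weights, ~{adam_gb:.1f} GB optimizer state")
# -> ~360M params, ~1.3 GB weights, ~2.7 GB optimizer state

If the weights and the optimizer's two moment slots end up resident on each GPU (I have not checked how train_gpu.py places variables), that is roughly 4 GB per card before any activations at seq_len=512 with mem_len=384, so 11 GB does not leave much headroom regardless of batch size.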

and I got this:

2020-06-15 13:25:26.120816: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 134217728 totalling 128.00MiB
2020-06-15 13:25:26.120856: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 221800704 totalling 211.53MiB
2020-06-15 13:25:26.120889: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 10.05GiB
2020-06-15 13:25:26.120924: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 10813302272 memory_limit_: 10813302375 available bytes: 103 curr_region_allocation_bytes_: 17179869184
2020-06-15 13:25:26.120964: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats: 
Limit:                 10813302375
InUse:                 10796987136
MaxInUse:              10797147392
NumAllocs:                    3723
MaxAllocSize:            221800704

2020-06-15 13:25:26.121263: W tensorflow/core/common_runtime/bfc_allocator.cc:424] ****************************************************************************************************
2020-06-15 13:25:26.121329: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at transpose_op.cc:198 : Resource exhausted: OOM when allocating tensor with shape[512,896,2,16] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2020-06-15 13:25:26.328082: W tensorflow/core/kernels/data/cache_dataset_ops.cc:824] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
Traceback (most recent call last):
  File "/data/home/zhushanfeng/anaconda3/envs/TXLNet3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/data/home/zhushanfeng/anaconda3/envs/TXLNet3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/data/home/zhushanfeng/anaconda3/envs/TXLNet3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[512,1408,2,16] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node model/transformer/layer_12/rel_attn_1/einsum_3/transpose_2}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[model/transformer/StopGradient_17/_95]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[512,1408,2,16] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node model/transformer/layer_12/rel_attn_1/einsum_3/transpose_2}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.
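
The hint repeated in the log refers to the TF 1.x RunOptions proto. A minimal sketch of what enabling it looks like with a plain session (train_gpu.py drives training through its own loop, so where exactly to thread the options through there is something to check; the tiny graph below is just a stand-in):

import tensorflow as tf  # TF 1.15, matching the environment above

# With this flag set, an OOM dumps the list of live tensors
# instead of just the shape that failed to allocate.
run_opts = tf.compat.v1.RunOptions(report_tensor_allocations_upon_oom=True)

# Tiny stand-in graph; in train_gpu.py the fetch would be the real train op.
x = tf.compat.v1.placeholder(tf.float32, [None, 4])
y = tf.reduce_sum(tf.square(x))

with tf.compat.v1.Session() as sess:
    sess.run(y, feed_dict={x: [[1., 2., 3., 4.]]}, options=run_opts)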

Well, I tried smaller batch sizes (8 and 2) in both data_utils.py and train_gpu.py and got the same error. Whatever the batch size, I see the same curr_region_allocation_bytes_: 17179869184. After trying 4 different batch sizes I am tired and don't want to try a smaller seq_len, so I wonder whether there is a memory leak.
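
Doing the arithmetic on the failing allocation and on that allocator counter (my reading of the BFC allocator's output, not an authoritative diagnosis):

# The tensor the allocator failed on, to put its size in perspective
# (shape [512, 1408, 2, 16], float32 -> 4 bytes per element):
print(512 * 1408 * 2 * 16 * 4 / 2**20, "MiB")    # 88.0 MiB for this one tensor

# curr_region_allocation_bytes_ is the BFC allocator's *next* region-size target,
# which it keeps doubling as it grows; 17179869184 == 2**34 (16 GiB) is simply
# the first doubling past the ~10.8 GB limit, so seeing the same value at every
# batch size is expected and is not, by itself, evidence of a leak.
print(17179869184 == 2**34)                      # True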

If anyone has succeeded in training on an 11 GB GPU, please tell me more about your running configuration; most environments I have seen in the issues use 32 GB GPUs.

Turns out batch size 2 requires setting the core count (num_core_per_host) to 1.
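
As I understand that last comment: the per-host batch is split evenly across the cores/GPUs, so the batch size has to be divisible by the core count, and a batch of 2 therefore means dropping num_core_per_host to 1. An illustrative check of that constraint (not the repo's exact code):

# Illustrative per-host -> per-core batch split; just the divisibility
# constraint the comment above refers to, not the repo's exact code.
def bsz_per_core(train_batch_size, num_core_per_host):
    assert train_batch_size % num_core_per_host == 0, \
        "the per-host batch is split evenly across cores/GPUs"
    return train_batch_size // num_core_per_host

print(bsz_per_core(16, 8))   # 2 examples per GPU -- the failed run above
print(bsz_per_core(2, 1))    # 2 examples on a single GPU -- "batch 2, core 1"
# bsz_per_core(2, 8) would fail: 2 is not divisible by 8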