xuyige / BERT4doc-Classification

Code and source for paper ``How to Fine-Tune BERT for Text Classification?``
Apache License 2.0

OOM when batchSize=1 #13

Open chen3082 opened 3 years ago

chen3082 commented 3 years ago

Hi, thanks for your great work. When running run_pretraining.py, I keep getting OOM errors whatever the size of the matrix. I have already reduced the batch size to 1, but that didn't help. I'm using a 960M, tensorflow-gpu 1.10, and CUDA toolkit 9.0. Which version of TensorFlow are you using? Any thoughts on this issue? Thanks in advance.

addiu commented 3 years ago

Hi, I tried run_pretraining.py recently and it works fine for me. I'm using tensorflow-gpu=1.15.0 and cudatoolkit=10.0. First, I think the 960M has very limited VRAM, which could be causing your issue. Second, make sure you use the same settings when running create_pretraining_data.py and run_pretraining.py. I once set max_seq_length=512 in create_pretraining_data.py but max_seq_length=128 in run_pretraining.py. That also breaks the code, though not with an OOM error, I think.
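The settings mismatch described above can be caught before launching anything. The following is only a minimal sketch: the helper names `get_max_seq_length` and `check_consistent` are hypothetical and are not part of this repo or of the official BERT code. It assumes both scripts receive the sequence length through a `--max_seq_length` flag, as in the comment above.

```python
import argparse


def get_max_seq_length(argv):
    """Extract --max_seq_length from a list of command-line arguments.

    Hypothetical helper: returns None when the flag is absent and
    ignores every other flag.
    """
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--max_seq_length", type=int, default=None)
    known, _unknown = parser.parse_known_args(argv)
    return known.max_seq_length


def check_consistent(data_argv, train_argv):
    """Raise ValueError if the two commands disagree on max_seq_length."""
    data_len = get_max_seq_length(data_argv)
    train_len = get_max_seq_length(train_argv)
    if data_len != train_len:
        raise ValueError(
            "max_seq_length mismatch: data creation used %s, pretraining used %s"
            % (data_len, train_len)
        )
    return data_len
```

For example, `check_consistent(["--max_seq_length=512"], ["--max_seq_length=128"])` raises a `ValueError`, which is the situation described in the comment above.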

xuyige commented 3 years ago

Sorry for the late answer.

As noted above, the 960M may have very limited memory. A GPU with 12 GB of memory can only fit batch size=6 when max_seq_len=512, so please reduce your max sequence length or upgrade your GPU. Thank you!
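To illustrate why reducing the sequence length helps so much, here is a rough sketch that counts the elements in the self-attention score tensors, which grow with the square of the sequence length. It assumes a BERT-base shape (12 layers, 12 heads), and the function name is hypothetical. It covers only one component of training memory, so it is not an estimate of total GPU usage.

```python
def attention_score_elements(batch_size, seq_len, num_heads=12, num_layers=12):
    """Count elements in the attention score tensors across all layers.

    Each layer holds one [batch, heads, seq_len, seq_len] score tensor.
    The defaults assume BERT-base (12 heads, 12 layers); weights,
    gradients, optimizer state and other activations are not counted.
    """
    return batch_size * num_heads * num_layers * seq_len * seq_len


long_seq = attention_score_elements(batch_size=1, seq_len=512)
short_seq = attention_score_elements(batch_size=1, seq_len=128)

# Quadratic growth: 4x the sequence length gives 16x the elements.
print(long_seq // short_seq)  # 16
```

By this count alone, dropping max_seq_length from 512 to 128 shrinks the attention score tensors by a factor of 16 at the same batch size.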

xuyige commented 3 years ago

Thank you for raising this issue.

Could you please share more details about your error? Also, I have forgotten which version of TensorFlow we used, but following the official BERT repo, I suggest trying a different TensorFlow version (the official repo lists tensorflow-gpu >= 1.11.0, so 1.11 or 1.12 may solve your problem).