zihangdai / xlnet

XLNet: Generalized Autoregressive Pretraining for Language Understanding
Apache License 2.0

Fine Tuning - SQuAD 2.0 on GPU 8 GB #64

Open renatoviolin opened 5 years ago

renatoviolin commented 5 years ago

I've been fine-tuning the original XLNet model to run on an 8 GB GPU (RTX 2080).

The modifications make it possible to fine-tune the model on the SQuAD 2.0 dataset, achieving an 86.25 F1 score.

Here's the forked repo with the changes: https://github.com/renatoviolin/xlnet

kimiyoung commented 5 years ago

Nice results! Just added a pointer to your repo.

shawei3000 commented 5 years ago

@renatoviolin, I went through the updated files; could you tell me where/how you froze layers 1~11? Thanks!

renatoviolin commented 5 years ago

@shawei3000

In this file https://github.com/renatoviolin/xlnet/blob/master/model_utils_GPU.py

line 142: variables = variables[-177:]

I get all trainable variables and reassign only the last 177. The print() on the following lines outputs all the trainable variables, so you can see exactly which ones they are.

For example, if you want to train one more layer, i.e. layers 11, 12, ..., you need to specify

variables = variables[-190:]

Each layer has 13 trainable variables, so each block of 13 moves the cutoff down or up by one layer.
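For reference, here is a minimal sketch of that freezing pattern (TF 1.x; the dummy model and loss are only illustrative stand-ins, not the actual code in model_utils_GPU.py):

```python
import tensorflow as tf

VARS_PER_LAYER = 13  # each XLNet layer contributes 13 trainable variables

# Stand-in for the XLNet graph: any model that registers trainable variables.
hidden = tf.get_variable("layer/kernel", shape=[4, 4])
total_loss = tf.reduce_sum(tf.square(hidden))

variables = tf.trainable_variables()
# Keep only the last 177 trainable variables; everything earlier stays frozen
# because it is never handed to the optimizer.
variables = variables[-177:]
for v in variables:
    print(v.name)  # inspect exactly which variables will be trained

optimizer = tf.train.AdamOptimizer(learning_rate=3e-5)
grads = tf.gradients(total_loss, variables)
train_op = optimizer.apply_gradients(zip(grads, variables))
```

Because the optimizer only creates slots (e.g. Adam moments) for the variables it receives, freezing the lower layers also reduces memory use, which helps with the 8 GB budget.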

yangapku commented 5 years ago

Hi @renatoviolin, how do I activate FP16 training in your modified code?

renatoviolin commented 5 years ago

Hi @yangapku, in the file https://github.com/renatoviolin/xlnet/blob/master/model_utils_GPU.py, lines 147-166 contain the code that enables FP16. It is already hard-coded to True on line 147.

yangapku commented 5 years ago

@renatoviolin

Thank you very much! A further question: if I change the training and warmup steps in the config, should I also change the parameters of tf.contrib.mixed_precision.ExponentialUpdateLossScaleManager (line 150) at the same time?

renatoviolin commented 5 years ago

@yangapku I haven't done that yet. In my tests I kept those parameters unchanged, so I could track the improvements I made to the network without modifying the parameters XLNet was trained with.

The parameters on line 150 are not related to the ones defined in FLAGS; I adapted those lines from an NVIDIA FP16 implementation.

Let me know if you achieve better results with other parameter values.
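For anyone wiring this up from scratch, the pattern is roughly the following (a minimal sketch of the TF 1.x contrib API; the loss-scale values and the dummy loss are illustrative, not necessarily the exact numbers on line 150):

```python
import tensorflow as tf

# Dummy loss standing in for the SQuAD fine-tuning loss.
w = tf.get_variable("w", shape=[], initializer=tf.zeros_initializer())
total_loss = tf.square(w - 1.0)

optimizer = tf.train.AdamOptimizer(learning_rate=3e-5)

# Dynamic loss scaling: start with a large scale, shrink it when gradients
# overflow in FP16, and grow it again after enough overflow-free steps.
loss_scale_manager = tf.contrib.mixed_precision.ExponentialUpdateLossScaleManager(
    init_loss_scale=2 ** 32,
    incr_every_n_steps=1000)
optimizer = tf.contrib.mixed_precision.LossScaleOptimizer(optimizer, loss_scale_manager)

train_op = optimizer.minimize(
    total_loss, global_step=tf.train.get_or_create_global_step())
```

The loss-scale schedule is driven by overflow events rather than by the training/warmup step counts, which is why it can usually be left alone when FLAGS.train_steps changes.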

yangapku commented 5 years ago

Hi @renatoviolin, it seems the FP16 modification is not compatible with a multi-GPU setup? The following error occurs (TF 1.12, cuDNN 7, CUDA 9):

I0703 12:01:55.457263 140202310260480 tf_logging.py:159] batch_all_reduce invoked for batches size = 0 with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
I0703 12:01:55.457496 140202310260480 tf_logging.py:115] Error reported to Coordinator: list index out of range
Traceback (most recent call last):
File "/home/slurm/job/tmp/job-50932/xdevel-tf1.12-cuda-9.0-cudnn7.4-trt5/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/home/slurm/job/tmp/job-50932/xdevel-tf1.12-cuda-9.0-cudnn7.4-trt5/lib/python2.7/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 177, in _call_for_each_tower
**merge_kwargs)
File "/home/slurm/job/tmp/job-50932/xdevel-tf1.12-cuda-9.0-cudnn7.4-trt5/lib/python2.7/site-packages/tensorflow/python/training/optimizer.py", line 661, in _distributed_apply
variable_scope.VariableAggregation.SUM, grads_and_vars)
File "/home/slurm/job/tmp/job-50932/xdevel-tf1.12-cuda-9.0-cudnn7.4-trt5/lib/python2.7/site-packages/tensorflow/python/training/distribute.py", line 776, in batch_reduce
return self._batch_reduce(aggregation, value_destination_pairs)
File "/home/slurm/job/tmp/job-50932/xdevel-tf1.12-cuda-9.0-cudnn7.4-trt5/lib/python2.7/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 628, in _batch_reduce
value_destination_pairs)
File "/home/slurm/job/tmp/job-50932/xdevel-tf1.12-cuda-9.0-cudnn7.4-trt5/lib/python2.7/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 243, in batch_reduce
return self._batch_reduce(aggregation, value_destination_pairs)
File "/home/slurm/job/tmp/job-50932/xdevel-tf1.12-cuda-9.0-cudnn7.4-trt5/lib/python2.7/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 597, in _batch_reduce
[v[0] for v in value_destination_pairs])
File "/home/slurm/job/tmp/job-50932/xdevel-tf1.12-cuda-9.0-cudnn7.4-trt5/lib/python2.7/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 618, in _batch_all_reduce
destinations = per_device_values[0].devices
IndexError: list index out of range
Traceback (most recent call last):
File "src/run_squad.py", line 1322, in <module>
tf.app.run()
File "/home/slurm/job/tmp/job-50932/xdevel-tf1.12-cuda-9.0-cudnn7.4-trt5/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "src/run_squad.py", line 1219, in main
estimator.train(input_fn=train_input_fn, max_steps=FLAGS.train_steps)
File "/home/slurm/job/tmp/job-50932/xdevel-tf1.12-cuda-9.0-cudnn7.4-trt5/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/slurm/job/tmp/job-50932/xdevel-tf1.12-cuda-9.0-cudnn7.4-trt5/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1205, in _train_model
return self._train_model_distributed(input_fn, hooks, saving_listeners)
File "/home/slurm/job/tmp/job-50932/xdevel-tf1.12-cuda-9.0-cudnn7.4-trt5/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1316, in _train_model_distributed
self.config)
File "/home/slurm/job/tmp/job-50932/xdevel-tf1.12-cuda-9.0-cudnn7.4-trt5/lib/python2.7/site-packages/tensorflow/python/training/distribute.py", line 721, in call_for_each_tower
return self._call_for_each_tower(fn, *args, **kwargs)
File "/home/slurm/job/tmp/job-50932/xdevel-tf1.12-cuda-9.0-cudnn7.4-trt5/lib/python2.7/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 556, in _call_for_each_tower
return _call_for_each_tower(self, fn, *args, **kwargs)
File "/home/slurm/job/tmp/job-50932/xdevel-tf1.12-cuda-9.0-cudnn7.4-trt5/lib/python2.7/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 183, in _call_for_each_tower
coord.join(threads)
File "/home/slurm/job/tmp/job-50932/xdevel-tf1.12-cuda-9.0-cudnn7.4-trt5/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/home/slurm/job/tmp/job-50932/xdevel-tf1.12-cuda-9.0-cudnn7.4-trt5/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/home/slurm/job/tmp/job-50932/xdevel-tf1.12-cuda-9.0-cudnn7.4-trt5/lib/python2.7/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 177, in _call_for_each_tower
**merge_kwargs)
File "/home/slurm/job/tmp/job-50932/xdevel-tf1.12-cuda-9.0-cudnn7.4-trt5/lib/python2.7/site-packages/tensorflow/python/training/optimizer.py", line 661, in _distributed_apply
variable_scope.VariableAggregation.SUM, grads_and_vars)
File "/home/slurm/job/tmp/job-50932/xdevel-tf1.12-cuda-9.0-cudnn7.4-trt5/lib/python2.7/site-packages/tensorflow/python/training/distribute.py", line 776, in batch_reduce
return self._batch_reduce(aggregation, value_destination_pairs)
File "/home/slurm/job/tmp/job-50932/xdevel-tf1.12-cuda-9.0-cudnn7.4-trt5/lib/python2.7/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 628, in _batch_reduce
value_destination_pairs)
File "/home/slurm/job/tmp/job-50932/xdevel-tf1.12-cuda-9.0-cudnn7.4-trt5/lib/python2.7/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 243, in batch_reduce
return self._batch_reduce(aggregation, value_destination_pairs)
File "/home/slurm/job/tmp/job-50932/xdevel-tf1.12-cuda-9.0-cudnn7.4-trt5/lib/python2.7/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 597, in _batch_reduce
[v[0] for v in value_destination_pairs])
File "/home/slurm/job/tmp/job-50932/xdevel-tf1.12-cuda-9.0-cudnn7.4-trt5/lib/python2.7/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 618, in _batch_all_reduce
destinations = per_device_values[0].devices
IndexError: list index out of range

(Only added FP16, with all the params trained)

renatoviolin commented 5 years ago

Hi @yangapku, I'm using a single RTX 2080 with CUDA 10, cuDNN 7.5, Python 3.7 and TF 1.13.1, and I don't have experience with multi-GPU. It might be worth taking a look at https://github.com/horovod/horovod; it seems that implementation supports FP16 and multi-GPU.
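Not code from this repo, but a minimal sketch of the Horovod pattern mentioned above, in case it helps (TF 1.x API; the dummy loss and learning rate are illustrative):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each worker process to a single GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Dummy loss standing in for the real model loss.
w = tf.get_variable("w", shape=[], initializer=tf.zeros_initializer())
total_loss = tf.square(w - 1.0)

# Scale the learning rate by the number of workers, wrap the optimizer so
# gradients are averaged across GPUs, and compress them to FP16 on the wire.
optimizer = tf.train.AdamOptimizer(learning_rate=3e-5 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer, compression=hvd.Compression.fp16)

train_op = optimizer.minimize(
    total_loss, global_step=tf.train.get_or_create_global_step())

# Make sure every worker starts from the same initial weights.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
```

Each process is then launched with horovodrun (one process per GPU), which sidesteps the MirroredStrategy/cross_tower_ops path that raised the IndexError above.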

MaXXXXfeng commented 5 years ago

Hi @renatoviolin, when I run the script I get an error like this: line 222, in get_qa_outputs, xlnet_config = xlnet.XLNetConfig(json_path=FLAGS.model_config_path), AttributeError: module 'xlnet' has no attribute 'XLNetConfig'. I didn't change any code in this file; is something wrong?

renatoviolin commented 5 years ago

Hi @MaXXXXfeng, have you used the original code before? It seems you have a problem with the file paths. Check the script gpu_squad_base_GPU.sh for the path variables and adjust them according to your setup:

# local paths
SQUAD_DIR=data/squad
INIT_CKPT_DIR=xlnet_cased_L-24_H-1024_A-16
PROC_DATA_DIR=proc_data/squad
MODEL_DIR=experiment/squad

Suppose your root directory is /home/user:

  • Create the folder "/home/user/xlnet" and download all the Python files and the scripts folder into it.
  • Create the folder "/home/user/data/squad" and make sure it contains the files "dev-v2.0.json", "train-v2.0.json" and "evaluate-v2.0.py".
  • Create the folder "/home/user/proc_data/squad"; this folder is used to save the tf.data files produced while preprocessing the SQuAD JSON files.
  • INIT_CKPT_DIR must point to the pretrained XLNet model.
  • MODEL_DIR must point to wherever you want to save the checkpoints during fine-tuning.

I execute the scripts from the folder /home/user/xlnet:

$ bash scripts/gpu_squad_base_GPU.sh

MaXXXXfeng commented 5 years ago

Hi @renatoviolin, thanks for your reply, it's very helpful. I now think the cause might be the GPU, since the same code works when I run it on a 2080. Another tip for anyone who wants to run this code: when you create the tf_record files, you need to change max_seq_length to 340 instead of 512.

renatoviolin commented 5 years ago

Hi @MaXXXXfeng, good point about max_seq_length. Any time we change max_seq_length we have to delete all files in PROC_DATA_DIR so that they are recreated with the right length.
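If it helps, a tiny illustrative helper (not part of the repo) that clears the preprocessed files so they get regenerated with the new max_seq_length:

```python
import glob
import os

# Same directory the training script writes its preprocessed tf_record files to.
PROC_DATA_DIR = "proc_data/squad"

for path in glob.glob(os.path.join(PROC_DATA_DIR, "*")):
    if os.path.isfile(path):
        os.remove(path)
        print("removed", path)
```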

harirajeev commented 5 years ago

Hi @renatoviolin, thanks for your code, it's very helpful. I tried to rerun using your code and got the results below:

Result
best_exact 50.08001347595385
best_exact_thresh -1.2460410594940186
best_f1 50.08001347595385
best_f1_thresh -1.2460410594940186
has_ans_exact 0.08282726045883941
has_ans_f1 0.11427438585200644

Running on a 16 GB P100. Not sure what I am missing. Could you please suggest if you have any clues?

thanks Hari

renatoviolin commented 5 years ago

Hi @harirajeev Sometimes I got results like yours (but I don't remember what I did to solve it, because I tried so many things). Here is my simplified code folder: https://drive.google.com/file/d/1d_XM-mQFIQRlQMO50Io6u6UDRfbABYEi/view?usp=sharing. I trained it for 90,000 iterations and got the following result: best_f1 86.33. To train, run scripts/_train.sh (note that you need to adjust the paths).

Here is the pre-trained model: https://drive.google.com/open?id=11g7yvyqhmuzvDukiT-Fe4gNid4pcr-vc. Let me know if it helped.

Renato

renatoviolin commented 5 years ago

Hi Hari,

No, use_bfloat16 = False (line 59, file run_squad_3.py).

I use float16 directly in the code (lines 147-150, file model_utils_3.py).

I think bfloat16 is specific to TPU usage.

Regards,
Renato

On Sun, Aug 4, 2019 at 2:25 AM harirajeev notifications@github.com wrote:

Thank you so much @renatoviolin, were you using use_bfloat16 as True for training?

thanks Hari

harirajeev commented 5 years ago

Thank you @renatoviolin, got the results below:

====== RESULT ========
best_exact 83.39088688621241
best_exact_thresh -3.2638230323791504
best_f1 86.23878272238977
best_f1_thresh -2.422508955001831
has_ans_exact 0.8621794871794872
has_ans_f1 0.9295812035245823

harirajeev commented 5 years ago

Hi @renatoviolin, have you tried running your model for inference on data outside the SQuAD dataset? How were the results? Thanks, Hari

renatoviolin commented 5 years ago

Hi @harirajeev, I tried to fine-tune BERT on TriviaQA, but didn't get good results in that case. I haven't tried it with XLNet yet.

desaibhargav commented 4 years ago

Hi @harirajeev Sometimes I got results like yours (but I don't remember what I did to solve it, because I tried so many things). Here is my simplified code folder that I trained for 90,000 iterations and got the following result: best_f1 86.33. To train, run scripts/_train.sh (note that you need to adjust the paths).

Here is the pre-trained model (Download). Let me know if it helped.

Renato

Hey there! Really helpful code and explanations of the various issues that might come up! I just wanted to know whether the model link you've attached here is the one with which you got the F1 of 86.33?