man0007 closed this issue 4 years ago
Following the original BERT training, we set a maximum sentence length of 128 tokens, and train the model until the training loss starts to converge. We then continue training the model allowing sentence lengths up to 512 tokens.
Which would give better accuracy: a maximum sequence length of 256 or 512?
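The quoted procedure amounts to hard-capping each tokenized sequence at 128 tokens for most of training, then raising the cap to 512 for the final phase. A minimal sketch of that length cap in plain Python (the `truncate_to_max_len` helper and the 300-token example document are hypothetical, not the authors' actual pipeline):

```python
def truncate_to_max_len(token_ids, max_len):
    """Hard-cap a tokenized sequence at max_len tokens (BERT-style)."""
    return token_ids[:max_len]

doc = list(range(300))  # stand-in for a 300-token document

# Phase 1: train with a 128-token cap until the loss starts to converge.
phase1_input = truncate_to_max_len(doc, 128)   # 128 tokens survive

# Phase 2: continue training with the cap raised to 512.
phase2_input = truncate_to_max_len(doc, 512)   # all 300 tokens fit

print(len(phase1_input), len(phase2_input))
```

Note that under the second cap, any document shorter than 512 tokens passes through untruncated, which is why the later phase sees longer contexts only for documents that actually exceed 128 tokens.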