spprabhu opened this issue 4 years ago
You need to download my pretrained autoregressive models. Please check the GitHub page; the commands for downloading them are there.
BTW, your average decoding times are 15ms and 38ms, faster than my results. Amazing.
Hi, these are the speeds for the latent search version:
python run.py --opt_dtok wmt14_ende --use_pretrain --opt_Trefine_steps 1 --opt_Tlatent_search --opt_Tteacher_rescore --test --evaluate
[OPTS] Model tag: dtok-wmt14_ende
Running on 1 GPUs
loading pretrained model in ./mydata/shu_trained_wmt14_ende.pt
translating: 100.0%
Average decoding time: 205ms, std: 89
BLEU = 25.140153770802392
For only a minor increase in BLEU, decoding takes a lot of time.
Any reason for such a huge gap between the specified and the actual results?
Did you try a second time?
Although I am using T4 GPUs.
Can you report the results with
python run.py --opt_dtok wmt14_ende --use_pretrain --opt_Trefine_steps 1 --opt_Tlatent_search --test --evaluate
Sure.
@spprabhu Gonna sleep, let me check the preprocessing part tomorrow (JST timezone).
Ok.
Hi, these are the results with latent search but without teacher rescoring:
python run.py --opt_dtok wmt14_ende --use_pretrain --opt_Trefine_steps 1 --opt_Tlatent_search --test --evaluate
[OPTS] Model tag: dtok-wmt14_ende
Running on 1 GPUs
loading pretrained model in ./mydata/shu_trained_wmt14_ende.pt
translating: 100.0%
Average decoding time: 157ms, std: 67
BLEU = 25.005498749025612
I guess there isn't much improvement from using latent search and teacher rescoring. I need to know which models you are using as autoregressive baselines, though.
It would be useful to compare decoding time against different autoregressive models (e.g. Transformer, GPT-2, Fairseq) and benchmark them; a rough harness is sketched below.
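A minimal timing sketch for such a benchmark, assuming each model exposes a translate(batch) method (a hypothetical interface; adapt it to each library's actual API):

import time
import torch

def time_decoding(model, batches, warmup=5):
    """Return mean and std of per-batch decoding time in milliseconds."""
    times = []
    with torch.no_grad():
        for i, batch in enumerate(batches):
            if torch.cuda.is_available():
                torch.cuda.synchronize()  # flush pending kernels before starting the clock
            start = time.perf_counter()
            model.translate(batch)  # hypothetical decode call
            if torch.cuda.is_available():
                torch.cuda.synchronize()  # make sure the GPU work is actually finished
            if i >= warmup:  # skip warm-up batches (CUDA context init, kernel caching)
                times.append((time.perf_counter() - start) * 1000.0)
    mean = sum(times) / len(times)
    std = (sum((t - mean) ** 2 for t in times) / len(times)) ** 0.5
    return mean, std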
Hi Raphael, any updates?
Hi, I submitted a job to evaluate the decoding again. I have done that many times, and it never goes above 100ms on my end. So I guess your GPU may have lower performance, such that multiple latent variables can't be computed simultaneously. The result without teacher rescoring points to this as well.
My autoregressive baseline model is just a normal transformer.
If you want, I can give you my script for training the autoregressive baseline. Note that the definition of the baseline models is already in the code.
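To make "computed simultaneously" concrete, here is a minimal sketch (not the repo's actual code) of how latent candidates are typically folded into the batch dimension so that one decoder pass scores all of them; a GPU that cannot fit batch × candidates in a single pass ends up processing them in slower serial chunks:

import torch

def flatten_candidates(src, latents):
    """src: [batch, src_len] token ids; latents: [batch, n_cand, latent_dim].
    Returns inputs shaped for a single batched decoder pass."""
    batch, n_cand, dim = latents.shape
    # Repeat each source sentence once per candidate -> [batch * n_cand, src_len]
    src_rep = src.repeat_interleave(n_cand, dim=0)
    # Fold candidates into the batch dimension -> [batch * n_cand, latent_dim]
    flat = latents.reshape(batch * n_cand, dim)
    return src_rep, flat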
@spprabhu Can you try this command? It reduces the number of candidate latents to 10.
python run.py --opt_dtok wmt14_ende --use_pretrain --opt_Trefine_steps 1 --opt_Tlatent_search --opt_Tteacher_rescore --opt_Tcandidate_num 10 --test --evaluate
Ok wait, I'll try it.
Hi, these are the results for 10 candidate latents. They are similar to the run with one refinement step:
python run.py --opt_dtok wmt14_ende --use_pretrain --opt_Trefine_steps 1 --opt_Tlatent_search --opt_Tteacher_rescore --opt_Tcandidate_num 10 --test --evaluate
[OPTS] Model tag: dtok-wmt14_ende
Running on 1 GPUs
loading pretrained model in ./mydata/shu_trained_wmt14_ende.pt
translating: 100.0%
Average decoding time: 61ms, std: 9
BLEU = 24.64735121624353
By the way, what is the default number of latents you are working with?
Also, did you try increasing the number of layers in the decoder? If so, what were the results?
The default number is 50. A V100 GPU can easily process them at the same time, so decoding is very fast.
Your GPU apparently has the capacity to process 10 latent variables simultaneously in one pass, but not 50. You can adjust --opt_Tcandidate_num to find a sweet spot; one way to sweep it is sketched below.
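A hypothetical sweep for finding that sweet spot; it assumes run.py prints the "Average decoding time" and "BLEU" summary lines exactly as in the logs above:

import subprocess

BASE = ("python run.py --opt_dtok wmt14_ende --use_pretrain "
        "--opt_Trefine_steps 1 --opt_Tlatent_search --opt_Tteacher_rescore "
        "--test --evaluate")

for n in (10, 20, 30, 40, 50):
    out = subprocess.run((BASE + f" --opt_Tcandidate_num {n}").split(),
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        # Keep only the summary lines for a side-by-side comparison.
        if "Average decoding time" in line or "BLEU" in line:
            print(f"candidates={n}: {line.strip()}")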
I didn't try tweaking the number of layers. I have seen ACL papers that increase the number of encoder layers, but not one that increases the decoder layers.
Hi, any ideas on how to further increase inference speed without much engineering effort?
Also, are you working on making it FP16, INT8, or mixed-precision ready?
These are the results with 30 candidate latents. Not much difference, though:
python run.py --opt_dtok wmt14_ende --use_pretrain --opt_Trefine_steps 1 --opt_Tlatent_search --opt_Tteacher_rescore --opt_Tcandidate_num 30 --test --evaluate
[OPTS] Model tag: dtok-wmt14_ende
Running on 1 GPUs
loading pretrained model in ./mydata/shu_trained_wmt14_ende.pt
translating: 100.0%
Average decoding time: 124ms, std: 50
BLEU = 25.13307027819897
I did try to implement fp16 (half precision), which seems promising. If fp16 can be made to work, it will definitely be faster. But I was hitting some bottlenecks implementing fp16 with PyTorch...
I didn't try other optimizers. We tried a lot of architectures, but they were not aimed at faster decoding.
Ok, thanks. And about the fp16 version:
Can you explain the bottlenecks in detail, please?
I was actually trying to enable fp16 for training. However, it turns out that in PyTorch, the Adam optimizer doesn't work well with fp16.
Then I tried using AMP for fp16, but it turns out that horovod doesn't work that well with AMP, so I just gave up.
Maybe it would be fine to enable fp16 just for the testing code. I don't know whether the quality would stay the same, since the model was trained with fp32.
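As a sketch of that test-time-only idea, here is a minimal version using PyTorch's native torch.cuda.amp autocast (which may not have been available when this thread relied on apex AMP and horovod). Inference needs no optimizer or GradScaler, so the training-time Adam issue doesn't apply; matmuls run in half precision while numerically fragile ops such as softmax and layer norm stay in fp32. `model` and `src_tokens` are hypothetical placeholders:

import torch

def decode_autocast(model: torch.nn.Module, src_tokens: torch.Tensor):
    """Inference-only mixed precision: no optimizer state to worry about."""
    model = model.cuda().eval()
    with torch.no_grad(), torch.cuda.amp.autocast():
        # Token ids stay integer-typed; autocast casts float activations
        # inside the forward pass to fp16 where it is safe to do so.
        return model(src_tokens.cuda())

Whether BLEU then matches the fp32 run is exactly the open question above, since the weights were trained in full precision.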
Hi Raphael, I am getting the following results on WMT En-De translation:
python run.py --opt_dtok wmt14_ende --use_pretrain --test --evaluate
[OPTS] Model tag: dtok-wmt14_ende
Running on 1 GPUs
loading pretrained model in ./mydata/shu_trained_wmt14_ende.pt
translating: 100.0%
Average decoding time: 15ms, std: 1
BLEU = 22.304035169763978

python run.py --opt_dtok wmt14_ende --use_pretrain --opt_Trefine_steps 1 --test --evaluate
[OPTS] Model tag: dtok-wmt14_ende
Running on 1 GPUs
loading pretrained model in ./mydata/shu_trained_wmt14_ende.pt
translating: 100.0%
Average decoding time: 38ms, std: 1
BLEU = 24.147135058514433
They are in line with the specified results. However, I am not able to run the model with latent search, as it gives an AssertionError:
python run.py --opt_dtok wmt14_ende --use_pretrain --opt_Trefine_steps 1 --opt_Tlatent_search --opt_Tteacher_rescore --test --evaluate
[OPTS] Model tag: dtok-wmt14_ende
Running on 1 GPUs
Traceback (most recent call last):
  File "run.py", line 207, in
    assert os.path.exists(pretrained_autoregressive_path)
AssertionError