zomux / lanmt

LaNMT: Latent-variable Non-autoregressive Neural Machine Translation with Deterministic Inference

Assertion Error for En-De translation using latent search and pretrained model #5

Open spprabhu opened 4 years ago

spprabhu commented 4 years ago

Hi Raphael, I am getting the following results on WMT En-De translation for:

  1. Fast translation
  2. One refinement step

python run.py --opt_dtok wmt14_ende --use_pretrain --test --evaluate
[OPTS] Model tag: dtok-wmt14_ende
Running on 1 GPUs
loading pretrained model in ./mydata/shu_trained_wmt14_ende.pt
translating: 100.0%
Average decoding time: 15ms, std: 1
BLEU = 22.304035169763978

python run.py --opt_dtok wmt14_ende --use_pretrain --opt_Trefine_steps 1 --test --evaluate
[OPTS] Model tag: dtok-wmt14_ende
Running on 1 GPUs
loading pretrained model in ./mydata/shu_trained_wmt14_ende.pt
translating: 100.0%
Average decoding time: 38ms, std: 1
BLEU = 24.147135058514433

They are in line with the reported results. However, I am not able to run the model with latent search, as it raises an AssertionError.

python run.py --opt_dtok wmt14_ende --use_pretrain --opt_Trefine_steps 1 --opt_Tlatent_search --opt_Tteacher_rescore --test --evaluate
[OPTS] Model tag: dtok-wmt14_ende
Running on 1 GPUs
Traceback (most recent call last):
  File "run.py", line 207, in <module>
    assert os.path.exists(pretrained_autoregressive_path)
AssertionError

zomux commented 4 years ago

You need to download my pretrained autoregressive models. Please check the GitHub page; the commands for downloading them are there.

zomux commented 4 years ago

BTW, your average decoding times are 15ms and 38ms, faster than my results. Amazing.

spprabhu commented 4 years ago

Hi, these are the speeds for the latent search version:

python run.py --opt_dtok wmt14_ende --use_pretrain --opt_Trefine_steps 1 --opt_Tlatent_search --opt_Tteacher_rescore --test --evaluate
[OPTS] Model tag: dtok-wmt14_ende
Running on 1 GPUs
loading pretrained model in ./mydata/shu_trained_wmt14_ende.pt
translating: 100.0%
Average decoding time: 205ms, std: 89
BLEU = 25.140153770802392

For just a minor increase in BLEU, it takes a lot more time to decode.

spprabhu commented 4 years ago

Any reason for such a huge gap between the reported and actual results?

zomux commented 4 years ago

Did you try a second time?

spprabhu commented 4 years ago

Although I am using T4 GPUs.

zomux commented 4 years ago

Can you report the results with:

python run.py --opt_dtok wmt14_ende --use_pretrain --opt_Trefine_steps 1 --opt_Tlatent_search --test --evaluate

spprabhu commented 4 years ago

Sure.

zomux commented 4 years ago

@spprabhu Gonna sleep, let me check the preprocessing part tomorrow (JST timezone).

spprabhu commented 4 years ago

Ok.

spprabhu commented 4 years ago

Hi, these are the results with latent search but without teacher rescoring:

python run.py --opt_dtok wmt14_ende --use_pretrain --opt_Trefine_steps 1 --opt_Tlatent_search --test --evaluate
[OPTS] Model tag: dtok-wmt14_ende
Running on 1 GPUs
loading pretrained model in ./mydata/shu_trained_wmt14_ende.pt
translating: 100.0%
Average decoding time: 157ms, std: 67
BLEU = 25.005498749025612

spprabhu commented 4 years ago

I guess there isn't much improvement from using latent search and teacher rescoring. I would like to know what models you are using for your autoregressive methods, though.

It would be good if we could compare decoding time using different autoregressive models (e.g. Transformer, GPT-2, fairseq) and benchmark them.
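
A minimal timing-harness sketch for such a comparison, mirroring the "Average decoding time" numbers in this thread (decode_fn and batches are hypothetical placeholders, not this repo's API):

import time

def benchmark_decoder(decode_fn, batches):
    # Time each decode call and report mean/std in milliseconds.
    # (On GPU, call torch.cuda.synchronize() before reading the clock
    # so the measurement is accurate.)
    times = []
    for batch in batches:
        start = time.perf_counter()
        decode_fn(batch)
        times.append((time.perf_counter() - start) * 1000.0)
    mean = sum(times) / len(times)
    std = (sum((t - mean) ** 2 for t in times) / len(times)) ** 0.5
    return mean, std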

spprabhu commented 4 years ago

Hi Raphael, any updates?

zomux commented 4 years ago

Hi, I submitted a job to evaluate the decoding again. I have done that many times, and it never goes over 100ms. So I guess your GPU may have lower performance, such that multiple latent variables can't be computed simultaneously. This is suggested by your result without teacher rescoring.
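
For illustration, a minimal sketch of how candidate latents could be decoded in smaller chunks on such a GPU, trading speed for a lower peak memory footprint (decode_from_latent is a hypothetical call, not this repo's actual API):

import torch

def decode_candidates_in_chunks(model, latent_candidates, chunk_size=10):
    # Decode candidate latent variables chunk by chunk instead of in one
    # big batch, so a smaller GPU can still process all candidates.
    outputs = []
    with torch.no_grad():
        for chunk in torch.split(latent_candidates, chunk_size, dim=0):
            outputs.append(model.decode_from_latent(chunk))  # hypothetical call
    return torch.cat(outputs, dim=0)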

zomux commented 4 years ago

My autoregressive baseline model is just a normal transformer.

zomux commented 4 years ago

If you want, I can give you my script for training the autoregressive baseline. Note that the definition of the baseline models is already in the code.

zomux commented 4 years ago

@spprabhu Can you try this command? It reduces the number of candidate latents to 10.

python run.py --opt_dtok wmt14_ende --use_pretrain --opt_Trefine_steps 1 --opt_Tlatent_search --opt_Tteacher_rescore --opt_Tcandidate_num 10 --test --evaluate

spprabhu commented 4 years ago

Ok, wait, I'll try it.

spprabhu commented 4 years ago

Hi, these are the results for 10 candidate latents. It's similar to the run with one refinement step:

python run.py --opt_dtok wmt14_ende --use_pretrain --opt_Trefine_steps 1 --opt_Tlatent_search --opt_Tteacher_rescore --opt_Tcandidate_num 10 --test --evaluate
[OPTS] Model tag: dtok-wmt14_ende
Running on 1 GPUs
loading pretrained model in ./mydata/shu_trained_wmt14_ende.pt
translating: 100.0%
Average decoding time: 61ms, std: 9
BLEU = 24.64735121624353

By the way, what is the default number of candidate latents you are working with?

spprabhu commented 4 years ago

Also, did you try increasing the number of layers in the decoder? If yes, what were the results?

zomux commented 4 years ago

The default number is 50. A V100 GPU can easily process them at the same time, so the decoding is very fast.

Your GPU apparently has the capacity to process 10 latent variables simultaneously in one pass, but not 50. You can adjust --opt_Tcandidate_num to find a sweet spot; see the sketch below.
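
For example, a small shell loop over the flags already shown in this thread (a sketch, not a script shipped with the repo) could sweep candidate counts to locate that sweet spot:

# Sweep candidate counts to find the speed/BLEU trade-off on this GPU.
for n in 10 20 30 50; do
  python run.py --opt_dtok wmt14_ende --use_pretrain --opt_Trefine_steps 1 \
    --opt_Tlatent_search --opt_Tteacher_rescore --opt_Tcandidate_num "$n" \
    --test --evaluate
done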

zomux commented 4 years ago

I didn't try tweaking the number of layers. I have seen ACL papers that increase the number of encoder layers, but not one that increases the decoder layers.

spprabhu commented 4 years ago

Hi, any ideas on how to further increase inference speed without much engineering effort?

  1. For example, what further architectural changes could be made?
  2. Did you try other optimizers like LAMB or SM3? If yes, what were your results? If not, what are your thoughts on them?
  3. Also, did you try your other implementation of compressed word embeddings with this model?

spprabhu commented 4 years ago

Also, are you working on making it FP16, INT8, or mixed-precision ready?

spprabhu commented 4 years ago

These are the results with 30 candidate latents. Not much difference, though:

python run.py --opt_dtok wmt14_ende --use_pretrain --opt_Trefine_steps 1 --opt_Tlatent_search --opt_Tteacher_rescore --opt_Tcandidate_num 30 --test --evaluate
[OPTS] Model tag: dtok-wmt14_ende
Running on 1 GPUs
loading pretrained model in ./mydata/shu_trained_wmt14_ende.pt
translating: 100.0%
Average decoding time: 124ms, std: 50
BLEU = 25.13307027819897

zomux commented 4 years ago

I did try to implement fp16 (half precision), which seems promising. If fp16 can be made to work, it will definitely be faster. But I was hitting some bottlenecks implementing fp16 with PyTorch...

zomux commented 4 years ago

I didn't try other optimizers. We tried a lot of architectures, but they were not aimed at faster decoding.

spprabhu commented 4 years ago

Ok, thanks. And about the fp16 version:

spprabhu commented 4 years ago

Can you explain the bottlenecks in detail, please?

zomux commented 4 years ago

I was actually trying to enable fp16 for training. However, it turns out that in PyTorch the Adam optimizer doesn't work well with fp16.

Then I tried using AMP for fp16, but it turns out that Horovod doesn't work that well with AMP. So I just gave up.

Maybe it would be fine to enable fp16 just for the testing code. I don't know whether the quality would be the same, since the model was trained with fp32.
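
For what it's worth, a minimal sketch of inference-only fp16 using PyTorch's autocast (available in PyTorch 1.6+; load_model and translate are hypothetical names, not this repo's API):

import torch

model = load_model().cuda().eval()  # hypothetical loader; weights trained in fp32

with torch.no_grad(), torch.cuda.amp.autocast():
    # Matmuls run in fp16 under autocast; whether the output quality matches
    # the fp32-trained model is exactly the open question above.
    translation = model.translate(batch)  # hypothetical decode call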