zomux / lanmt-ebm

Better base latent model #17

Open zomux opened 4 years ago

zomux commented 4 years ago

Todo List

Checklist

jaseleephd commented 4 years ago

Previous WMT'14 En->De results without refinement (mean of the prior):
- 22.5 BLEU ("strong" in Shu et al.)
- 23.15 BLEU ("Gauss-base" in Lee et al., with latent_dim=256)
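
For reference, "mean of the prior" here means decoding once from z = mu of p(z|x) instead of sampling or running refinement. A minimal PyTorch sketch of that decoding path, with toy modules that are only illustrative and not the lanmt-ebm code:

```python
import torch
import torch.nn as nn

class ToyLatentNMT(nn.Module):
    def __init__(self, vocab=1000, hidden=512, latent=256):
        super().__init__()
        self.src_emb = nn.Embedding(vocab, hidden)
        self.prior = nn.Linear(hidden, 2 * latent)    # predicts [mu, logvar] of p(z|x)
        self.len_pred = nn.Linear(hidden, 64)         # toy target-length classifier
        self.decoder = nn.Linear(latent, vocab)       # toy non-autoregressive decoder

    def translate_prior_mean(self, src):
        h = self.src_emb(src).mean(dim=1, keepdim=True)   # pooled source states
        mu, _logvar = self.prior(h).chunk(2, dim=-1)
        length = self.len_pred(h).argmax(dim=-1)          # predicted target length
        logits = self.decoder(mu)                         # decode from the prior mean, no sampling
        return logits.argmax(dim=-1), length

model = ToyLatentNMT()
tokens, length = model.translate_prior_mean(torch.randint(0, 1000, (1, 7)))
```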

zomux commented 4 years ago

Train with hidden size = 512

run_8nodes abcirun.sh python lanmt/run2.py --root $HOME/data/wmt14_ende_fair --opt_dtok wmt14_fair_ende --opt_batchtokens 8192 --opt_distill --opt_annealbudget --opt_longertrain --opt_hiddensz 512 --opt_embedsz 512 --train

[valid] len_loss=1.90 len_acc=0.28 loss=30.99 word_acc=0.95 KL_budget=0.76 kl=22.77 tok_kl=0.79 nll=6.33 * (epoch 113, step 93158)

BLEU = 21.2024005116522
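
For orientation, a hedged sketch of how the logged quantities might fit together (nll + length loss + a KL term clamped by the annealed budget); the actual composition in run2.py may differ:

```python
import torch

def training_loss(nll, kl, len_loss, kl_budget, num_tokens):
    # Illustrative only: treats "KL_budget" as a free-bits style floor and
    # "tok_kl" as KL averaged over target tokens, as the log above suggests.
    tok_kl = kl / num_tokens
    penalized_kl = torch.clamp(tok_kl - kl_budget, min=0.0) * num_tokens
    return nll + penalized_kl + len_loss, tok_kl

# Loosely matching the validation log above (kl=22.77, tok_kl=0.79 -> ~29 tokens):
loss, tok_kl = training_loss(torch.tensor(6.33), torch.tensor(22.77),
                             torch.tensor(1.90), 0.76, 29)
```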

zomux commented 4 years ago

Training with fastanneal

run_2nodes abcirun.sh python lanmt/run2.py --root $HOME/data/wmt14_ende_fair --opt_dtok wmt14_fair_ende --opt_batchtokens 8192 --opt_distill --opt_annealbudget --opt_longertrain --opt_hiddensz 512 --opt_embedsz 512 --opt_fastanneal --train --test --evaluate

jaseleephd commented 4 years ago

Let's keep using the distilled dataset from fairseq: the "strong" model got 25.3 BLEU on it with 1 refinement, so it should be pretty good.

jaseleephd commented 4 years ago

Jason's Gauss VAE models : https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/research/transformer_vae_flow_prior.py#L655-L682
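
The linked file is TensorFlow; for comparison with this repo, here is a minimal PyTorch sketch of the same kind of diagonal-Gaussian latent layer with reparameterization (names and shapes are illustrative, not the actual lanmt classes):

```python
import torch
import torch.nn as nn

class GaussLatent(nn.Module):
    def __init__(self, hidden=512, latent=256):
        super().__init__()
        self.q_proj = nn.Linear(hidden, 2 * latent)   # posterior q(z | x, y)
        self.p_proj = nn.Linear(hidden, 2 * latent)   # prior p(z | x)

    def forward(self, enc_xy, enc_x):
        q_mu, q_logvar = self.q_proj(enc_xy).chunk(2, dim=-1)
        p_mu, p_logvar = self.p_proj(enc_x).chunk(2, dim=-1)
        z = q_mu + torch.randn_like(q_mu) * (0.5 * q_logvar).exp()   # reparameterize
        # KL( q || p ) for two diagonal Gaussians, summed over latent dims
        kl = 0.5 * (p_logvar - q_logvar
                    + (q_logvar.exp() + (q_mu - p_mu) ** 2) / p_logvar.exp()
                    - 1.0).sum(-1)
        return z, kl
```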

zomux commented 4 years ago

Checklist

zomux commented 4 years ago

Distilled dataset, ignoring sentences longer than 64 tokens
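
A small sketch of that preprocessing step, assuming plain-text parallel files and whitespace tokenization (paths and the helper name are placeholders):

```python
def filter_by_length(src_path, tgt_path, out_src, out_tgt, max_len=64):
    """Drop sentence pairs where either side exceeds max_len tokens."""
    with open(src_path) as fs, open(tgt_path) as ft, \
         open(out_src, "w") as osrc, open(out_tgt, "w") as otgt:
        for s, t in zip(fs, ft):
            if len(s.split()) <= max_len and len(t.split()) <= max_len:
                osrc.write(s)
                otgt.write(t)
```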

zomux commented 4 years ago

Latent dim = 512

run_2nodes python lanmt/run2.py --root $HOME/data/wmt14_ende_fair --opt_dtok wmt14_fair_ende --opt_batchtokens 8192 --opt_distill --opt_annealbudget --opt_longertrain --opt_hiddensz 512 --opt_embedsz 512 --opt_fastanneal --opt_latentdim 512 --train --test --evaluate

zomux commented 4 years ago

More prior, q, and decoder layers (prior 4, q 4, decoder 6)

run_2nodes python lanmt/run2.py --root $HOME/data/wmt14_ende_fair --opt_dtok wmt14_fair_ende --opt_batchtokens 8192 --opt_distill --opt_annealbudget --opt_longertrain --opt_hiddensz 512 --opt_embedsz 512 --opt_fastanneal --opt_latentdim 512 --opt_priorl 4 --opt_decoderl 6 --train --test --evaluate

jaseleephd commented 4 years ago

Also noticed the default num_heads is 4: https://github.com/zomux/lanmt-ebm/blob/master/run_ebm.py#L82

@zomux what value are you using for WMT experiments?

zomux commented 4 years ago

num_heads=8

run_2nodes python lanmt/run2.py --root $HOME/data/wmt14_ende_fair --opt_dtok wmt14_fair_ende --opt_batchtokens 8192 --opt_distill --opt_annealbudget --opt_longertrain --opt_hiddensz 512 --opt_embedsz 512 --opt_fastanneal --opt_latentdim 512 --opt_priorl 4 --opt_decoderl 6 --opt_heads 8 --train --test --evaluate
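
As a sanity check on these flags: hidden size 512 with 8 heads gives 64-dim heads. A hedged sketch of what the deeper stacks plus 8 heads amount to structurally; the module names are stand-ins, not the actual lanmt code:

```python
import torch.nn as nn

def make_stack(n_layers, hidden=512, heads=8):
    # hidden must be divisible by heads: 512 / 8 = 64-dim heads here.
    layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=n_layers)

prior_net   = make_stack(4)   # --opt_priorl 4
q_net       = make_stack(4)   # posterior depth (assumed to follow the prior setting)
decoder_net = make_stack(6)   # --opt_decoderl 6
```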

zomux commented 4 years ago

Training for 500k steps

abcirun.sh python lanmt/run2.py --root $HOME/data/wmt14_ende_fair --opt_dtok wmt14_fair_ende --opt_batchtokens 8192 --opt_distill --opt_annealbudget --opt_x5longertrain --opt_hiddensz 512 --opt_embedsz 512 --opt_fastanneal --opt_latentdim 512 --opt_priorl 4 --opt_decoderl 6 --train --test --evaluate

After 200k steps

zomux commented 4 years ago

num_heads=8, layers = 6/6/6

run_2nodes python lanmt/run2.py --root $HOME/data/wmt14_ende_fair --opt_dtok wmt14_fair_ende --opt_batchtokens 8192 --opt_distill --opt_annealbudget --opt_longertrain --opt_hiddensz 512 --opt_embedsz 512 --opt_fastanneal --opt_latentdim 512 --opt_priorl 6 --opt_decoderl 6 --opt_heads 8 --train --test --evaluate

zomux commented 4 years ago

num_heads=8, layers = 6/6/6, 500k steps

./run_2nodes_long.sh abcirun.sh python lanmt/run2.py --root $HOME/data/wmt14_ende_fair --opt_dtok wmt14_fair_ende --opt_batchtokens 8192 --opt_distill --opt_annealbudget --opt_x5longertrain --opt_hiddensz 512 --opt_embedsz 512 --opt_fastanneal --opt_latentdim 512 --opt_priorl 6 --opt_decoderl 6 --opt_heads 8 --train --test --evaluate

At 300k steps: