
Issue with EnFr task. Maybe a tokenization problem? #518

Open
gsoul opened this issue Jan 15, 2018 · 11 comments

gsoul commented Jan 15, 2018

I use T2T version 1.4.1 with 2 × 1080 Ti GPUs and a batch size of 2048.

When I trained translate_ende_wmt32k with the big_single_gpu model for ~600k steps, I got a BLEU score of 26.02, which is only ~2 BLEU points less than reported in the "Attention Is All You Need" paper.

But when I use the same setup and parameters to train the translate_enfr_wmt32k model, I get around 33 BLEU after 1.4 million steps, which is a whole 9 BLEU points less than the result in the paper. That seems too large a gap to be explained by the difference in the number of GPUs.

On Gitter, @lukaszkaiser suggested that it might be a tokenization issue.

I'm not sure what the issue is at this point, but the results look somewhat suspicious to me.

I hope you can advise on what might be happening here.


martinpopel commented Jan 15, 2018

I have no experience with enfr, so I cannot really help. It would be nice to create a wiki page for reporting replications of the ende and enfr results with various T2T versions and hyperparameters (especially the number of GPUs and the batch size).

That said: enfr has 8 times more training data than ende. More GPUs mean a higher effective batch size, and this may influence (improve) the results, especially in later stages of training.
Moreover, I guess the "Attention Is All You Need" paper used batch_size=3072, thus an effective batch size of 8*3072=24576 (which is "approximately 25000 source tokens and 25000 target tokens"), so after 300k training steps the model had seen 3072*8*300000 training tokens. So with 2 GPUs and batch_size=2048, you need 1.8 million training steps to see the same amount of training data. However, according to my experiments this is not enough to reach the same results (perhaps a higher learning rate could compensate for the lower effective batch size).
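
Spelled out as a quick sanity check, using only the numbers from the comment above:

```bash
# Quick arithmetic check with the numbers above: data seen in the paper's setup
# (8 GPUs x batch_size 3072 x 300k steps) vs. steps needed with 2 GPUs x 2048.
echo $((8 * 3072 * 300000))               # 7372800000 sub-word tokens per side
echo $((8 * 3072 * 300000 / (2 * 2048)))  # 1800000 steps, i.e. 1.8M
```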

@vince62s

If you have only 2 GPUs, run the base model with 2×4096; you'll get better results.
On 4 GPUs × 4096 I replicated the paper with 38.2.
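
For concreteness, a hypothetical sketch of that 2-GPU base-model setup as a t2t-trainer call (paths and exact flag spelling are my assumptions, not vince62s's actual command; some T2T versions use `--problems`, others `--problem`):

```bash
# Hypothetical sketch of the suggested 2-GPU base-model run; $DATA_DIR and
# $OUTPUT_DIR are placeholders, and flag names may differ between T2T versions.
t2t-trainer \
  --problems=translate_enfr_wmt32k \
  --model=transformer \
  --hparams_set=transformer_base \
  --hparams="batch_size=4096" \
  --worker_gpu=2 \
  --data_dir=$DATA_DIR \
  --output_dir=$OUTPUT_DIR
```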


martinpopel commented Jan 15, 2018

On a GTX 1080 Ti with 11 GB of memory, transformer_big_single_gpu, and a recent version of T2T, I can use at most batch_size=2000 (2050 fails with OOM). I guess this may differ slightly depending on the maximum sentence length in the training data (so 2048 is possible for some datasets, but not much more).
Also, according to my experiments, transformer_big_single_gpu with batch_size=1500 is clearly better than transformer_base_single_gpu with batch_size=4500 (after 1 day of training on one GPU, or after 17 hours of training on 8 GPUs). And transformer_big_single_gpu with batch_size=2000 is even better.


apeterswu commented Jan 16, 2018

Hi, I tried to use the transformer_enfr_big setting to run the wmt_enfr32k data. On an M40 with 24 GB of memory, it only supports a batch size of 2000, so I trained the model on 8 M40s with batch_size=2000. After training for about 90,000 steps, the BLEU is only 27.x, which is far from the reported result. It is really hard to reproduce the wmt_enfr32k result.
Besides, for evaluation, I just feed in the en-fr.en test file (raw data) and match the output against en-fr.fr (raw data). Is there anything wrong with that?
@lukaszkaiser Could you please give a detailed description of how to reproduce the wmt_enfr32k result? I am struggling to reproduce the enfr result...
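
For what it's worth, a minimal sketch of that evaluation flow with T2T's own tools (file names, hparams set, and flag spelling are my assumptions; older T2T versions use `--problems` instead of `--problem`, and t2t-bleu is only available in newer versions):

```bash
# Hypothetical sketch: decode the raw (untokenized) test source, then score
# against the raw reference. File names and the hparams set are placeholders.
t2t-decoder \
  --problems=translate_enfr_wmt32k \
  --model=transformer \
  --hparams_set=transformer_big \
  --data_dir=$DATA_DIR \
  --output_dir=$OUTPUT_DIR \
  --decode_hparams="beam_size=4,alpha=0.6" \
  --decode_from_file=en-fr.en \
  --decode_to_file=en-fr.translated.fr

# t2t-bleu applies its own tokenization, so it can score detokenized text directly.
t2t-bleu --translation=en-fr.translated.fr --reference=en-fr.fr
```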

@martinpopel

I have one more tip on how to increase the batch size: set the max_length parameter to limit overly long sentences, e.g. `--hparams="batch_size=2300,max_length=100"`. Without the max_length restriction, I was able to use a batch size of 2040 at most. With max_length=150, I was able to use 2200.
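
As a hypothetical example of where that override goes (everything apart from the `--hparams` string is a placeholder of mine):

```bash
# Hypothetical placement of the batch_size/max_length override from the tip above.
t2t-trainer \
  --problems=translate_enfr_wmt32k \
  --model=transformer \
  --hparams_set=transformer_big_single_gpu \
  --hparams="batch_size=2300,max_length=100" \
  --data_dir=$DATA_DIR \
  --output_dir=$OUTPUT_DIR
```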

@apeterswu

@vince62s, would you please kindly share your training settings and evaluation details? Which hparams set, how many training steps, what batch size, raw or tokenized test data, and do you test with multi-bleu.perl? I have tried the transformer_big_enfr and transformer_big settings several times, but they only reach about 32.x BLEU, which is far lower than the paper and than your 38 result. I am quite confused... Could you please help a little? Much appreciated.

@vince62s

For this one I did nothing special, but I have 4 GPUs, on each of which I can fit a batch of 4096.
Take base_V2.
Since the paper runs the base model for 100K steps on 8 GPUs, I ran it for 200K steps on 4 GPUs.

If you are on 1 GPU, you can't expect good results from the big model after 90K steps. Start with the base model.

@apeterswu

@vince62s Do you mean you used the transformer_base() setting? Actually, I ran the transformer_big experiment on 8 GPUs with a batch size of 5500, but after more than 100k steps I still got a low BLEU score. Would you kindly share the training script? Thanks.

@apeterswu

Is there any progress on reproducing En-Fr?

@robotzheng

I report a result of BLEU_uncased = 35.53, BLEU_cased = 34.53, with 8 GPUs and batch_size=3125.

@robotzheng

In fairseq, after 2 epochs, I report this result:
Namespace(beam=4, cpu=False, data='./data-bin/wmt14_en_fr', fp16=False, gen_subset='test', left_pad_source='True', left_pad_target='False', lenpen=0.6, log_format=None, log_interval=1000, max_len_a=0, max_len_b=200, max_sentences=128, max_source_positions=1024, max_target_positions=1024, max_tokens=None, min_len=1, model_overrides='{}', nbest=1, no_beamable_mm=False, no_early_stop=False, no_progress_bar=True, num_shards=1, path='./checkpoints/transformer_vaswani_wmt_en_fr_big/checkpoint_best.pt', prefix_size=0, print_alignment=False, quiet=True, raw_text=False, remove_bpe='@@ ', replace_unk=None, sampling=False, sampling_temperature=1, sampling_topk=-1, score_reference=False, seed=1, shard_id=0, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation', unkpen=0, unnormalized=False)
| [en] dictionary: 40968 types
| [fr] dictionary: 42472 types
| ./data-bin/wmt14_en_fr test 3003 examples
| ./data-bin/wmt14_en_fr test 3003 examples
| loading model(s) from ./checkpoints/transformer_vaswani_wmt_en_fr_big/checkpoint_best.pt
| Translated 3003 sentences (94316 tokens) in 71.4s (42.09 sentences/s, 1321.79 tokens/s)
| Generate test with beam=4: BLEU4 = 38.64, 66.8/45.0/32.3/23.6 (BP=0.993, ratio=0.993, syslen=82197, reflen=82787)
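
For reference, the Namespace printout above roughly corresponds to a generation call like the following (reconstructed from the printed arguments; the script name and flag spelling depend on the fairseq version):

```bash
# Rough reconstruction of the fairseq generation command from the Namespace above;
# script name and flags may differ between fairseq versions.
python generate.py ./data-bin/wmt14_en_fr \
  --path ./checkpoints/transformer_vaswani_wmt_en_fr_big/checkpoint_best.pt \
  --gen-subset test \
  --beam 4 --lenpen 0.6 \
  --max-sentences 128 \
  --remove-bpe \
  --quiet
```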
