
Issue with EnFr task. Maybe a tokenization problem? #518

Open
gsoul opened this issue Jan 15, 2018 · 11 comments

gsoul commented Jan 15, 2018

I use T2T version 1.4.1 with 2 × 1080 Ti GPUs and a batch size of 2048.

When I trained translate_ende_wmt32k with the big_single_gpu model for ~600k steps, I got a BLEU score of 26.02, which is only ~2 BLEU points less than reported in the "Attention Is All You Need" paper.

But when I use the same setup and parameters to train the translate_enfr_wmt32k model, I get around 33 BLEU after 1.4 million steps, which is a whole 9 BLEU points less than the result in the paper. That seems too large a gap to be explained by the difference in the number of GPUs.

On Gitter, @lukaszkaiser suggested that it might be a tokenization issue.

I'm not sure what the issue is at this point, but the results look somewhat suspicious to me.

I hope you can advise on what might be happening here.


martinpopel commented Jan 15, 2018

I have no experience with enfr, so I cannot really help. It would be nice to create a wiki page for reporting replications of the ende and enfr results with various T2T versions and hyperparameters (especially the number of GPUs and the batch size).

That said: enfr has 8 times more training data than ende. More GPUs mean a higher effective batch size, and this may influence (improve) the results, especially in later stages of training.
Moreover, I guess the "Attention Is All You Need" paper used batch_size=3072, thus an effective batch size of 8*3072=24576 (which is "approximately 25000 source tokens and 25000 target tokens"), so after 300k training steps the model had seen 3072*8*300000 training tokens. So with 2 GPUs and batch_size=2048, you need 1.8 million training steps to see the same amount of training data. However, according to my experiments this is not enough to reach the same results (perhaps a higher learning rate could compensate for the lower effective batch size).
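
Spelled out as a quick sanity check, using only the numbers from the comment above:

```bash
# Quick arithmetic check with the numbers above: data seen in the paper's setup
# (8 GPUs x batch_size 3072 x 300k steps) vs. steps needed with 2 GPUs x 2048.
echo $((8 * 3072 * 300000))               # 7372800000 sub-word tokens per side
echo $((8 * 3072 * 300000 / (2 * 2048)))  # 1800000 steps, i.e. 1.8M
```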

@vince62s

If you have only 2 GPUs, run the base model with 2×4096; you'll get better results.
On 4 GPUs × 4096 I replicated the paper with 38.2.
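
For concreteness, a hypothetical sketch of that 2-GPU base-model setup as a t2t-trainer call (paths and exact flag spelling are my assumptions, not vince62s's actual command; some T2T versions use `--problems`, others `--problem`):

```bash
# Hypothetical sketch of the suggested 2-GPU base-model run; $DATA_DIR and
# $OUTPUT_DIR are placeholders, and flag names may differ between T2T versions.
t2t-trainer \
  --problems=translate_enfr_wmt32k \
  --model=transformer \
  --hparams_set=transformer_base \
  --hparams="batch_size=4096" \
  --worker_gpu=2 \
  --data_dir=$DATA_DIR \
  --output_dir=$OUTPUT_DIR
```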


martinpopel commented Jan 15, 2018

On a GTX 1080 Ti with 11 GB of memory, transformer_big_single_gpu, and a recent version of T2T, I can use at most batch_size=2000 (2050 fails with OOM). I guess this may differ slightly depending on the maximum sentence length in the training data (so 2048 is possible for some datasets, but not much more).
Also, according to my experiments, transformer_big_single_gpu with batch_size=1500 is clearly better than transformer_base_single_gpu with batch_size=4500 (after 1 day of training on one GPU, or after 17 hours of training on 8 GPUs). And transformer_big_single_gpu with batch_size=2000 is even better.


apeterswu commented Jan 16, 2018

Hi, I tried to use the transformer_enfr_big setting to run the wmt_enfr32k data. On an M40 with 24 GB of memory, it only supports a batch size of 2000, so I trained the model on 8 M40s with batch_size=2000. After training for about 90,000 steps, the BLEU is only 27.x, which is far from the reported result. It is really hard to reproduce the wmt_enfr32k result.
Besides, for evaluation, I just feed in the en-fr.en test file (raw data) and match the output against en-fr.fr (raw data). Is there anything wrong with that?
@lukaszkaiser Could you please give a detailed description of how to reproduce the wmt_enfr32k result? I am struggling to reproduce the enfr result...
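
For what it's worth, a minimal sketch of that evaluation flow with T2T's own tools (file names, hparams set, and flag spelling are my assumptions; older T2T versions use `--problems` instead of `--problem`, and t2t-bleu is only available in newer versions):

```bash
# Hypothetical sketch: decode the raw (untokenized) test source, then score
# against the raw reference. File names and the hparams set are placeholders.
t2t-decoder \
  --problems=translate_enfr_wmt32k \
  --model=transformer \
  --hparams_set=transformer_big \
  --data_dir=$DATA_DIR \
  --output_dir=$OUTPUT_DIR \
  --decode_hparams="beam_size=4,alpha=0.6" \
  --decode_from_file=en-fr.en \
  --decode_to_file=en-fr.translated.fr

# t2t-bleu applies its own tokenization, so it can score detokenized text directly.
t2t-bleu --translation=en-fr.translated.fr --reference=en-fr.fr
```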

@martinpopel

I have one more tip on how to increase the batch size: set the max_length parameter to limit overly long sentences, e.g. `--hparams="batch_size=2300,max_length=100"`. Without the max_length restriction, I was able to use a batch size of 2040 at most. With max_length=150, I was able to use 2200.
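
As a hypothetical example of where that override goes (everything apart from the `--hparams` string is a placeholder of mine):

```bash
# Hypothetical placement of the batch_size/max_length override from the tip above.
t2t-trainer \
  --problems=translate_enfr_wmt32k \
  --model=transformer \
  --hparams_set=transformer_big_single_gpu \
  --hparams="batch_size=2300,max_length=100" \
  --data_dir=$DATA_DIR \
  --output_dir=$OUTPUT_DIR
```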

@apeterswu

@vince62s, would you please kindly share your training settings and evaluation details? Which hparams set, how many training steps, what batch size, raw or tokenized test data, and do you test with multi-bleu.perl? I have tried the transformer_big_enfr and transformer_big settings several times, but they only reach about 32.x BLEU, which is far lower than the paper and than your 38 result. I am quite confused... Could you please help a little? Much appreciated.

@vince62s

For this one I did nothing special, but I have 4 GPUs, on each of which I can fit a batch of 4096.
Take base_V2.
Since the paper runs the base model for 100K steps on 8 GPUs, I ran it for 200K steps on 4 GPUs.

If you are on 1 GPU, you can't expect good results from the big model after 90K steps. Start with the base model.

@apeterswu

@vince62s Do you mean you used the transformer_base() setting? Actually, I ran the transformer_big experiment on 8 GPUs with a batch size of 5500, but after more than 100k steps I still got a low BLEU score. Would you kindly share the training script? Thanks.

@apeterswu

Is there any progress on reproducing En-Fr?

@robotzheng

I report a result of BLEU_uncased = 35.53, BLEU_cased = 34.53, with 8 GPUs and batch_size=3125.

@robotzheng

In fairseq, after 2 epochs, I report this result:
Namespace(beam=4, cpu=False, data='./data-bin/wmt14_en_fr', fp16=False, gen_subset='test', left_pad_source='True', left_pad_target='False', lenpen=0.6, log_format=None, log_interval=1000, max_len_a=0, max_len_b=200, max_sentences=128, max_source_positions=1024, max_target_positions=1024, max_tokens=None, min_len=1, model_overrides='{}', nbest=1, no_beamable_mm=False, no_early_stop=False, no_progress_bar=True, num_shards=1, path='./checkpoints/transformer_vaswani_wmt_en_fr_big/checkpoint_best.pt', prefix_size=0, print_alignment=False, quiet=True, raw_text=False, remove_bpe='@@ ', replace_unk=None, sampling=False, sampling_temperature=1, sampling_topk=-1, score_reference=False, seed=1, shard_id=0, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation', unkpen=0, unnormalized=False)
| [en] dictionary: 40968 types
| [fr] dictionary: 42472 types
| ./data-bin/wmt14_en_fr test 3003 examples
| ./data-bin/wmt14_en_fr test 3003 examples
| loading model(s) from ./checkpoints/transformer_vaswani_wmt_en_fr_big/checkpoint_best.pt
| Translated 3003 sentences (94316 tokens) in 71.4s (42.09 sentences/s, 1321.79 tokens/s)
| Generate test with beam=4: BLEU4 = 38.64, 66.8/45.0/32.3/23.6 (BP=0.993, ratio=0.993, syslen=82197, reflen=82787)
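
For reference, the Namespace printout above roughly corresponds to a generation call like the following (reconstructed from the printed arguments; the script name and flag spelling depend on the fairseq version):

```bash
# Rough reconstruction of the fairseq generation command from the Namespace above;
# script name and flags may differ between fairseq versions.
python generate.py ./data-bin/wmt14_en_fr \
  --path ./checkpoints/transformer_vaswani_wmt_en_fr_big/checkpoint_best.pt \
  --gen-subset test \
  --beam 4 --lenpen 0.6 \
  --max-sentences 128 \
  --remove-bpe \
  --quiet
```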
