- What does Attention in Neural Machine Translation Pay Attention to? paper
Some interesting findings (more details in the paper):
~ Only about 54% of attention mass falls on alignment points overall; by POS tag: NUM 73%, NOUN 68%, VERB just 49%, and PRT (particles, such as 's, off, up) just 36%.
~ Attention accuracy on alignments is high for NOUN and very low for VERB, yet the target-word prediction losses are about the same for both.
- REGULARIZING NEURAL NETWORKS BY PENALIZING CONFIDENT OUTPUT DISTRIBUTIONS paper
1. Label smoothing;
2. Confidence penalty (add the negative entropy of the output distribution to the loss, penalizing over-confident predictions); a minimal sketch of both is below.
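A minimal PyTorch-style sketch of the two regularizers for a standard classification setup, assuming logits and integer targets; the hyperparameter values (`smoothing=0.1`, `beta=0.1`) are illustrative rather than the paper's, and `label_smoothing` in `F.cross_entropy` needs a reasonably recent PyTorch:

```python
import torch
import torch.nn.functional as F

def smoothed_ce(logits, targets, smoothing=0.1):
    # 1. Label smoothing: mix the one-hot target with a uniform distribution.
    return F.cross_entropy(logits, targets, label_smoothing=smoothing)

def confidence_penalty_ce(logits, targets, beta=0.1):
    # 2. Confidence penalty: cross-entropy minus beta * entropy of the model's
    #    output distribution, so low-entropy (over-confident) outputs are penalized.
    ce = F.cross_entropy(logits, targets)
    log_p = F.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1).mean()
    return ce - beta * entropy
```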
- R-Drop: Regularized Dropout for Neural Networks paper: add a KL-divergence term between the outputs of two forward passes with different dropout masks (sketch below).
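A sketch of the R-Drop training loss under stated assumptions (a classifier `model` in training mode so dropout is active, inputs `x`, integer `targets`); the weight `alpha` is a task-dependent hyperparameter, and this is not the paper's official implementation:

```python
import torch
import torch.nn.functional as F

def r_drop_loss(model, x, targets, alpha=1.0):
    # Two forward passes: dropout samples a different mask each time,
    # so the two predictive distributions differ.
    logits1 = model(x)
    logits2 = model(x)
    # Usual cross-entropy, averaged over the two passes.
    ce = 0.5 * (F.cross_entropy(logits1, targets) + F.cross_entropy(logits2, targets))
    # Symmetric (bidirectional) KL divergence between the two distributions.
    log_p1 = F.log_softmax(logits1, dim=-1)
    log_p2 = F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (F.kl_div(log_p1, log_p2, reduction="batchmean", log_target=True)
                + F.kl_div(log_p2, log_p1, reduction="batchmean", log_target=True))
    return ce + alpha * kl
```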