# Lecture 7 - Conditional Language Models

  • An unconditional LM assigns probabilities to sequences of words w = (w_1, ..., w_l)
  • Task: model the probability of the next word given the history of previous words
  • A conditional LM assigns probabilities to sequences of words given a conditioning context x
  • To train, we need paired samples {(x, w)}
  • w^* = arg max_w p(w | x). Approximate it using beam search.
  • Evaluating conditional LMs - cross-entropy or perplexity, or (easier and more interpretable) task-specific evaluation: BLEU, METEOR, WER, ROUGE - a metric L(w^*, w_{ref})
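The intrinsic metrics above can be sketched in a few lines. This is a toy example with hypothetical hand-picked per-word probabilities, not output from a trained model; it just shows how cross-entropy and perplexity relate:

```python
import math

# Hypothetical per-word probabilities a conditional LM assigns to a
# reference sequence w given context x (made-up numbers for illustration).
probs = [0.4, 0.25, 0.1, 0.5]  # p(w_t | x, w_<t) at each position t

# Cross-entropy in bits per word; perplexity is 2^cross_entropy.
cross_entropy = -sum(math.log2(p) for p in probs) / len(probs)
perplexity = 2 ** cross_entropy
print(cross_entropy, perplexity)
```

Lower cross-entropy (and hence perplexity) means the model assigns more probability mass to the reference sequence.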
  • Encoder-Decoder model
    • how should we define c=embed(x)?
    • Convolutional sentence model (CSM)
    • Good - learns interactions among local features in context; long-range dependencies can be learnt. Bad - sentences have different lengths
    • RNN Decoder - p(tom|s,<s>) x p(likes|s,<s>,tom) x p(beer|s,<s>,tom,likes) x p(</s>|s,<s>,tom,likes,beer)
    • Sutskever et al. (2014) - LSTM Encoder-Decoder
    • Good - RNNs deal with seqs of various lengths, LSTMs can prop grads a long distance, simple architecture. Bad - hidden state has to remember a lot of information!
    • Trick#1 - read the input seq "backwards": +4 BLEU
    • Trick#2 - use an ensemble of J independently trained models: ensemble of 2 models gives +3 BLEU, ensemble of 5 models gives +4.5 BLEU
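One simple way to combine an ensemble of J independently trained decoders is to average their predicted next-word distributions at each step. A minimal sketch with hypothetical toy distributions (the averaging scheme here is plain arithmetic mean, one of several choices):

```python
import numpy as np

# Two hypothetical models' next-word distributions over a 3-word vocab.
model_probs = np.array([
    [0.7, 0.2, 0.1],  # model 1: p(w_t | x, w_<t)
    [0.5, 0.3, 0.2],  # model 2: p(w_t | x, w_<t)
])

# Ensemble prediction: average the distributions across the J models.
ensemble = model_probs.mean(axis=0)
print(ensemble)
```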
    • We want to find the most probable (MAP) output given the input: w^* = arg max_w p(w|x). This is a hard (intractable) problem for RNNs; approximate it with a greedy search: w^*_1 = arg max p(w_1 | x), w^*_2 = arg max p(w_2 | x, w^*_1), ...
    • Better approximation - beam search (keep track of the top-b hypotheses - beam size b).
    • Use beam search: +1 BLEU
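The decoding idea above can be sketched as follows. Here `step_probs` stands in for the decoder's p(w_t | x, w_<t); it is a hypothetical hand-written table, not a trained RNN, chosen so the sketch runs on the lecture's "tom likes beer" example:

```python
import math

def step_probs(prefix):
    # Hypothetical next-word distributions for each prefix (toy data).
    table = {
        ("<s>",): {"tom": 0.8, "beer": 0.2},
        ("<s>", "tom"): {"likes": 0.7, "</s>": 0.3},
        ("<s>", "tom", "likes"): {"beer": 0.8, "</s>": 0.2},
        ("<s>", "tom", "likes", "beer"): {"</s>": 1.0},
        ("<s>", "beer"): {"</s>": 1.0},
    }
    return table[tuple(prefix)]

def beam_search(b=2, max_len=5):
    # Each hypothesis is (log-probability, word sequence).
    beams = [(0.0, ["<s>"])]
    complete = []
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            for word, p in step_probs(seq).items():
                hyp = (logp + math.log(p), seq + [word])
                # Hypotheses ending in </s> are finished; others compete.
                (complete if word == "</s>" else candidates).append(hyp)
        # Keep only the top-b partial hypotheses (the "beam").
        beams = sorted(candidates, reverse=True)[:b]
        if not beams:
            break
    return max(complete)

logp, words = beam_search()
print(words, math.exp(logp))
```

Greedy search is the special case b = 1; larger beams trade compute for a better approximation of the arg max.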
  • Image caption generation
  • Kiros et al. (2013) - Innovation: multiplicative interactions in the decoder n-gram model
    • Encoder: x = embed(x)
    • Unconditional n-gram model: h=W[w, ...]
    • Simple conditional n-gram LM: h=W[w, ...] + Cx
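The two hidden-state formulas above can be sketched with NumPy. All sizes and weights here are hypothetical random placeholders; the point is just that conditioning adds the term Cx to the unconditional hidden state:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 2-word n-gram context, 4-dim word embeddings,
# 5-dim conditioning vector x, 6-dim hidden layer.
d_emb, d_x, d_h, context = 4, 5, 6, 2

W = rng.standard_normal((d_h, context * d_emb))  # context-word weights
C = rng.standard_normal((d_h, d_x))              # conditioning weights
word_embs = [rng.standard_normal(d_emb) for _ in range(context)]
x = rng.standard_normal(d_x)                     # x = embed(...) from the encoder

# Unconditional n-gram hidden state: h = W[w, ...]
concat = np.concatenate(word_embs)
h_uncond = W @ concat

# Simple conditional n-gram LM: h = W[w, ...] + Cx
h_cond = W @ concat + C @ x
print(h_cond.shape)
```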