Updaters

Daniel Seita edited this page Jan 20, 2015 · 26 revisions

Overview

An updater is a generalization of a gradient optimization step. Several updaters perform gradient optimization, but others work differently: for example, NMF uses multiplicative updates, and models based on quotients of accumulated data are updated by recomputing the quotient from moving averages. Updating is separated from the generation of the gradient (or quotient pair), which happens in the model's update method. Updaters support the following methods:

init: to initialize any params and working storage for the updater.
update: the basic update called by the learner on each minibatch. 
updateM: an update called at the end of a pass over the dataset. 
clear: clears working storage, called at the beginning of a dataset pass.
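
The calling order described above can be sketched in a few lines. This is only an illustration in plain Python, not BIDMach code: the class `CountingUpdater`, the function `run_learner`, and all argument names are hypothetical, and the real methods take different arguments.

```python
class CountingUpdater:
    """Hypothetical updater that only records the order in which it is called."""

    def __init__(self):
        self.calls = []

    def init(self, model):
        # Allocate params and working storage (nothing to allocate here).
        self.calls.append("init")

    def clear(self):
        # Reset working storage at the start of a dataset pass.
        self.calls.append("clear")

    def update(self, step):
        # Basic update, called once per minibatch.
        self.calls.append("update")

    def updateM(self, ipass):
        # Update called once at the end of a pass over the dataset.
        self.calls.append("updateM")


def run_learner(updater, minibatches, npasses):
    """Drive an updater through `npasses` passes over `minibatches`."""
    updater.init(model=None)
    step = 0
    for ipass in range(npasses):
        updater.clear()
        for _batch in minibatches:
            step += 1
            # The model would compute its gradient or quotient pair here.
            updater.update(step)
        updater.updateM(ipass)
    return updater.calls
```

Running `run_learner(CountingUpdater(), ["a", "b"], 2)` records one `init`, then for each pass a `clear`, one `update` per minibatch, and a final `updateM`.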

BatchNorm Updater

Used for batch mode inference in LDA, NMF and Gibbs sampling. Accumulates update data into numerator and denominator "accumulator" matrices during a pass over the dataset. At the end of the pass, it updates the model by computing the ratio of the accumulators.
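
As a rough sketch of this accumulate-then-divide scheme, assuming the model and both accumulators are flat lists of floats (the real updater works on matrices), one might write the following. The class name `BatchNormSketch` and its method signatures are hypothetical.

```python
class BatchNormSketch:
    """Hypothetical batch-mode updater: model = numerator / denominator."""

    def __init__(self, model):
        self.model = model
        self.num = [0.0] * len(model)
        self.den = [0.0] * len(model)

    def clear(self):
        # Zero both accumulators at the beginning of a dataset pass.
        self.num = [0.0] * len(self.model)
        self.den = [0.0] * len(self.model)

    def update(self, num_update, den_update):
        # Accumulate the quotient pair produced for one minibatch.
        for i in range(len(self.model)):
            self.num[i] += num_update[i]
            self.den[i] += den_update[i]

    def updateM(self):
        # End of pass: replace the model with the ratio of the accumulators.
        # Entries whose denominator is zero are left unchanged (an assumption).
        for i in range(len(self.model)):
            if self.den[i] != 0.0:
                self.model[i] = self.num[i] / self.den[i]
```

Note that the model itself only changes in `updateM`; the per-minibatch `update` calls touch the accumulators alone.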

IncNorm Updater

The incremental version of the BatchNorm updater. Both the numerator and the denominator are updated using moving averages:

<math>M_t = \alpha U_t + (1-\alpha) M_{t-1}</math>

where <math>M_t</math> is the matrix at time <math>t</math>, <math>U_t</math> is the update at time <math>t</math>, and <math>\alpha</math> is defined as

<math>\alpha = \left( \frac{1}{t}\right)^p</math>

A value of <math>p=1</math> (set with the power option to the updater) gives a uniform average of the updates up to time <math>t</math>. Smaller values of <math>p</math> weight recent data more heavily. A value of <math>p=0.5</math> mimics the temporal envelope of ADAGRAD. In practice, values smaller than 0.5 often give the best performance. The default value is 0.3, which works well with most of the models.
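
To see the effect of <math>p</math> concretely, here is a small scalar sketch of the moving average in plain Python. The function names are hypothetical, and the real updater applies the same recurrence to both the numerator and denominator matrices.

```python
def incnorm_alpha(t, p):
    """Weight given to the update at step t (t starts at 1)."""
    return (1.0 / t) ** p


def moving_average(updates, p):
    """Apply M_t = alpha * U_t + (1 - alpha) * M_{t-1} to a sequence of scalars."""
    m = 0.0
    for t, u in enumerate(updates, start=1):
        a = incnorm_alpha(t, p)
        m = a * u + (1.0 - a) * m
    return m
```

With `p=1.0`, `moving_average([1.0, 2.0, 3.0, 6.0], 1.0)` reproduces the plain mean of the four updates, 3.0, up to floating-point rounding. With `p=0.3`, the most recent update gets a weight of about 0.66 at step 4 instead of 0.25.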

Grad(ient) Updater

This updater just adds gradient updates to the model, using a decay schedule. It includes a few options:

lrate:FMat(1f): the learning rate, or scale factor applied to gradients at each step.
texp:FMat(0.5f): the decay exponent.
waitsteps = 2: number of steps to wait before starting decay.
mask:FMat = null: a mask to stop updates to some values (e.g. constants in the model).

Updates using the gradient updater follow this scheme:

<math>M_t = \alpha U_t + M_{t-1} </math>

where <math>\alpha</math> is defined as

<math>\alpha = lrate \left( \frac{1}{t}\right)^{texp}</math>
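
The two formulas above can be sketched for a flat list of parameters as follows. This is a simplified illustration in plain Python rather than the updater's actual code: the function names are hypothetical, the `waitsteps` option is left out because the formulas do not specify how it enters, and the mask is assumed to hold 1.0 for entries that may change and 0.0 for entries held constant.

```python
def grad_alpha(t, lrate=1.0, texp=0.5):
    """Step size at step t (t starts at 1): lrate * (1/t)^texp."""
    return lrate * (1.0 / t) ** texp


def grad_step(model, grad, t, lrate=1.0, texp=0.5, mask=None):
    """Return M_t = alpha * U_t + M_{t-1}, with masked entries left unchanged."""
    a = grad_alpha(t, lrate, texp)
    new_model = []
    for i, (m, g) in enumerate(zip(model, grad)):
        keep = 1.0 if mask is None else mask[i]
        new_model.append(m + a * g * keep)
    return new_model
```

For example, at step `t=4` with the default options the step size is 0.5, so `grad_step([1.0, 2.0], [4.0, 4.0], t=4, mask=[1.0, 0.0])` moves the first entry to 3.0 and leaves the masked second entry at 2.0.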

ADAGRAD Updater

ADAGRAD is an extremely effective scheme for updating models optimized using stochastic gradients. ADAGRAD implements both a learning rate adjustment and a gradient normalization. It is (in the form we use it) an approximate second-order normalization of the gradient by a diagonal matrix. ADAGRAD's updates can be written as:

<math>M_t = \alpha U_t D + M_{t-1} </math>

where <math>\alpha</math> is defined as

<math>\alpha = lrate \left( \frac{1}{t}\right)^{texp}</math>

and <math>D</math> is a diagonal matrix whose diagonal entries are given by the vector (with the square taken elementwise)

<math>\left (\frac{1}{t} \sum_{i=1}^t U_i^2 \right)^{-vexp}</math>

This form of update differs slightly from the original ADAGRAD, in which both <math>texp</math> and <math>vexp</math> are set to 0.5. The choice of 0.5 allows strong proofs of convergence. It means that gradients are scaled by a cumulative 2-norm of gradients, which is effectively a diagonal whitening and minimizes the variance of the gradient updates. ADAGRAD contrasts with other second-order methods, which attempt to minimize bias (find a steepest descent direction) and which use an exponent of 1. Separating the directional and temporal exponents allows them to be tuned independently. Using <math>texp < 0.5</math> generally improves the convergence rate in practice, while values of <math>vexp \geq 0.5</math> work best. Using larger values of <math>vexp</math> improves bias at the expense of variance during optimization.
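
The following plain-Python sketch applies the update above to a flat list of parameters. It is an illustration under stated assumptions, not the BIDMach implementation: the class name `AdagradSketch` is hypothetical, and the small constant `eps` is an addition that is not part of the formulas above, included only to avoid raising zero to a negative power when a gradient entry has been zero so far.

```python
class AdagradSketch:
    """Hypothetical ADAGRAD-style updater with separate texp and vexp."""

    def __init__(self, n, lrate=1.0, texp=0.5, vexp=0.5, eps=1e-8):
        self.sumsq = [0.0] * n   # running sum of squared updates, per entry
        self.t = 0               # number of steps taken so far
        self.lrate = lrate
        self.texp = texp
        self.vexp = vexp
        self.eps = eps           # assumption: guard against a zero mean square

    def step(self, model, grad):
        """Update `model` in place: M_t = alpha * U_t * D + M_{t-1}."""
        self.t += 1
        t = self.t
        alpha = self.lrate * (1.0 / t) ** self.texp
        for i, g in enumerate(grad):
            self.sumsq[i] += g * g
            # Diagonal entry of D: (mean of squared updates)^(-vexp).
            d = (self.sumsq[i] / t + self.eps) ** (-self.vexp)
            model[i] += alpha * g * d
```

With <math>texp = vexp = 0.5</math> the factors of <math>t</math> in <math>\alpha</math> and <math>D</math> cancel, so each entry moves by <math>lrate</math> times its update divided by the square root of the sum of its squared updates, which is the original ADAGRAD step. For instance, successive updates of 3 and then 4 to a single parameter produce steps of about 1 and 4/5.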

CG Updater

There is an experimental conjugate gradient updater in the updater package. It is currently used only by SFA (Sparse Factor Analysis). It needs to be extended to preconditioned conjugate gradient, and its options and use are likely to change. We won't discuss it further.