How are batch statistics computed? #6

Open
OverLordGoldDragon opened this issue Dec 20, 2019 · 0 comments


I'm implementing recurrent batch normalization (BN) in Keras, but after reading the original paper and the papers citing it, one detail remains unclear to me: how are batch statistics computed? In the original, the authors state (pg. 3, emphasis mine):

At training time, the statistics E[h] and Var[h] are estimated by the sample mean and sample variance of the current minibatch

Yet another paper (pg. 3) that uses and cites it states:

We subscript BN by time (BN_t) to indicate that each time step tracks its own mean and variance. In practice, we track these statistics as they change over the course of training using an exponential moving average (EMA)

My question's thus two-fold:

  1. Are minibatch statistics computed per immediate minibatch, or as an EMA?
  2. How are the inference parameters gamma and beta, which are shared across all timesteps, computed? Is the computation in (1) simply averaged across all timesteps (e.g. the average of EMA_t over all t)?
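
To make the two readings in question 1 concrete, here is a minimal NumPy sketch, not taken from either paper's code. The shapes, the `momentum` value, the `eps` constant, and the helper names `minibatch_stats` and `ema_update` are all assumptions for illustration. It normalizes each timestep with the current minibatch statistics (first quote) while also tracking a per-timestep EMA (second quote):

```python
import numpy as np

def minibatch_stats(h):
    """Mean and variance over the batch axis of h, shape (batch, features).

    np.var is the biased estimator (ddof=0); which estimator the papers
    intend is not stated in the quotes, so this is an assumption.
    """
    return h.mean(axis=0), h.var(axis=0)

def ema_update(running, new, momentum=0.99):
    """One exponential-moving-average step (momentum is an assumed value)."""
    return momentum * running + (1.0 - momentum) * new

rng = np.random.default_rng(0)
T, batch, features = 5, 32, 4   # hypothetical sizes
eps = 1e-5                      # assumed numerical-stability constant

# One running mean/variance per timestep, i.e. BN_t in the second quote.
running_mean = np.zeros((T, features))
running_var = np.ones((T, features))

for step in range(3):  # three hypothetical training minibatches
    h = rng.normal(size=(batch, T, features))
    for t in range(T):
        mean_t, var_t = minibatch_stats(h[:, t])
        # Reading from the first quote: normalize with the current
        # minibatch's own statistics during training.
        h_norm = (h[:, t] - mean_t) / np.sqrt(var_t + eps)
        # Reading from the second quote: additionally track the
        # statistics of timestep t with an EMA across minibatches.
        running_mean[t] = ema_update(running_mean[t], mean_t)
        running_var[t] = ema_update(running_var[t], var_t)
```

Under this sketch the two quotes are not mutually exclusive: the minibatch statistics drive normalization at training time, and the EMA is only bookkeeping. Whether that matches the papers' intent is exactly what the question asks.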

Existing implementations in Keras and TF are listed below, but all are outdated, and I'm unsure of their correctness:

  • Keras, TF-A, and TF-B
  • All above agree that during training, immediate minibatch statistics are used, and that beta and gamma are updated as an EMA of these minibatches
  • Problem: the bn operation (in A, and presumably B & C) is applied on a single timestep slice, to be passed to the K.rnn control flow for re-iteration. Hence, the EMA is computed w.r.t. both minibatches and timesteps, which I find questionable:
      • EMA is used in place of a simple average when population statistics are dynamic (e.g. minibatch-to-minibatch), whereas here we have access to all timesteps in a minibatch prior to having to update gamma and beta
      • EMA is a worse but at times necessary alternative to a simple average; per the above, we can use the latter, so why don't we? Timestep statistics can be cached, averaged at the end, then discarded. This also holds for stateful=True
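
The difference between the two update schemes discussed above can be sketched in NumPy. This is an illustration under assumed shapes and an assumed `momentum`, not code from any of the linked implementations. The function names are hypothetical, and applying a single EMA step per minibatch in the second scheme is one possible reading of "cached, averaged at the end":

```python
import numpy as np

def ema_over_timesteps(stats, running, momentum=0.99):
    """Update `running` once per timestep, as happens when BN is called
    on each timestep slice inside the recurrent loop."""
    for s in stats:  # stats has shape (timesteps, features)
        running = momentum * running + (1.0 - momentum) * s
    return running

def ema_of_timestep_average(stats, running, momentum=0.99):
    """Cache all timestep statistics, take their simple average, then
    apply a single EMA step for the whole minibatch."""
    return momentum * running + (1.0 - momentum) * stats.mean(axis=0)

rng = np.random.default_rng(0)
T, batch, features = 10, 32, 3             # hypothetical sizes
h = rng.normal(size=(batch, T, features))  # (batch, timesteps, features)
step_means = h.mean(axis=0)                # per-timestep minibatch means

running = np.zeros(features)
a = ema_over_timesteps(step_means, running)
b = ema_of_timestep_average(step_means, running)
```

In the first scheme, timestep t contributes with weight (1 - momentum) * momentum^(T - 1 - t), so later timesteps count more, and the previous running value decays by momentum^T per minibatch. In the second, every timestep is weighted equally and the running value decays by momentum once per minibatch.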