Exact behavior of different batcher settings. #12174
The batchers never cut documents. The batchers only decide how many documents should be added to the next batch.

Yes, that's correct: with `batch_by_sequence` and `size = 4`, each mini-batch will contain exactly 4 docs.

We generally recommend using the word-based batchers, because equisized batches (in number of tokens) lead to more predictable/stable parameter updates.
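For reference, the word-based batcher is selected via the `[training.batcher]` block of the training config. The snippet below is a minimal sketch assuming the v1 registered name and its documented parameters; the concrete values are only illustrative:

```ini
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
# Target batch size, measured in words (tokens) rather than in documents
size = 3000
# Allow a batch to exceed the target size by up to 20%
tolerance = 0.2
# Keep documents that are longer than the target size (in a batch of their own)
# instead of discarding them
discard_oversize = false
# Optional custom callable for measuring sequence length; null uses the default
get_length = null
```

Docs are still kept whole here; only the number of docs per batch varies, so that each batch lands near the same total number of tokens.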
Hi, I am curious to know more about the exact behavior of `batch_by_sequence` as opposed to `batch_by_words` or `batch_by_padded`.

I think I understood the word-based and padded batchers: they cut the docs in `train.spacy` into pieces of the specified length (in words) to create a batch.

My question is: does `batch_by_sequence` use the `doc` as the unit of analysis for the "sequence", rather than sentences or any other unit? So does setting `size = 4` (for instance) mean that each mini-batch will contain exactly 4 docs (as defined in the asset), assuming that `accumulate_gradient` is set to 1? Is this correct?

I know that longer sequences are sliced into padded lengths before the transformer layer, but I still see the advantage of using `batch_by_sequence` if the "sequence" is defined as a `doc` in the training set.

Thank you for your help in deciphering the internal implementation!
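For reference, the setup I have in mind is roughly the following (a sketch only; the registered batcher name and parameters are taken from the config documentation, so please correct me if I am misreading them):

```ini
[training]
# No gradient accumulation: every mini-batch produces one parameter update
accumulate_gradient = 1

[training.batcher]
@batchers = "spacy.batch_by_sequence.v1"
# If my understanding is right, "size" here counts sequences (docs),
# so every mini-batch would hold exactly 4 docs
size = 4
get_length = null
```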