Exact behavior of different batcher settings. #12174
The batchers never cut documents. The batchers only decide how many documents should be added to the next batch.

Yes, that's correct: with `batch_by_sequence` and `size = 4`, each mini-batch will contain exactly 4 docs.

We generally recommend using the word-based batchers, because equisized batches (in number of tokens) lead to more predictable/stable parameter updates.
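For reference, the word-based batcher is selected via the `[training.batcher]` block of the training config. The snippet below is a minimal sketch assuming the v1 registered name and its documented parameters; the concrete values are only illustrative:

```ini
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
# Target batch size, measured in words (tokens) rather than in documents
size = 3000
# Allow a batch to exceed the target size by up to 20%
tolerance = 0.2
# Keep documents that are longer than the target size (in a batch of their own)
# instead of discarding them
discard_oversize = false
# Optional custom callable for measuring sequence length; null uses the default
get_length = null
```

Docs are still kept whole here; only the number of docs per batch varies, so that each batch lands near the same total number of tokens.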
Hi, I am curious to know more about the exact behavior of `batch_by_sequence` as opposed to `batch_by_words` or `batch_by_padded`.

I think I understood the word-based and padded batchers: they cut the docs in `train.spacy` into pieces of the specified length (in words) to create a batch.

My question is: does `batch_by_sequence` use the `doc` as the unit of analysis for the "sequence", rather than sentences or any other unit? So does setting `size = 4` (for instance) mean that each mini-batch will contain exactly 4 docs (as defined in the asset), assuming that `accumulate_gradient` is set to 1? Is this correct?

I know that longer sequences are sliced into padded lengths before the transformer layer, but I still see the advantage of using `batch_by_sequence` if the "sequence" is defined as a `doc` in the training set.

Thank you for your help in deciphering the internal implementation!
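For reference, the setup I have in mind is roughly the following (a sketch only; the registered batcher name and parameters are taken from the config documentation, so please correct me if I am misreading them):

```ini
[training]
# No gradient accumulation: every mini-batch produces one parameter update
accumulate_gradient = 1

[training.batcher]
@batchers = "spacy.batch_by_sequence.v1"
# If my understanding is right, "size" here counts sequences (docs),
# so every mini-batch would hold exactly 4 docs
size = 4
get_length = null
```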