Batch sampler #3123
I want to perform training where I have arranged the data samples in the exact order I desire. If shuffling occurs, everything will be disrupted. Therefore, I would like to ask: does the batch sampler perform any data shuffling in this context?
Hello! Yes, data shuffling is certainly enabled in several of the batch samplers by default. To be specific, the default batch sampler wraps the dataset indices in a `SubsetRandomSampler`, which shuffles them.

If you'd like to avoid shuffling, then you can use a custom batch sampler quite easily:

```python
from __future__ import annotations

import torch
from datasets import Dataset
from torch.utils.data import BatchSampler
from sentence_transformers import SentenceTransformerTrainer
from sentence_transformers.sampler import DefaultBatchSampler


class CustomSentenceTransformerTrainer(SentenceTransformerTrainer):
    def get_batch_sampler(
        self,
        dataset: Dataset,
        batch_size: int,
        drop_last: bool,
        valid_label_columns: list[str] | None = None,
        generator: torch.Generator | None = None,
    ) -> BatchSampler | None:
        # Pass the indices in order instead of wrapping them in a
        # SubsetRandomSampler, so batches follow the dataset order.
        return DefaultBatchSampler(
            range(len(dataset)),
            batch_size=batch_size,
            drop_last=drop_last,
        )
```

I think this will already do it. Instead of using a `SubsetRandomSampler` like the default implementation, this passes `range(len(dataset))` directly, so the batches are drawn in dataset order without any shuffling.
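If it helps, here is a minimal usage sketch of the subclass above. The model name, the toy dataset, and the loss choice are all illustrative assumptions, not from this thread:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# A toy dataset whose row order we want preserved during training.
train_dataset = Dataset.from_dict({
    "anchor": ["first query", "second query", "third query"],
    "positive": ["first answer", "second answer", "third answer"],
})

loss = losses.MultipleNegativesRankingLoss(model)

# CustomSentenceTransformerTrainer is the subclass defined above; its
# batch sampler iterates the rows in dataset order, without shuffling.
trainer = CustomSentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```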
Hi Tom, can you give an example of a custom multi-dataset batch sampler?
These are two examples of the (existing) multi-dataset batch samplers: `sentence_transformers/sampler.py`, lines 225 to 316 (at commit cfb883c), which contains `RoundRobinBatchSampler` and `ProportionalBatchSampler`.
If you want to make your own, then I would recommend following this format (e.g. the same constructor arguments: a `ConcatDataset`, a list of per-dataset batch samplers, a generator, and a seed), so it stays compatible with the trainer. See the sketch below.
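As a hedged illustration, such a sampler might look like the following sketch. It mirrors the constructor of the built-in multi-dataset samplers and simply yields all batches from each sub-dataset in turn, never shuffling within or across datasets; the class name and the sequential strategy are my own, not part of the library:

```python
from __future__ import annotations

from torch.utils.data import BatchSampler, ConcatDataset


class SequentialMultiDatasetBatchSampler(BatchSampler):
    """Yield every batch from the first dataset, then the second, and so on.

    A sketch following the format of the built-in multi-dataset samplers;
    it performs no shuffling within or across datasets.
    """

    def __init__(
        self,
        dataset: ConcatDataset,
        batch_samplers: list[BatchSampler],
        generator=None,
        seed=None,
    ) -> None:
        self.dataset = dataset
        self.batch_samplers = batch_samplers
        # generator and seed are unused here; kept for interface parity.
        self.generator = generator
        self.seed = seed

    def __iter__(self):
        # ConcatDataset lays its sub-datasets out back to back, so each
        # sub-sampler's local indices must be shifted by the total size
        # of the preceding datasets.
        offsets = [0] + self.dataset.cumulative_sizes[:-1]
        for offset, batch_sampler in zip(offsets, self.batch_samplers):
            for batch in batch_sampler:
                yield [offset + index for index in batch]

    def __len__(self) -> int:
        return sum(len(batch_sampler) for batch_sampler in self.batch_samplers)
```

To plug it in, you could subclass the trainer and override its `get_multi_dataset_batch_sampler` method (the multi-dataset counterpart of `get_batch_sampler` above) to return this sampler; check the exact signature in your installed version.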