Take split param from config in all load_dataset instances #2281
Description
Lets the split parameter in a dataset's config be used in load_dataset whenever load_dataset is used to load the dataset.
Motivation and Context
I had no idea what the split parameter was or was not doing because it wasn't manipulating my local jsonl data source in any way. Then I discovered in the code that it was being used only for datasets downloaded directly from Huggingface, when really we should be able to set it for all datasets.
This solves the case where I want to use only part of my training dataset rather than all of it, which comes up very often during initial testing.
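To make the intent concrete, here is a minimal sketch of forwarding a config's split to load_dataset for a local jsonl file. It is not the code in this PR: build_load_kwargs, load_local_jsonl, and the shape of the config dict are hypothetical names chosen for illustration. The only library call assumed is datasets.load_dataset with the "json" builder and its data_files and split arguments.

```python
from typing import Any, Dict, Optional


def build_load_kwargs(
    config_dataset: Dict[str, Any], default_split: Optional[str] = None
) -> Dict[str, Any]:
    """Hypothetical helper: collect keyword arguments for load_dataset.

    The split is forwarded only when the config (or the default) sets one,
    so configs without a split keep the library's default behaviour.
    """
    kwargs: Dict[str, Any] = {}
    split = config_dataset.get("split", default_split)
    if split is not None:
        kwargs["split"] = split
    return kwargs


def load_local_jsonl(path: str, config_dataset: Dict[str, Any]):
    """Hypothetical wrapper: load a local jsonl file, honouring the config split."""
    # Imported lazily so the helper above can be used without the library installed.
    from datasets import load_dataset

    return load_dataset("json", data_files=path, **build_load_kwargs(config_dataset))


# Example config entry (assumed shape): use only the 60%-70% slice of the data.
example_config = {"path": "data.jsonl", "split": "train[60%:70%]"}
example_kwargs = build_load_kwargs(example_config)
```

With a config like example_config, load_local_jsonl("data.jsonl", example_config) would request only that slice; without a split key, no split argument is passed at all.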
How has this been tested?
I've tested on the local file dataset, and it's functional: when I specified a split of train[60%:70%], it correctly returned only 10% of my data, where previously axolotl had returned the whole dataset. I assume this is functional on the other dataset types as well; there is nothing in the Huggingface docs to say otherwise.
Screenshots (if appropriate)
Types of changes
load_dataset_w_config
(I see no other boxes to check by the way)
Social Handles (Optional)