Take `split` param from config in all load_dataset instances #2281

mashdragon · 2025-01-23T15:29:19Z

Description

Lets the split parameter in a dataset's config be used in load_dataset whenever load_dataset is used to load the dataset.

Motivation and Context

I had no idea what the split parameter was or was not doing because it wasn't manipulating my local jsonl data source in any way. Then I discovered in the code that it was being used only for datasets downloaded directly from Huggingface, when really we should be able to set it for all datasets.

The problem it solves is when I want to use only part of my training dataset, not all of it, which comes up very often for initial testing.

How has this been tested?

I've tested on the local file dataset, and it's functional - when I specified a split of train[60%:70%], it correctly returned only 10% of my data, where previously axolotl had returned the whole dataset. I assume this is functional on the other dataset types as well. There is nothing the Huggingface docs to say otherwise.

Screenshots (if appropriate)

Types of changes

Changes to load_dataset_w_config (I see no other boxes to check by the way)

Social Handles (Optional)

winglian · 2025-01-23T18:48:48Z

So the reason we don't manage the split within this is that in functions like load_tokenized_prepared_datasets we call them separately for each the train datasets and eval datasets as the manual split. so from within that function it calls load_dataset_w_config and we handle the split separately.

winglian · 2025-01-23T20:45:25Z

@NanoCode012 This looks good to me, but can you take another look at it too pls?

NanoCode012

Yep, looks good. I did some local testing and it works for local files. Only part that I couldn't test would be for remote files, but they should also work.

…-ai-cloud#2281)

Take split param from config in all load_dataset instances

9fbb329

winglian approved these changes Jan 23, 2025

View reviewed changes

winglian requested a review from NanoCode012 January 23, 2025 20:45

NanoCode012 self-assigned this Jan 24, 2025

NanoCode012 approved these changes Jan 24, 2025

View reviewed changes

NanoCode012 added the ready to merge label Jan 24, 2025

winglian merged commit b2774af into axolotl-ai-cloud:main Jan 24, 2025
11 checks passed

NJordan72 pushed a commit to NJordan72/axolotl that referenced this pull request Jan 28, 2025

Take split param from config in all load_dataset instances (axolotl…

f94fe76

…-ai-cloud#2281)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Take `split` param from config in all load_dataset instances #2281

Take `split` param from config in all load_dataset instances #2281

mashdragon commented Jan 23, 2025

winglian commented Jan 23, 2025

winglian commented Jan 23, 2025

NanoCode012 left a comment

Take split param from config in all load_dataset instances #2281

Take split param from config in all load_dataset instances #2281

Conversation

mashdragon commented Jan 23, 2025

Description

Motivation and Context

How has this been tested?

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

winglian commented Jan 23, 2025

winglian commented Jan 23, 2025

NanoCode012 left a comment

Choose a reason for hiding this comment

Take `split` param from config in all load_dataset instances #2281

Take `split` param from config in all load_dataset instances #2281