
Take split param from config in all load_dataset instances #2281

Merged: 1 commit merged into axolotl-ai-cloud:main on Jan 24, 2025

Conversation

mashdragon (Contributor)

Description

Lets the split parameter in a dataset's config be passed through to load_dataset whenever load_dataset is used to load that dataset.
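With this change, a per-dataset split set in the config should apply regardless of where the dataset comes from. A minimal sketch of what that looks like in an axolotl config (the path, type, and file name here are placeholders):

```yaml
datasets:
  - path: data/train.jsonl   # local jsonl source
    type: alpaca             # placeholder prompt format
    split: train[60%:70%]    # now honored for local files too, not just hub datasets
```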

Motivation and Context

I had no idea what the split parameter was or wasn't doing, because it wasn't affecting my local jsonl data source in any way. I then discovered in the code that it was only being applied to datasets downloaded directly from Hugging Face, when really we should be able to set it for all datasets.

It solves the problem of wanting to use only part of a training dataset rather than all of it, which comes up very often during initial testing.

How has this been tested?

I've tested this on a local file dataset, and it's functional: when I specified a split of train[60%:70%], it correctly returned only 10% of my data, where previously axolotl had returned the whole dataset. I assume this works for the other dataset types as well; there is nothing in the Hugging Face docs to say otherwise.
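For reference, train[60%:70%] is Hugging Face datasets' standard split-slicing notation. A standalone sketch of the behavior being tested, independent of axolotl (the file name is a placeholder):

```python
from datasets import load_dataset

# Load a local jsonl file, keeping only rows between the 60% and 70% marks,
# i.e. 10% of the data. Passing split="train" instead would load everything.
subset = load_dataset("json", data_files="train.jsonl", split="train[60%:70%]")
print(len(subset))
```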


Types of changes

  • Changes to load_dataset_w_config (I see no other boxes to check by the way)


@winglian (Collaborator)

So the reason we don't manage the split within this function is that callers like load_tokenized_prepared_datasets invoke it separately for the train datasets and the eval datasets as the manual split; from within that function, load_dataset_w_config is called and the split is handled separately.
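A simplified, hypothetical sketch of the control flow being described; the signatures, bodies, and config keys here are illustrative assumptions, not axolotl's actual implementation:

```python
from datasets import Dataset, load_dataset

def load_dataset_w_config(ds_cfg: dict) -> Dataset:
    # With this PR, the config's `split` is passed to load_dataset for every
    # source; previously it only took effect for hub-hosted datasets.
    return load_dataset("json", data_files=ds_cfg["path"],
                        split=ds_cfg.get("split", "train"))

def load_tokenized_prepared_datasets(cfg: dict, split: str) -> list[Dataset]:
    # Called once per side of the manual split: once for the train datasets
    # and once for the eval datasets, each from its own config section.
    key = "datasets" if split == "train" else "test_datasets"
    return [load_dataset_w_config(ds_cfg) for ds_cfg in cfg.get(key, [])]
```

The concern raised is that the caller already decides train vs. eval, so a split applied inside load_dataset_w_config has to compose sensibly with that outer choice.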

winglian requested a review from NanoCode012 on January 23, 2025 at 20:45
@winglian (Collaborator)

@NanoCode012 This looks good to me, but can you take another look at it too pls?

NanoCode012 self-assigned this on Jan 24, 2025
@NanoCode012 (Collaborator) left a comment


Yep, looks good. I did some local testing and it works for local files. The only part I couldn't test is remote files, but they should also work.

winglian merged commit b2774af into axolotl-ai-cloud:main on Jan 24, 2025
11 checks passed
NJordan72 pushed a commit to NJordan72/axolotl that referenced this pull request on Jan 28, 2025