refactor: Split dataset into 'StaticDataset' and 'GenerativeDataset' #1801

Draft: wants to merge 34 commits into master

hallerite
Collaborator

@hallerite hallerite commented Mar 11, 2025

Description

This PR simplifies the current approach, in which we define BaseDataset, SeedDataset, SyntheticDataset and GenerativeDataset. It combines BaseDataset, SeedDataset and SyntheticDataset into a single class, StaticDataset, since all three are static, i.e. they do not change size at runtime.

GenerativeDataset, on the other hand, generates new samples at runtime and is therefore structurally different, so it keeps its own class. This PR adapts GenerativeDataset to take a StaticDataset as an attribute (previously it took a SeedDataset), which is used to generate new synthetic datapoints. It also adds a way to save the data generated at runtime to a JSONL file, so that it can be loaded efficiently for downstream use.
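To make the split concrete, here is a minimal sketch of how the two classes could relate. It is not the code in this PR: the constructor arguments, `generate_fn`, `generate_new` and `save_to_jsonl` are hypothetical names chosen for illustration, and the real classes may differ.

```python
import json
import random
from pathlib import Path
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


class StaticDataset:
    """Fixed-size dataset: its length does not change after construction."""

    def __init__(self, data: List[Record]) -> None:
        # Copy the input so later mutation of the caller's list has no effect.
        self._data = list(data)

    def __len__(self) -> int:
        return len(self._data)

    def __getitem__(self, idx: int) -> Record:
        return self._data[idx]


class GenerativeDataset:
    """Grows at runtime by generating new records from a StaticDataset."""

    def __init__(
        self,
        seed_dataset: StaticDataset,
        generate_fn: Callable[[Record], Record],  # hypothetical generator hook
        seed: int = 0,
    ) -> None:
        self.seed_dataset = seed_dataset
        self._generate_fn = generate_fn
        self._rng = random.Random(seed)  # seeded for reproducible sampling
        self._generated: List[Record] = []

    def __len__(self) -> int:
        return len(self._generated)

    def generate_new(self, n: int) -> None:
        # Sample a seed record, derive a new record from it, and store it.
        for _ in range(n):
            idx = self._rng.randrange(len(self.seed_dataset))
            self._generated.append(self._generate_fn(self.seed_dataset[idx]))

    def save_to_jsonl(self, path: str) -> None:
        # One JSON object per line, so the file can be streamed back later.
        with Path(path).open("w", encoding="utf-8") as f:
            for record in self._generated:
                f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

In this sketch the static dataset keeps its size while the generative one grows with each `generate_new` call, which is the structural difference the description points to.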

Lastly, this PR removes a lot of unused code whose functionality has been implemented elsewhere.

Checklist

Go over all the following points, and put an x in all the boxes that apply.

  • I have read the CONTRIBUTION guide (required)
  • I have linked this PR to an issue using the Development section on the right sidebar or by adding Fixes #issue-number in the PR description (required)
  • I have checked if any dependencies need to be added or updated in pyproject.toml and uv lock
  • I have updated the tests accordingly (required for a bug fix or a new feature)
  • I have updated the documentation if needed:
  • I have added examples if this is a new feature

If you are unsure about any of these, don't hesitate to ask. We are here to help!

Note to reviewers

apokryphosx and others added 30 commits March 7, 2025 13:48
initialized from HF/Pytorch/JSON/list of Dicts,
remove the need for setup call and subsequently
cleanup
instead of strings and add seed for reproducibility
between simply skipping invalid datapoints in a
seed dataset and throwing an exception
seed dataset to ensure they are defined before the
other functions are
getitem and cast len(data) to a Sized to pass mypy
tests
…of Seed Dataset to ensure proper validation & add additional logging to init from JSON
call helper functions for each type of data
initialising with PyTorch Datasets
properly cover all 4 initialization functions &
add tests for sampling
@hallerite hallerite self-assigned this Mar 11, 2025
@hallerite hallerite added this to the Sprint 25 milestone Mar 11, 2025
@hallerite hallerite added the enhancement (New feature or request) and Refactor labels Mar 11, 2025
@hallerite
Collaborator Author

hallerite commented Mar 11, 2025

TO DO:

  • Update tests
  • Merge master

2 participants