refactor: Split dataset into 'StaticDataset' and 'GenerativeDataset' #1801

hallerite · 2025-03-11T11:50:56Z

Description

This PR simplifies the current approach, where we define BaseDataset, SeedDataset, SyntheticDataset and GenerativeDataset. Specifically, it combines BaseDataset, SeedDataset and SyntheticDataset into one class called StaticDataset, as they are all static, i.e. their size does not change at runtime.

GenerativeDataset, on the other hand, generates new samples at runtime and is therefore structurally different, which warrants its own class. This PR adapts GenerativeDataset to take a StaticDataset as an attribute (previously it took a SeedDataset), which is used to generate new synthetic datapoints. This PR also adds a way to save the data generated at runtime to a JSONL file, so that it can be loaded efficiently for downstream purposes.
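
For reviewers who want a quick mental model, here is a minimal sketch of the split described above, in plain Python with no framework dependencies. It is not the code in this PR: the class bodies, the method names `sample`, `generate_new` and `save_to_jsonl`, and the `make_variant` callable are all hypothetical and only illustrate the idea of a fixed-size dataset feeding a dataset that grows at runtime.

```python
import json
import random
from pathlib import Path
from typing import Any, Callable, Dict, List

DataPoint = Dict[str, Any]


class StaticDataset:
    """Fixed-size dataset: its length never changes after construction."""

    def __init__(self, data: List[DataPoint]) -> None:
        self._data = list(data)
        self._length = len(self._data)

    def __len__(self) -> int:
        return self._length

    def __getitem__(self, idx: int) -> DataPoint:
        if not 0 <= idx < self._length:
            raise IndexError(f"index {idx} out of range")
        return self._data[idx]

    def sample(self, rng: random.Random) -> DataPoint:
        # Hypothetical helper: draw one datapoint using the caller's RNG.
        return self._data[rng.randrange(self._length)]


class GenerativeDataset:
    """Grows at runtime by deriving new datapoints from a StaticDataset."""

    def __init__(
        self,
        seed_dataset: StaticDataset,
        make_variant: Callable[[DataPoint], DataPoint],
        seed: int = 0,
    ) -> None:
        self.seed_dataset = seed_dataset
        # Assumption: generation is modelled as a plain callable here.
        self._make_variant = make_variant
        self._rng = random.Random(seed)
        self._generated: List[DataPoint] = []

    def __len__(self) -> int:
        return len(self._generated)

    def generate_new(self, n: int) -> None:
        # Each new datapoint is derived from a randomly drawn seed datapoint.
        for _ in range(n):
            seed_point = self.seed_dataset.sample(self._rng)
            self._generated.append(self._make_variant(seed_point))

    def save_to_jsonl(self, path: str) -> None:
        # JSONL: one JSON object per line.
        with Path(path).open("w", encoding="utf-8") as f:
            for point in self._generated:
                f.write(json.dumps(point) + "\n")
```

In this sketch the StaticDataset stays at its original length while the GenerativeDataset grows with each `generate_new` call, which is the structural difference the description refers to.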

Lastly, this PR removes a large amount of unused code whose functionality has been implemented elsewhere.

Checklist

Go over all the following points, and put an x in all the boxes that apply.

I have read the CONTRIBUTION guide (required)
I have linked this PR to an issue using the Development section on the right sidebar or by adding Fixes #issue-number in the PR description (required)
I have checked if any dependencies need to be added or updated in pyproject.toml and uv lock
I have updated the tests accordingly (required for a bug fix or a new feature)
I have updated the documentation if needed:
I have added examples if this is a new feature

If you are unsure about any of these, don't hesitate to ask. We are here to help!

Note to reviewers

This PR builds on top of refactor Seed Dataset to improve compatibility and simplify usage #1734.
GenerativeDataset should be an IterableDataset, not a Dataset. This will be addressed in another PR, once this is merged. ([Refactor] Refactor GenerativeDataset for Efficient Synthetic Data Storage and Retrieval #1728)
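
To illustrate the second point, here is a rough sketch of what an iterable-style dataset looks like, written in plain Python as a stand-in for subclassing `torch.utils.data.IterableDataset`. The class name `IterableGenerativeDataset` and the `sample_id` field are hypothetical, and the follow-up PR may look quite different. The key property is that the class only defines `__iter__` and yields samples lazily, rather than exposing `__len__` and `__getitem__`.

```python
import itertools
import random
from typing import Any, Dict, Iterator, List

DataPoint = Dict[str, Any]


class IterableGenerativeDataset:
    """Yields samples lazily; deliberately defines no __len__ or __getitem__."""

    def __init__(self, seeds: List[DataPoint], seed: int = 0) -> None:
        if not seeds:
            raise ValueError("at least one seed datapoint is required")
        self._seeds = list(seeds)
        self._seed = seed

    def __iter__(self) -> Iterator[DataPoint]:
        # A fresh RNG per iteration makes every pass reproducible.
        rng = random.Random(self._seed)
        counter = 0
        while True:  # unbounded stream: the consumer decides when to stop
            base = rng.choice(self._seeds)
            yield {**base, "sample_id": counter}
            counter += 1


# Usage: take a bounded slice of the otherwise endless stream.
stream = IterableGenerativeDataset([{"question": "1+1", "answer": "2"}])
first_three = list(itertools.islice(stream, 3))
```

Because the stream has no fixed length, a consumer has to bound it explicitly, for example with `itertools.islice` as shown.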

hallerite · 2025-03-11T11:56:03Z

TO DO:

Update tests
Merge master

apokryphosx and others added 30 commits March 7, 2025 13:48

feat: Refactor Seed Dataset to be possible to be initialized from HF/Pytorch/JSON/list of Dicts, remove the need for setup call and subsequently cleanup

82eb593

fix: Update Seed Dataset tests according to the changes

52f012a

fix: fix precommit and missing space for assertion

93a114c

feat: Extend test coverage to include all possible conversions

7b08d62

fix: Choose more suitable mock data and enhance test coverage

3f1861b

fix: Change json path handling as Path objects instead of strings and add seed for reproducibility

ad1949b

fix: Move length to init and change sample method to use seed

50d0795

feat: Implement strict flag to let user chose between simply skipping invalid datapoints in a seed dataset and throwing an exception

79d1429

fix: Put __len__ and __getitem__ to the top of seed dataset to ensure they are defined before the other functions are

9bd18d4

fix: Add explanation as to why use list of dicts to store data in Seed Dataset

447bf07

style: Fix code style to adhere to checks

521f511

fix: Update seed init and hf dataset tests to changes in base.py

4162a9a

style: Fix code style to adhere to code style requirements

094762e

fix: Adjust code to utilize self._length in getitem and cast len(data) to a Sized to pass mypy tests

5d222c9

fix: Adjust create datapoint in seed dataset to properly work with strict mode

a7f2f02

fix: remove casting

3f6aec0

Merge branch 'master' into refactor/simplify-dataset

f7e78a8

fix: Add safety features to Seed Dataset init with JSON

16518aa

fix: Remove default values from necessary fields in create_datapoint of Seed Dataset to ensure proper validation & add additional logging to init from JSON

c334161

style: Refactor Seed Dataset init functions to call helper functions for each type of data

3e0ffe8

style: Fix code style to adhere to style requirements

10e22f1

fix: Changed difficulty default value type to string

0055754

refactor: simplify setup

163d9ed

fix: replace direct access with item.get() to circumvent KeyError

9ad0cd7

fix: fix line length

4c7fa21

fix: Change incorrect string formatting

c33f85e

fix: Improve Error message for non dict data when initialising with PyTorch Datasets

f3b3454

refactor: Refactor tests for seed dataset to properly cover all 4 initialization functions & add tests for sampling

b5baea3

fix: Add type checks to fix mypy errors and adjust tests to it

1627538

chore: remove unused objects

14360ec

hallerite added 3 commits March 11, 2025 12:25

chore: remove unused bases class

8aee09c

refactor: split into static and generative dataset

8688490

feat: add way to save generated data

3f44d84

hallerite self-assigned this Mar 11, 2025

hallerite requested review from Wendong-Fan and apokryphosx March 11, 2025 11:51

hallerite added this to the Sprint 25 milestone Mar 11, 2025

hallerite added enhancement New feature or request Refactor labels Mar 11, 2025

fix: fix pre-commit issues

624ea0f

This was referenced Mar 11, 2025

refactor Seed Dataset to improve compatibility and simplify usage #1734

Merged

refactor BaseEnvironment into SingleStep and MultiStep environments #1810

Draft
