Labels: enhancement (New feature or request), organisation (Evolution in project organisation)
Description
I propose updating the dataset architecture to mimic that of HuggingFace datasets as closely as possible:
```
folder
├── data
│   ├── train
│   │   ├── sample_000000000
│   │   │   ├── features_000000000.cgns
│   │   │   └── features_000000001.cgns
│   │   └── sample_000000001
│   │       └── ...
│   └── test
│       └── ...
├── infos.yaml
└── problem_definitions
    ├── task_1
    │   ├── problem_infos.yaml
    │   └── split.json
    └── task_2
        └── ...
```
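The fixed-width numbering in this layout makes sample and feature paths resolvable programmatically. A minimal sketch, assuming nine-digit zero padding and the file names shown above (the `sample_path` helper is hypothetical, not an existing PLAID function):

```python
from pathlib import Path

def sample_path(root: str, split: str, sample_id: int, feature_id: int) -> Path:
    """Build the path of a features file under the proposed layout.

    The root name, padding width, and file naming mirror the tree above;
    they are illustrative assumptions, not a fixed PLAID API.
    """
    return (
        Path(root)
        / "data"
        / split
        / f"sample_{sample_id:09d}"
        / f"features_{feature_id:09d}.cgns"
    )

# Example: first features file of the first train sample
print(sample_path("folder", "train", 0, 0))
```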
Like HF datasets, we can introduce `Dataset` (the actual one) and `DatasetDict: dict[str, Dataset]`. The split will contain the keys of the `DatasetDict` and subsplits, with numbering local to the corresponding key (train, test, ...).
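A minimal sketch of this split-local numbering, assuming a toy `Dataset` stand-in (the class and the shape of the split mapping are illustrative assumptions, not the actual PLAID API):

```python
class Dataset:
    """Toy stand-in for the actual PLAID Dataset."""
    def __init__(self, samples: list[dict]):
        self.samples = samples

    def __len__(self) -> int:
        return len(self.samples)

# As in HF datasets, a DatasetDict is simply a mapping split name -> Dataset.
DatasetDict = dict[str, Dataset]

dataset_dict: DatasetDict = {
    "train": Dataset([{"id": i} for i in range(3)]),  # ids local to "train"
    "test": Dataset([{"id": i} for i in range(2)]),   # ids restart at 0
}

# A split then only needs the DatasetDict keys and the local sample ids:
split = {name: list(range(len(ds))) for name, ds in dataset_dict.items()}
print(split)  # {'train': [0, 1, 2], 'test': [0, 1]}
```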
Doing this will make the HF datasets repo layout very similar to the data memory mapping (this was obtained by using `hf_dataset.push_to_hub(repo_id)` and our Hugging Face bridge).
The multiple problem definition proposal will indeed enable multiple tasks defined over the same dataset.
The work I did in #240 implements this for HF dataset repos of PLAID datasets. I think the problem definition can be modified as well to:
- indicate the `in` and `out` splits concerned by the regression task
- name the score function for the moment (maybe later we should find a way to define an implementation)
- rely on the flattened tree keys for the `in` and `out` feature identifiers, e.g. `Base_2_2/Zone/GridCoordinates/CoordinateX`
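To illustrate, a hypothetical `problem_infos.yaml` along these lines might look as follows (every field name here is an assumption for discussion, not a finalized schema):

```yaml
# Hypothetical problem_infos.yaml for task_1; all keys are illustrative.
task: regression
score_function: RMSE        # named only; no implementation defined for now
splits:
  in: train                 # splits concerned by the regression task
  out: test
in_features:                # flattened tree keys as feature identifiers
  - Base_2_2/Zone/GridCoordinates/CoordinateX
out_features:
  # - <flattened tree key of the predicted feature>
```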