♻️ Modernize the Hugging Face bridge #240

casenave · 2025-09-26T19:14:52Z

Checklist

PR Summary

This PR modernizes the Hugging Face bridge by:

Supporting DatasetDict: leverage the native split mechanism for efficient partial dataset access.
Simplifying metadata handling: remove cumbersome tricks for setting problem_definition and infos in the dataset description; provide clear functions to load/save them via JSON and YAML.
Enabling multiple problem definitions per repo: allow the bridge to handle different problem definitions seamlessly.
I/O functions: add load/save utilities that work both with Hugging Face Hub repositories and the local filesystem.
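
As an illustration of the metadata handling described above, here is a minimal sketch of a local JSON round trip with one file per problem definition. It uses only the standard library and is not the bridge's actual implementation: the helper names `save_json` and `load_json`, the `problem_definitions/` folder layout, the trimmed-down problem definition fields, and the `task_2` entry are all assumptions made for the example. YAML is left out to avoid a third-party dependency.

```python
import json
import tempfile
from pathlib import Path


def save_json(path: Path, payload: dict) -> None:
    """Write a dict to a JSON file, creating parent folders if needed (hypothetical helper)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")


def load_json(path: Path) -> dict:
    """Read a dict back from a JSON file (hypothetical helper)."""
    return json.loads(path.read_text(encoding="utf-8"))


# Dataset-level metadata, stored once per repository.
infos = {
    "data_production": {"type": "simulation"},
    "legal": {"license": "CC-BY-SA", "owner": "Safran"},
}

# Several problem definitions can live side by side, one file per name.
# "task_2" is a made-up second task, shown only to illustrate the idea.
pb_defs = {
    "task_1": {"task": "regression", "input_scalars_names": ["P", "p1"]},
    "task_2": {"task": "regression", "input_scalars_names": ["P"]},
}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    save_json(root / "infos.json", infos)
    for name, pb_def in pb_defs.items():
        save_json(root / "problem_definitions" / f"{name}.json", pb_def)

    # Load everything back and list the problem definitions that are available.
    loaded_infos = load_json(root / "infos.json")
    loaded_task_1 = load_json(root / "problem_definitions" / "task_1.json")
    available_tasks = sorted(p.stem for p in (root / "problem_definitions").iterdir())
```

Keeping each problem definition in its own named file is one simple way to let a single repo carry several tasks without touching the dataset description.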

HF PLAID datasets conversion

Current PLAID datasets can be downloaded, converted and uploaded with:

from plaid.bridges import huggingface_bridge
import datasets
import os

os.environ["HF_HUB_DISABLE_XET"] = "1"  # temporary (?) trick

source_repo_id = "PLAID-datasets/Tensile2d"
target_repo_id = "fabiencasenave/Tensile2d_converted"

hf_dataset = datasets.load_dataset(source_repo_id, split="all_samples")
dataset = huggingface_bridge.huggingface_dataset_to_plaid(hf_dataset, processes_number=12, verbose=True)

pb_def = huggingface_bridge.huggingface_description_to_problem_definition(hf_dataset.description)
infos = huggingface_bridge.huggingface_description_to_infos(hf_dataset.description)

main_splits = ["train_500", "test", "OOD"]
hf_dataset_dict = huggingface_bridge.plaid_dataset_to_huggingface_datasetdict(dataset, pb_def, main_splits)

huggingface_bridge.push_dataset_dict_to_hub(target_repo_id, hf_dataset_dict)
huggingface_bridge.push_dataset_infos_to_hub(target_repo_id, infos)
huggingface_bridge.push_problem_definition_to_hub(target_repo_id, "task_1", pb_def)

Results here

Then:

hf_dataset = huggingface_bridge.load_hf_dataset_from_hub(target_repo_id, split='train_500[:10]')
infos = huggingface_bridge.load_hf_infos_from_hub(target_repo_id)
pb_def = huggingface_bridge.load_hf_problem_definition_from_hub(target_repo_id, "task_1")

print(f"{hf_dataset = }")
print('--')
print(f"{infos = }")
print('--')
print(f"{pb_def = }")

gives:

hf_dataset = Dataset({
    features: ['sample'],
    num_rows: 10
})
--
infos = {'data_production': {'physics': '2D quasistatic non-linear structural mechanics, small deformations, plane strain', 'type': 'simulation'}, 'legal': {'license': 'CC-BY-SA', 'owner': 'Safran'}}
--
pb_def = ProblemDefinition(input_scalars_names=['P', 'p1', 'p2', 'p3', 'p4', 'p5'], output_scalars_names=['max_von_mises', 'max_q', 'max_U2_top', 'max_sig22_top'], output_fields_names=['U1', 'U2', 'q', 'sig11', 'sig12', 'sig22'], input_meshes_names=['/Base_2_2/Zone'], task='regression')

Splits

These modifications drop support for "subsplits": the bridge now relies only on the "main splits", i.e. the ones defined natively through the DatasetDict. Hence, splits defined in problem_definition are ignored by the Hugging Face bridge.
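
To make the "main split" idea concrete, here is a small pure-Python sketch of how a flat list of samples could be partitioned into named main splits, with everything else left out. It is only an illustration under assumptions: `build_main_splits`, the toy sample names, the index lists, and the `train_8` subsplit are hypothetical and do not reflect how the bridge is actually implemented.

```python
def build_main_splits(samples, split_ids, main_splits):
    """Return {split_name: samples of that split}, keeping only the requested main splits."""
    missing = [name for name in main_splits if name not in split_ids]
    if missing:
        raise KeyError(f"Unknown split(s): {missing}")
    return {name: [samples[i] for i in split_ids[name]] for name in main_splits}


# Toy data standing in for PLAID samples.
samples = [f"sample_{i}" for i in range(6)]

# Index lists per split name; "train_8" plays the role of a subsplit
# that is no longer materialized by the bridge.
split_ids = {
    "train_500": [0, 1, 2],
    "test": [3, 4],
    "OOD": [5],
    "train_8": [0, 1],
}

splits = build_main_splits(samples, split_ids, ["train_500", "test", "OOD"])

# Partial access then reduces to slicing inside a main split,
# similar in spirit to split='train_500[:10]' in the example above.
first_two = splits["train_500"][:2]
```

A subset such as `train_8` can still be recovered on the user side by slicing or indexing a main split after loading.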

🔗 Related issues

Addresses tasks from

#160
#241
#219

codecov · 2025-09-26T19:16:34Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

tests/bridges/temp2.py

casenave · 2025-10-02T20:01:06Z

TODO: add type hints and docstrings for all added code, convert the two files added in tests/bridges into a jupytext example (to be built with the doc), and apply ruff formatting.

xroynard · 2025-10-09T09:48:07Z

src/plaid/utils/base.py

+def get_mem():
+    """Get the current memory usage of the process in MB."""
+    process = psutil.Process(os.getpid())
+    return process.memory_info().rss / (1024**2)  # in MB


Not sure this should be in plaid, since it is only used in examples. Maybe move it to examples/utils.py?
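
For reference, a possible shape for such an examples-only helper is sketched below. This is an assumption about how it could look, not the code that was merged: it keeps the psutil-based measurement from the diff, and adds a standard-library fallback (peak rather than current memory, Unix only) so that the examples still run when psutil is not installed.

```python
import os
import sys

try:
    import psutil  # optional dependency, only needed by the examples
except ImportError:
    psutil = None


def get_mem() -> float:
    """Return the memory usage of the current process in MB.

    With psutil, this is the current resident set size (RSS).
    Without it, fall back to the peak RSS reported by the `resource`
    module (Unix only), which is an upper bound on the current usage.
    """
    if psutil is not None:
        process = psutil.Process(os.getpid())
        return process.memory_info().rss / (1024**2)

    import resource

    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is expressed in bytes on macOS and in kilobytes on Linux.
    divisor = 1024**2 if sys.platform == "darwin" else 1024
    return peak / divisor


# Typical use in an example script: compare memory before and after a step.
mem_before = get_mem()
buffer = bytearray(10 * 1024**2)  # allocate roughly 10 MB
mem_after = get_mem()
```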

xroynard

LGTM !

start work

ad47c5c

casenave requested a review from a team as a code owner September 26, 2025 19:14

casenave marked this pull request as draft September 26, 2025 19:14

casenave added 4 commits September 26, 2025 21:16

continue

99e25ba

continue

38ae46d

continue

0b4e58c

continue

7fae9ea

casenave changed the title ~~♻️ Modernize the Hugging Face bridge and the split mechanisms~~ ♻️ Modernize the Hugging Face bridge Sep 27, 2025

casenave added 3 commits September 27, 2025 15:32

continue

f6c7cb5

continue

e0fe06a

drop split support for the Hugging Face bridge

e97b15d

casenave mentioned this pull request Sep 27, 2025

[DATASET ROADMAP] convert public bases #219

Open

18 tasks

casenave and others added 7 commits September 27, 2025 16:22

continue

55089c2

continue

fa0dd31

continue

d83b7b1

wip

7d3eedc

wip

d86611c

continue

0d59d1c

continue

a674132

xroynard reviewed Sep 30, 2025

tests/bridges/temp2.py (outdated, resolved)

fabiencasenave and others added 9 commits September 30, 2025 15:39

continue

c4d39ac

continue

cecda9d

continue

c18e2c7

continue

bb80f8b

save

44f3fa6

continue

1de5ccb

continue

07e5578

continue

9e47557

continue

d5ae7eb

casenave added 4 commits October 2, 2025 18:03

continue

1cbf8ec

continue

1e3a483

continue

552b089

continue

91437f1

casenave mentioned this pull request Oct 3, 2025

[ARCHITECTURE UPDATE] update data organisation #241

Open

fabiencasenave and others added 10 commits October 3, 2025 13:55

continue

517b825

continue

e34782f

continue

837c679

continue

af9f4e5

continue

a54de4c

final clean

330809b

add a file for checking regression with respect to PLAID benchmarks

4d64247

add a file for checking regression with respect to PLAID benchmarks

4303d1a

Merge branch 'datasets_roadmap' into hf_splits

9d872b6

update CHANGELOG

cad4c76

casenave added this to the version 0.1.10 milestone Oct 5, 2025

fabiencasenave and others added 3 commits October 6, 2025 15:26

change default enforce_shapes value

3d1957c

merge

8bf1cd6

Merge branch 'datasets_roadmap' into hf_splits

481ea34

xroynard reviewed Oct 9, 2025

xroynard approved these changes Oct 9, 2025
