Member

@casenave casenave commented Sep 26, 2025

Checklist

  • Typing enforced
  • Documentation updated
  • Changelog updated
  • Tests and examples updated
  • Coverage should be 100%

PR Summary

This PR modernizes the Hugging Face bridge by:

  • Supporting DatasetDict: leverage the native split mechanism for efficient partial dataset access.
  • Simplifying metadata handling: remove cumbersome tricks for setting problem_definition and infos in the dataset description; provide clear functions to load/save them via JSON and YAML.
  • Enabling multiple problem definitions per repo: allow the bridge to handle different problem definitions seamlessly.
  • I/O functions: add load/save utilities that work both with Hugging Face Hub repositories and the local filesystem.
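
As a rough illustration of the JSON-based metadata handling listed above, the sketch below round-trips an infos-like dictionary through a file on the local filesystem. It uses only the Python standard library; the helper names `save_infos_json` and `load_infos_json` are hypothetical and are not the bridge's actual API.

```python
import json
import tempfile
from pathlib import Path


def save_infos_json(infos: dict, path: Path) -> None:
    """Write a metadata dictionary to a JSON file (hypothetical helper)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        json.dump(infos, f, indent=2, sort_keys=True)


def load_infos_json(path: Path) -> dict:
    """Read a metadata dictionary back from a JSON file (hypothetical helper)."""
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)


# Toy metadata shaped like the infos shown in this PR.
infos = {
    "legal": {"license": "CC-BY-SA", "owner": "Safran"},
    "data_production": {"type": "simulation"},
}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "infos.json"
    save_infos_json(infos, target)
    reloaded = load_infos_json(target)
```

Keeping the metadata in a standalone file, rather than embedding it in the dataset description, means it can be read without loading any samples.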

HF PLAID datasets conversion

Current PLAID datasets can be downloaded, converted and uploaded with:

from plaid.bridges import huggingface_bridge
import datasets
import os

os.environ["HF_HUB_DISABLE_XET"] = "1"  # temporary (?) trick

source_repo_id = "PLAID-datasets/Tensile2d"
target_repo_id = "fabiencasenave/Tensile2d_converted"

hf_dataset = datasets.load_dataset(source_repo_id, split="all_samples")
dataset = huggingface_bridge.huggingface_dataset_to_plaid(hf_dataset, processes_number=12, verbose=True)

pb_def = huggingface_bridge.huggingface_description_to_problem_definition(hf_dataset.description)
infos = huggingface_bridge.huggingface_description_to_infos(hf_dataset.description)

main_splits = ["train_500", "test", "OOD"]
hf_dataset_dict = huggingface_bridge.plaid_dataset_to_huggingface_datasetdict(dataset, pb_def, main_splits)

huggingface_bridge.push_dataset_dict_to_hub(target_repo_id, hf_dataset_dict)
huggingface_bridge.push_dataset_infos_to_hub(target_repo_id, infos)
huggingface_bridge.push_problem_definition_to_hub(target_repo_id, "task_1", pb_def)

Results here

Then:

hf_dataset = huggingface_bridge.load_hf_dataset_from_hub(target_repo_id, split='train_500[:10]')
infos = huggingface_bridge.load_hf_infos_from_hub(target_repo_id)
pb_def = huggingface_bridge.load_hf_problem_definition_from_hub(target_repo_id, "task_1")

print(f"{hf_dataset = }")
print('--')
print(f"{infos = }")
print('--')
print(f"{pb_def = }")

gives:

hf_dataset = Dataset({
    features: ['sample'],
    num_rows: 10
})
--
infos = {'data_production': {'physics': '2D quasistatic non-linear structural mechanics, small deformations, plane strain', 'type': 'simulation'}, 'legal': {'license': 'CC-BY-SA', 'owner': 'Safran'}}
--
pb_def = ProblemDefinition(input_scalars_names=['P', 'p1', 'p2', 'p3', 'p4', 'p5'], output_scalars_names=['max_von_mises', 'max_q', 'max_U2_top', 'max_sig22_top'], output_fields_names=['U1', 'U2', 'q', 'sig11', 'sig12', 'sig22'], input_meshes_names=['/Base_2_2/Zone'], task='regression')
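
The `split='train_500[:10]'` argument above appears to follow the Hugging Face split-slicing syntax (a split name followed by an optional `[start:stop]` range), which is consistent with the 10 rows returned. The sketch below mimics that behaviour on plain Python lists for absolute indices only. It is an illustration, not how `datasets` or the bridge implements slicing; `parse_split_spec`, `select_rows` and the toy split sizes are assumptions.

```python
import re

# Split name, optionally followed by "[start:stop]" with integer bounds.
_SPLIT_SPEC = re.compile(r"^(?P<name>\w+)(?:\[(?P<start>-?\d+)?:(?P<stop>-?\d+)?\])?$")


def parse_split_spec(spec: str) -> tuple[str, slice]:
    """Turn 'train_500[:10]' into ('train_500', slice(None, 10))."""
    match = _SPLIT_SPEC.match(spec)
    if match is None:
        raise ValueError(f"Unsupported split spec: {spec!r}")
    start = int(match["start"]) if match["start"] is not None else None
    stop = int(match["stop"]) if match["stop"] is not None else None
    return match["name"], slice(start, stop)


def select_rows(splits: dict[str, list], spec: str) -> list:
    """Return the rows of the named split restricted to the requested range."""
    name, rows = parse_split_spec(spec)
    return splits[name][rows]


# Toy stand-in for a DatasetDict: split name -> list of rows.
toy_splits = {"train_500": list(range(500)), "test": list(range(500, 700))}
subset = select_rows(toy_splits, "train_500[:10]")
```

A spec without brackets, such as `"test"`, selects the whole split.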

Splits

These modifications drop support for "subsplits": the bridge now relies only on the "main splits", i.e. the ones defined natively through the DatasetDict. Hence, splits in problem_definition are ignored by the Hugging Face bridge.
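
One way to picture the distinction is the toy sketch below: only the explicitly listed main splits become top-level entries, and any other index set is simply not carried over. The function `group_by_main_split`, the split names and the indices are all hypothetical; this is not the bridge's implementation.

```python
def group_by_main_split(
    samples: list, split_indices: dict[str, list[int]], main_splits: list[str]
) -> dict[str, list]:
    """Build a DatasetDict-like mapping containing only the main splits."""
    return {
        name: [samples[i] for i in split_indices[name]] for name in main_splits
    }


# Toy data: "train_small" plays the role of a subsplit of "train".
samples = [f"sample_{i}" for i in range(6)]
split_indices = {"train": [0, 1, 2, 3], "train_small": [0, 1], "test": [4, 5]}

grouped = group_by_main_split(samples, split_indices, ["train", "test"])
```

In this picture, a subset such as the first rows of a main split would be obtained by slicing that split at load time rather than by storing a separate subsplit.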

🔗 Related issues

Addresses tasks from

#160
#241
#219

@casenave casenave requested a review from a team as a code owner September 26, 2025 19:14
@casenave casenave marked this pull request as draft September 26, 2025 19:14

codecov bot commented Sep 26, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

@casenave casenave changed the title from "♻️ Modernize the Hugging Face bridge and the split mechanisms" to "♻️ Modernize the Hugging Face bridge" Sep 27, 2025
Member Author

casenave commented Oct 2, 2025

TODO: add typing and docstrings for all added code, convert the two files added in tests/bridges into a jupytext example (so that they are compiled with the doc), and apply ruff formatting

@casenave casenave added this to the version 0.1.10 milestone Oct 5, 2025
Comment on lines +57 to +60
def get_mem():
    """Get the current memory usage of the process in MB."""
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / (1024**2)  # in MB
Contributor

Not sure this should be in plaid, as it is only used in examples. Maybe move it to examples/utils.py?
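
For reference, a self-contained version of such a helper, as it might look if it lived in an examples-only module, could be sketched as follows. The `psutil` call matches the snippet under review. The fallback branch for when `psutil` is not installed is an assumption added here so the sketch runs anywhere on Unix, and it reports peak rather than current memory.

```python
import os
import sys

try:
    import psutil  # third-party; the reviewed helper relies on it
except ImportError:
    psutil = None  # assumption: fall back to the standard library below


def get_mem() -> float:
    """Return the memory usage of the current process in MB."""
    if psutil is not None:
        # Current resident set size, as in the reviewed helper.
        return psutil.Process(os.getpid()).memory_info().rss / (1024**2)
    # Fallback (assumption, Unix only): peak resident set size.
    import resource

    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    return peak / (1024**2) if sys.platform == "darwin" else peak / 1024


# Typical use in an example: measure around a memory-heavy step.
before = get_mem()
payload = [0.0] * 1_000_000
after = get_mem()
print(f"memory before: {before:.1f} MB, after: {after:.1f} MB")
```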

Contributor

@xroynard xroynard left a comment

LGTM !
