WIP: Parquet as general dataset format #225

Open · wants to merge 53 commits into base: main
Conversation

BerndDoser (Member)

Apache Parquet offers several advantages as a dataset:

  • Compressible
  • Streamable
  • Allows random access
  • Supports versioning

Apache Arrow is an ideal framework for managing Parquet datasets:

  • In-memory data processing framework
  • Support for various languages: Python (pyarrow), C++, Rust, Julia, ...

Special features for ~Spherinator`:

  • Support different data types: images, time series, point clouds, graphs, data cubes, spectra
  • Metadata in extra columns

ParquetDataset

  • Loads the entire dataset into memory at once
  • Specifies the data column used as training data
  • Multidimensional arrays are flattened, since Parquet only supports 1D arrays; the original shape
    is stored in the metadata field <data_column>_shape and restored in the DataLoader.
  • Multiple data columns are concatenated, e.g. BP and RP flux

Define ParquetDataModule in config.yaml:

data:
  class_path: spherinator.data.ParquetDataModule
  init_args:
    data_directory: /local/gaia/xp_sampled_mean_spectrum/parquet
    data_column: flux
    normalize: minmax
    batch_size: 2048
    num_workers: 4

ParquetIterableDataset

@BerndDoser BerndDoser self-assigned this Jan 30, 2025
@BerndDoser BerndDoser added this to the v0.4 milestone Jan 30, 2025