WIP: Parquet as general dataset format #225

Open · wants to merge 53 commits into base: main
Conversation

BerndDoser (Member)

Apache Parquet offers several advantages as a dataset:

  • Compressible
  • Streamable
  • Allows random access
  • Supports versioning

Apache Arrow is an ideal framework for managing Parquet datasets:

  • In-memory data processing framework
  • Support for various languages: Python (pyarrow), C++, Rust, Julia, ...

Special features for ~Spherinator`:

  • Support different data types: images, time series, point clouds, graphs, data cubes, spectra
  • Metadata in extra columns

ParquetDataset

  • Loads the entire dataset into memory at once
  • Specifies the data column used as training data
  • Multidimensional arrays are flattened, since Parquet only supports 1D arrays; the original shape
    is stored in the metadata field <data_column>_shape and restored in the DataLoader.
  • Multiple data columns are concatenated, e.g. BP and RP flux

Define ParquetDataModule in config.yaml:

data:
  class_path: spherinator.data.ParquetDataModule
  init_args:
    data_directory: /local/gaia/xp_sampled_mean_spectrum/parquet
    data_column: flux
    normalize: minmax
    batch_size: 2048
    num_workers: 4

ParquetIterableDataset

@BerndDoser BerndDoser self-assigned this Jan 30, 2025
@BerndDoser BerndDoser added this to the v0.4 milestone Jan 30, 2025