Make read_parquet_dataset() less I/O intense #161

hombit · 2023-11-07T16:30:39Z

Currently hipscat.io.file_io.read_parquet_dataset() makes some assumptions that can be suboptimal for some use cases. For instance, it calls pyarrow.dataset.dataset() with exclude_invalid_files=True, which forces pyarrow to open and check every file in the catalog. I propose making this behavior optional.

This change would also require updating ignore_prefixes, which currently doesn't cover the JSON and FITS metadata files. Without the validity check, pyarrow.dataset.dataset would treat them as parquet files. We could probably also allow the user to provide custom ignore_prefixes.

A likely solution is to add **kwargs to the function and pass them through to pyarrow.dataset.dataset, so that the caller can override any of its arguments.
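A minimal sketch of that idea, not the actual hipscat implementation. The names build_dataset_kwargs, read_parquet_dataset_sketch and DEFAULT_IGNORE_PREFIXES are hypothetical, and the "catalog_info" and "point_map" prefixes are only illustrative stand-ins for the JSON and FITS metadata files. The defaults keep today's behavior (exclude_invalid_files=True), and any keyword supplied by the caller overrides them before being forwarded to pyarrow.dataset.dataset.

```python
from typing import Any, Dict

# "." and "_" are pyarrow's own default ignore prefixes. The other two
# entries are ASSUMED example prefixes for JSON/FITS metadata files.
DEFAULT_IGNORE_PREFIXES = [".", "_", "catalog_info", "point_map"]


def build_dataset_kwargs(**kwargs: Any) -> Dict[str, Any]:
    """Merge caller overrides into the default pyarrow.dataset.dataset options."""
    options: Dict[str, Any] = {
        "format": "parquet",
        # Default matches the current behavior: pyarrow checks every file.
        "exclude_invalid_files": True,
        # Copy the list so callers cannot mutate the module-level default.
        "ignore_prefixes": list(DEFAULT_IGNORE_PREFIXES),
    }
    # Caller-supplied keywords win over the defaults.
    options.update(kwargs)
    return options


def read_parquet_dataset_sketch(source: str, **kwargs: Any):
    """Hypothetical wrapper that forwards **kwargs to pyarrow.dataset.dataset."""
    # Imported here so the option-merging helper can be used without pyarrow.
    import pyarrow.dataset as pds

    return pds.dataset(source, **build_dataset_kwargs(**kwargs))
```

With this shape, a caller who trusts the catalog layout could skip the per-file check with read_parquet_dataset_sketch(path, exclude_invalid_files=False), or pass a custom ignore_prefixes list. When the check is disabled, the ignore list has to cover every non-parquet file, otherwise the error surfaces later at scan time.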


This was referenced Nov 9, 2023

Save catalog to disk astronomy-commons/lsdb#61

Merged

Use kwargs for dataset read. #164

Merged

delucchi-cmu self-assigned this Nov 13, 2023

delucchi-cmu closed this as completed in #164 Nov 13, 2023
