POC (do not merge): Add row-group and column select #222

martindurant · 2024-10-09T13:41:23Z

Makes read_parquet(filename, rgs=None, columns=None), where the optional inputs are integer lists. To be useful, the user needs to first know how many row-groups there are (this is easy) and how the columns they want map onto schema indices (this is hard for the nested case).

I note that loading refuses (correctly) in the case that you have a MAP type and you only specify the keys or values but not both.

I also found that "filename" must be a real file not a directory, which I found surprising, since parquet datasets are usually multi-file. Of course dask et al already has code for extracting the potentially many files of a dataset, first constructing an aggregate schema including any partitioning keys. I suppose this paragraph belongs in #195 .

kylebarron · 2024-10-09T16:16:45Z

Thanks for the PR!

Makes read_parquet(filename, rgs=None, columns=None), where the optional inputs are integer lists. To be useful, the user needs to first know how many row-groups there are (this is easy) and how the columns they want map onto schema indices (this is hard for the nested case).

Right, this is where it's nicer to have a stateful reader like ParquetFile. For reference, I have ParquetFile and ParquetDataset in geoarrow-rs. These should work for non-spatial data, but are spatial-aware.

I also found that "filename" must be a real file not a directory, which I found surprising, since parquet datasets are usually multi-file. Of course dask et al already has code for extracting the potentially many files of a dataset, first constructing an aggregate schema including any partitioning keys. I suppose this paragraph belongs in #195 .

The Rust parquet crate API only supports single files, so any dataset abstraction needs to be built on top of that. You can refer to ParquetDataset in geoarrow-rs: https://github.com/geoarrow/geoarrow-rs/blob/d4e6957bd73f70f9d00a8b6b8c2d66f7c0622deb/python/geoarrow-io/src/io/parquet/async.rs#L313-L321

martindurant · 2024-10-10T14:45:02Z

Right, this is where it's nicer to have a stateful reader like ParquetFile. For reference, I have ParquetFile and ParquetDataset in geoarrow-rs. These should work for non-spatial data, but are spatial-aware.

I suppose I am a little confused about how you see there different packages relating to each other. Clearly geoarrow cares about parquet, but it also uses arro3 internally, so why wouldn't dataset live here? I see it also supports remote stores...

geoarrow, of course, depends on the whole of pyarrow (all that C++ stuff). Not to mention GEOS and GDAL!!

kylebarron · 2024-10-10T15:08:01Z

I suppose I am a little confused about how you see there different packages relating to each other. Clearly geoarrow cares about parquet, but it also uses arro3 internally, so why wouldn't dataset live here? I see it also supports remote stores...

Because I don't know of a good way (yet) to have the main Python binding live here but inject spatial-specific extensions from geoarrow-rs's Python bindings. If we figured that out then I would put the core Python parquet bindings here.

It's the same story on the JS side, where I have WebAssembly bindings to Parquet, but then reimplement a bunch of that stuff for GeoParquet in JS. https://github.com/geoarrow/geoarrow-rs/tree/main/js/src/io/parquet

geoarrow, of course, depends on the whole of pyarrow (all that C++ stuff). Not to mention GEOS and GDAL!!

No. geoarrow-rs does not depend on pyarrow, nor GEOS nor GDAL. geoarrow-rs is pure Rust, and especially the GeoParquet reader and all the core dependencies are pure Rust. That's what allows it to go in WebAssembly so easily https://explore.overturemaps.org (clicking Download Visible uses my Rust GeoParquet reader). geoarrow-rs is separate from geoarrow-c and geoarrow-pyarrow, which depend on pyarrow.

kylebarron · 2024-10-10T15:09:37Z

Clearly geoarrow cares about parquet, but it also uses arro3 internally

To be pedantic, the geoarrow crate does not use arro3 internally. The geoarrow crate and its Python bindings only use pyo3-arrow. arro3-core is listed as a Python dependency only so that users have a friendly API to work with Arrow table data.

kylebarron · 2024-10-09T16:11:37Z

arro3-io/src/parquet.rs

+pub fn read_parquet(
+    py: Python,
+    file: FileReader,
+    rgs: Option<Vec<usize>>,


I'd call this row_groups for symmetry with pyarrow https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile.read_row_groups

Certainly! It really ought to be on the reader (.select() or such) anyway, so that you only read the parquet metadata once. Or maybe have a separate "fetch details" function. This was only minimalist POC to show that you can pick data portions like this.

Add row-group and column select

1b8904f

kylebarron reviewed Oct 10, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

POC (do not merge): Add row-group and column select #222

POC (do not merge): Add row-group and column select #222

martindurant commented Oct 9, 2024

kylebarron commented Oct 9, 2024

martindurant commented Oct 10, 2024 •

edited

Loading

kylebarron commented Oct 10, 2024

kylebarron commented Oct 10, 2024

kylebarron Oct 9, 2024

martindurant Oct 10, 2024

POC (do not merge): Add row-group and column select #222

Are you sure you want to change the base?

POC (do not merge): Add row-group and column select #222

Conversation

martindurant commented Oct 9, 2024

kylebarron commented Oct 9, 2024

martindurant commented Oct 10, 2024 • edited Loading

kylebarron commented Oct 10, 2024

kylebarron commented Oct 10, 2024

kylebarron Oct 9, 2024

Choose a reason for hiding this comment

martindurant Oct 10, 2024

Choose a reason for hiding this comment

martindurant commented Oct 10, 2024 •

edited

Loading