fix: support root empty field in Parquet file #1619

agoose77 · 2022-08-23T10:30:47Z

Whilst we strip redundant empty record-structures from the metadata in ak.from_parquet, we later use the raw table to pull out the columns. This means that we currently generate the wrong column names for such a table, and the corresponding result of ak.from_parquet with specified columns is an empty record array.

This should fix #1606.

@jpivarski / @martindurant your input would you appreciated here just to validate my understanding is correct - I'm not entirely sure why some parquet files have these root empty records. Is it something that we were doing when generating parquet files in the past, and now we're reading them back?

PS - apologies for the push to main everyone; I switched branches in my IDE to check something, and didn't realise when I pushed & committed from the CLI.

codecov · 2022-08-23T10:43:48Z

Codecov Report

Merging #1619 (898b1bb) into main (9e17f29) will increase coverage by 0.37%.
The diff coverage is 65.36%.

Impacted Files	Coverage Δ
src/awkward/_v2/_connect/cuda/__init__.py	`0.00% <0.00%> (ø)`
src/awkward/_v2/_connect/numexpr.py	`88.40% <0.00%> (ø)`
src/awkward/_v2/_connect/pyarrow.py	`88.46% <0.00%> (ø)`
...awkward/_v2/_connect/rdataframe/from_rdataframe.py	`0.00% <0.00%> (ø)`
src/awkward/_v2/_lookup.py	`98.68% <ø> (+1.17%)`	⬆️
src/awkward/_v2/numba.py	`93.47% <0.00%> (ø)`
src/awkward/_v2/operations/ak_from_avro_file.py	`66.66% <0.00%> (ø)`
src/awkward/_v2/operations/ak_local_index.py	`100.00% <ø> (ø)`
src/awkward/_v2/operations/ak_max.py	`87.09% <ø> (ø)`
src/awkward/_v2/operations/ak_min.py	`87.09% <ø> (ø)`
... and 54 more

agoose77 · 2022-08-23T11:03:32Z

OK, the fix segfaults (in PyArrow) for extension arrays. Let's try and unpack this.

martindurant · 2022-08-23T13:11:15Z

This does ring a bell, I recall columns not being what I thought they would be in some test I didn't end up writing.

jpivarski · 2022-08-23T14:58:48Z

What Awkward Forms call "columns" doesn't include the ".list" (in Arrow-style Parquet) or ".element" (in compliant/Spark-style Parquet) words at each level where there's a list. When a Form is asked to create a list of column name strings for Parquet, it inserts either the ".list" or the ".element"`, based on what kind of file it thinks it is (@martindurant added some detection for that).

If you're encountering segfaults when Arrow is trying to read an ExtensionType, first be sure that you're on the latest version of Arrow (they move quickly) and check to see if it's a known issue. (That's a list of Arrow issues that I've opened, but I think that I've opened more issues about corner cases in Parquet and ExtensionType than anyone else...)

agoose77 · 2022-08-23T15:21:19Z

What Awkward Forms call "columns" doesn't include the ".list" (in Arrow-style Parquet) or ".element" (in compliant/Spark-style Parquet) words at each level where there's a list. When a Form is asked to create a list of column name strings for Parquet, it inserts either the ".list" or the ".element"`, based on what kind of file it thinks it is (@martindurant added some detection for that).

If you're encountering segfaults when Arrow is trying to read an ExtensionType, first be sure that you're on the latest version of Arrow (they move quickly) and check to see if it's a known issue. (That's a list of Arrow issues that I've opened, but I think that I've opened more issues about corner cases in Parquet and ExtensionType than anyone else...)

Yes, I'm following that part. The bit I'm not sure about is why anyone would author a schema with an empty root field (requiring ".list.item").

As for the segfault, I'm testing locally on 9.0.0, so maybe we'll need to file an issue. I don't yet know enough about extension types to know whether i'm doing something wrong (even if it should never segfault vs loudly failing).

jpivarski · 2022-08-23T15:51:54Z

If there's a difference between Arrow 7.0.0, 8.0.0, or 9.0.0 then it's an Arrow bug. Version 7.0.0 is the first that supports ExtensionTypes in the way that we want to use them, and most of the testing/development was done on 8.0.0.

The bit I'm not sure about is why anyone would author a schema with an empty root field (requiring ".list.item").

An Awkward Array that doesn't have any RecordArrays or has a list-type outside of the record-type is represented in pyarrow.Table as a single-field with the empty string as a name, so column names start with a dot (".list.item.*"). It's a mis-match between Awkward's data model and Arrow/Parquet's: Arrow (Table) and Parquet always have a record-type at top-level, Awkward doesn't always.

It's a pretty common to work with separated ragged arrays (columns named by Python variable names, rather than a record-type in Awkward), and although it would make the most sense to zip these before writing them to a Parquet file, but I'm pretty sure that many of these users are going to write separate Parquet files (columns named by file paths in the file system, rather than columns within a Parquet file). Maybe we'd want to hint at better usage patterns, but writing 10 ragged arrays to 10 separate files is not wrong and shouldn't fail. That would be a definite case in which a Parquet writer would write files with ".list.item.*" columns.

agoose77 · 2022-08-23T16:57:48Z

An Awkward Array that doesn't have any RecordArrays or has a list-type outside of the record-type is represented in pyarrow.Table as a single-field with the empty string as a name, so column names start with a dot (".list.item.*"). It's a mis-match between Awkward's data model and Arrow/Parquet's: Arrow (Table) and Parquet always have a record-type at top-level, Awkward doesn't always.

That sounds familiar, thanks for the explanation. So Awkward-authored data probably used to (or still does, I haven't looked) use an empty string as a workaround for this requirement.

jpivarski · 2022-08-23T17:18:08Z

Still does. As long as the empty string is an allowed field name in pyarrow.Table, it's a good mapping. No reasonable fields for data analysis are going to be named "", so it's a part of the space we can use without being worried about clobbering a legitimate field name. If that possibility is closed down (e.g. Arrow decides that field names must be valid identifiers), then the next-best thing is "_".

martindurant · 2022-08-23T17:29:29Z

The parquet spec itself requires the column names be str, so "" is very much allowed. It wouldn't surprise me too much if it confused some tooling in the chain, though, since string splitting and manipulations might loose the empty piece. Note that most parquet schemas have an invisible root element called "schema", which is not part of the spec but yet another hidden convention.

martindurant · 2022-08-23T17:30:19Z

The fastparquet view of the schema for that dataset:

- schema: REQUIRED
  - : LIST, LIST, REQUIRED
    - list: REPEATED
      - item: REQUIRED
      | - trip: REQUIRED
      | | - sec: FLOAT, OPTIONAL
      | | - km: FLOAT, OPTIONAL
      | | - begin: REQUIRED
      | | | - lon: DOUBLE, OPTIONAL
      | | | - lat: DOUBLE, OPTIONAL
      | |   - time: INT64, TIMESTAMP[MILLIS], TIMESTAMP_MILLIS, OPTIONAL
      | | - end: REQUIRED
      | | | - lon: DOUBLE, OPTIONAL
      | | | - lat: DOUBLE, OPTIONAL
      | |   - time: INT64, TIMESTAMP[MILLIS], TIMESTAMP_MILLIS, OPTIONAL
      |   - path: LIST, LIST, REQUIRED
      |     - list: REPEATED
      |       - item: REQUIRED
      |       | - londiff: FLOAT, REQUIRED
      |         - latdiff: FLOAT, REQUIRED
      | - payment: REQUIRED
      | | - fare: FLOAT, OPTIONAL
      | | - tips: FLOAT, OPTIONAL
      | | - total: FLOAT, OPTIONAL
      |   - type: BYTE_ARRAY, STRING, UTF8, REQUIRED
        - company: BYTE_ARRAY, STRING, UTF8, REQUIRED

…empty-field

agoose77 · 2022-08-25T20:39:02Z

OK, I'm hoping this works. I removed the test that causes the crash because

it's in PyArrow / extension arrays
it was already failing in main

I'll file a new issue to track this going forward, so we can merge this one.

jpivarski

The tests did pass and I moved investigation of the segfault to #1635, so go ahead and squash-and-merge when you're ready.

* refactor: remove duplicate loc * fix: specify column_prefix * test: add test for scikit-hep#1619 * test: update test to show crash happens with extension arrays

agoose77 added 2 commits August 23, 2022 11:26

refactor: remove duplicate loc

cb167d2

fix: specify column_prefix

8e40f2f

test: add test for #1619

c253164

agoose77 marked this pull request as draft August 23, 2022 11:07

Merge remote-tracking branch 'origin/main' into agoose77/fix-pyarrow-…

83e0b88

…empty-field

agoose77 mentioned this pull request Aug 25, 2022

[DO NOT MERGE] Diff that fixes ak.from_parquet column selection #1634

Closed

test: update test to show crash happens with extension arrays

db4e695

agoose77 marked this pull request as ready for review August 25, 2022 20:35

Merge branch 'main' into agoose77/fix-pyarrow-empty-field

8c2c2f1

jpivarski mentioned this pull request Aug 25, 2022

Segfault in pyarrow observed in #1619 #1635

Closed

jpivarski approved these changes Aug 25, 2022

View reviewed changes

Merge branch 'main' into agoose77/fix-pyarrow-empty-field

898b1bb

agoose77 enabled auto-merge (squash) August 26, 2022 11:39

agoose77 merged commit 229c09f into main Aug 26, 2022

agoose77 deleted the agoose77/fix-pyarrow-empty-field branch August 26, 2022 12:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: support root empty field in Parquet file #1619

fix: support root empty field in Parquet file #1619

agoose77 commented Aug 23, 2022 •

edited

Loading

codecov bot commented Aug 23, 2022 •

edited

Loading

agoose77 commented Aug 23, 2022

martindurant commented Aug 23, 2022

jpivarski commented Aug 23, 2022

agoose77 commented Aug 23, 2022

jpivarski commented Aug 23, 2022

agoose77 commented Aug 23, 2022

jpivarski commented Aug 23, 2022

martindurant commented Aug 23, 2022

martindurant commented Aug 23, 2022

agoose77 commented Aug 25, 2022

jpivarski left a comment

fix: support root empty field in Parquet file #1619

fix: support root empty field in Parquet file #1619

Conversation

agoose77 commented Aug 23, 2022 • edited Loading

codecov bot commented Aug 23, 2022 • edited Loading

Codecov Report

agoose77 commented Aug 23, 2022

martindurant commented Aug 23, 2022

jpivarski commented Aug 23, 2022

agoose77 commented Aug 23, 2022

jpivarski commented Aug 23, 2022

agoose77 commented Aug 23, 2022

jpivarski commented Aug 23, 2022

martindurant commented Aug 23, 2022

martindurant commented Aug 23, 2022

agoose77 commented Aug 25, 2022

jpivarski left a comment

Choose a reason for hiding this comment

agoose77 commented Aug 23, 2022 •

edited

Loading

codecov bot commented Aug 23, 2022 •

edited

Loading