Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: support root empty field in Parquet file #1619

Merged
merged 7 commits into from
Aug 26, 2022

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Aug 23, 2022

Whilst we strip redundant empty record-structures from the metadata in ak.from_parquet, we later use the raw table to pull out the columns. This means that we currently generate the wrong column names for such a table, and the corresponding result of ak.from_parquet with specified columns is an empty record array.

This should fix #1606.

@jpivarski / @martindurant your input would you appreciated here just to validate my understanding is correct - I'm not entirely sure why some parquet files have these root empty records. Is it something that we were doing when generating parquet files in the past, and now we're reading them back?

PS - apologies for the push to main everyone; I switched branches in my IDE to check something, and didn't realise when I pushed & committed from the CLI.

@codecov
Copy link

codecov bot commented Aug 23, 2022

Codecov Report

Merging #1619 (898b1bb) into main (9e17f29) will increase coverage by 0.37%.
The diff coverage is 65.36%.

Impacted Files Coverage Δ
src/awkward/_v2/_connect/cuda/__init__.py 0.00% <0.00%> (ø)
src/awkward/_v2/_connect/numexpr.py 88.40% <0.00%> (ø)
src/awkward/_v2/_connect/pyarrow.py 88.46% <0.00%> (ø)
...awkward/_v2/_connect/rdataframe/from_rdataframe.py 0.00% <0.00%> (ø)
src/awkward/_v2/_lookup.py 98.68% <ø> (+1.17%) ⬆️
src/awkward/_v2/numba.py 93.47% <0.00%> (ø)
src/awkward/_v2/operations/ak_from_avro_file.py 66.66% <0.00%> (ø)
src/awkward/_v2/operations/ak_local_index.py 100.00% <ø> (ø)
src/awkward/_v2/operations/ak_max.py 87.09% <ø> (ø)
src/awkward/_v2/operations/ak_min.py 87.09% <ø> (ø)
... and 54 more

@agoose77
Copy link
Collaborator Author

OK, the fix segfaults (in PyArrow) for extension arrays. Let's try and unpack this.

@agoose77 agoose77 marked this pull request as draft August 23, 2022 11:07
@martindurant
Copy link
Contributor

This does ring a bell, I recall columns not being what I thought they would be in some test I didn't end up writing.

@jpivarski
Copy link
Member

What Awkward Forms call "columns" doesn't include the ".list" (in Arrow-style Parquet) or ".element" (in compliant/Spark-style Parquet) words at each level where there's a list. When a Form is asked to create a list of column name strings for Parquet, it inserts either the ".list" or the ".element"`, based on what kind of file it thinks it is (@martindurant added some detection for that).

If you're encountering segfaults when Arrow is trying to read an ExtensionType, first be sure that you're on the latest version of Arrow (they move quickly) and check to see if it's a known issue. (That's a list of Arrow issues that I've opened, but I think that I've opened more issues about corner cases in Parquet and ExtensionType than anyone else...)

@agoose77
Copy link
Collaborator Author

What Awkward Forms call "columns" doesn't include the ".list" (in Arrow-style Parquet) or ".element" (in compliant/Spark-style Parquet) words at each level where there's a list. When a Form is asked to create a list of column name strings for Parquet, it inserts either the ".list" or the ".element"`, based on what kind of file it thinks it is (@martindurant added some detection for that).

If you're encountering segfaults when Arrow is trying to read an ExtensionType, first be sure that you're on the latest version of Arrow (they move quickly) and check to see if it's a known issue. (That's a list of Arrow issues that I've opened, but I think that I've opened more issues about corner cases in Parquet and ExtensionType than anyone else...)

Yes, I'm following that part. The bit I'm not sure about is why anyone would author a schema with an empty root field (requiring ".list.item").

As for the segfault, I'm testing locally on 9.0.0, so maybe we'll need to file an issue. I don't yet know enough about extension types to know whether i'm doing something wrong (even if it should never segfault vs loudly failing).

@jpivarski
Copy link
Member

If there's a difference between Arrow 7.0.0, 8.0.0, or 9.0.0 then it's an Arrow bug. Version 7.0.0 is the first that supports ExtensionTypes in the way that we want to use them, and most of the testing/development was done on 8.0.0.

The bit I'm not sure about is why anyone would author a schema with an empty root field (requiring ".list.item").

An Awkward Array that doesn't have any RecordArrays or has a list-type outside of the record-type is represented in pyarrow.Table as a single-field with the empty string as a name, so column names start with a dot (".list.item.*"). It's a mis-match between Awkward's data model and Arrow/Parquet's: Arrow (Table) and Parquet always have a record-type at top-level, Awkward doesn't always.

It's a pretty common to work with separated ragged arrays (columns named by Python variable names, rather than a record-type in Awkward), and although it would make the most sense to zip these before writing them to a Parquet file, but I'm pretty sure that many of these users are going to write separate Parquet files (columns named by file paths in the file system, rather than columns within a Parquet file). Maybe we'd want to hint at better usage patterns, but writing 10 ragged arrays to 10 separate files is not wrong and shouldn't fail. That would be a definite case in which a Parquet writer would write files with ".list.item.*" columns.

@agoose77
Copy link
Collaborator Author

An Awkward Array that doesn't have any RecordArrays or has a list-type outside of the record-type is represented in pyarrow.Table as a single-field with the empty string as a name, so column names start with a dot (".list.item.*"). It's a mis-match between Awkward's data model and Arrow/Parquet's: Arrow (Table) and Parquet always have a record-type at top-level, Awkward doesn't always.

That sounds familiar, thanks for the explanation. So Awkward-authored data probably used to (or still does, I haven't looked) use an empty string as a workaround for this requirement.

@jpivarski
Copy link
Member

Still does. As long as the empty string is an allowed field name in pyarrow.Table, it's a good mapping. No reasonable fields for data analysis are going to be named "", so it's a part of the space we can use without being worried about clobbering a legitimate field name. If that possibility is closed down (e.g. Arrow decides that field names must be valid identifiers), then the next-best thing is "_".

@martindurant
Copy link
Contributor

The parquet spec itself requires the column names be str, so "" is very much allowed. It wouldn't surprise me too much if it confused some tooling in the chain, though, since string splitting and manipulations might loose the empty piece. Note that most parquet schemas have an invisible root element called "schema", which is not part of the spec but yet another hidden convention.

@martindurant
Copy link
Contributor

The fastparquet view of the schema for that dataset:

- schema: REQUIRED
  - : LIST, LIST, REQUIRED
    - list: REPEATED
      - item: REQUIRED
      | - trip: REQUIRED
      | | - sec: FLOAT, OPTIONAL
      | | - km: FLOAT, OPTIONAL
      | | - begin: REQUIRED
      | | | - lon: DOUBLE, OPTIONAL
      | | | - lat: DOUBLE, OPTIONAL
      | |   - time: INT64, TIMESTAMP[MILLIS], TIMESTAMP_MILLIS, OPTIONAL
      | | - end: REQUIRED
      | | | - lon: DOUBLE, OPTIONAL
      | | | - lat: DOUBLE, OPTIONAL
      | |   - time: INT64, TIMESTAMP[MILLIS], TIMESTAMP_MILLIS, OPTIONAL
      |   - path: LIST, LIST, REQUIRED
      |     - list: REPEATED
      |       - item: REQUIRED
      |       | - londiff: FLOAT, REQUIRED
      |         - latdiff: FLOAT, REQUIRED
      | - payment: REQUIRED
      | | - fare: FLOAT, OPTIONAL
      | | - tips: FLOAT, OPTIONAL
      | | - total: FLOAT, OPTIONAL
      |   - type: BYTE_ARRAY, STRING, UTF8, REQUIRED
        - company: BYTE_ARRAY, STRING, UTF8, REQUIRED

@agoose77 agoose77 marked this pull request as ready for review August 25, 2022 20:35
@agoose77
Copy link
Collaborator Author

OK, I'm hoping this works. I removed the test that causes the crash because

  • it's in PyArrow / extension arrays
  • it was already failing in main

I'll file a new issue to track this going forward, so we can merge this one.

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests did pass and I moved investigation of the segfault to #1635, so go ahead and squash-and-merge when you're ready.

@agoose77 agoose77 enabled auto-merge (squash) August 26, 2022 11:39
@agoose77 agoose77 merged commit 229c09f into main Aug 26, 2022
@agoose77 agoose77 deleted the agoose77/fix-pyarrow-empty-field branch August 26, 2022 12:09
Saransh-cpp pushed a commit to Saransh-cpp/awkward that referenced this pull request Aug 29, 2022
* refactor: remove duplicate loc

* fix: specify column_prefix

* test: add test for scikit-hep#1619

* test: update test to show crash happens with extension arrays
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ak.from_parquet returns empty array when columns are specified
3 participants