-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: support root empty field in Parquet file #1619
Conversation
Codecov Report
|
OK, the fix segfaults (in PyArrow) for extension arrays. Let's try and unpack this. |
This does ring a bell, I recall columns not being what I thought they would be in some test I didn't end up writing. |
What Awkward Forms call "columns" doesn't include the If you're encountering segfaults when Arrow is trying to read an ExtensionType, first be sure that you're on the latest version of Arrow (they move quickly) and check to see if it's a known issue. (That's a list of Arrow issues that I've opened, but I think that I've opened more issues about corner cases in Parquet and ExtensionType than anyone else...) |
Yes, I'm following that part. The bit I'm not sure about is why anyone would author a schema with an empty root field (requiring As for the segfault, I'm testing locally on 9.0.0, so maybe we'll need to file an issue. I don't yet know enough about extension types to know whether i'm doing something wrong (even if it should never segfault vs loudly failing). |
If there's a difference between Arrow 7.0.0, 8.0.0, or 9.0.0 then it's an Arrow bug. Version 7.0.0 is the first that supports ExtensionTypes in the way that we want to use them, and most of the testing/development was done on 8.0.0.
An Awkward Array that doesn't have any RecordArrays or has a list-type outside of the record-type is represented in It's a pretty common to work with separated ragged arrays (columns named by Python variable names, rather than a record-type in Awkward), and although it would make the most sense to zip these before writing them to a Parquet file, but I'm pretty sure that many of these users are going to write separate Parquet files (columns named by file paths in the file system, rather than columns within a Parquet file). Maybe we'd want to hint at better usage patterns, but writing 10 ragged arrays to 10 separate files is not wrong and shouldn't fail. That would be a definite case in which a Parquet writer would write files with |
That sounds familiar, thanks for the explanation. So Awkward-authored data probably used to (or still does, I haven't looked) use an empty string as a workaround for this requirement. |
Still does. As long as the empty string is an allowed field name in |
The parquet spec itself requires the column names be str, so "" is very much allowed. It wouldn't surprise me too much if it confused some tooling in the chain, though, since string splitting and manipulations might loose the empty piece. Note that most parquet schemas have an invisible root element called "schema", which is not part of the spec but yet another hidden convention. |
The fastparquet view of the schema for that dataset:
|
OK, I'm hoping this works. I removed the test that causes the crash because
I'll file a new issue to track this going forward, so we can merge this one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests did pass and I moved investigation of the segfault to #1635, so go ahead and squash-and-merge when you're ready.
* refactor: remove duplicate loc * fix: specify column_prefix * test: add test for scikit-hep#1619 * test: update test to show crash happens with extension arrays
Whilst we strip redundant empty record-structures from the metadata in
ak.from_parquet
, we later use the raw table to pull out the columns. This means that we currently generate the wrong column names for such a table, and the corresponding result ofak.from_parquet
with specified columns is an empty record array.This should fix #1606.
@jpivarski / @martindurant your input would you appreciated here just to validate my understanding is correct - I'm not entirely sure why some parquet files have these root empty records. Is it something that we were doing when generating parquet files in the past, and now we're reading them back?
PS - apologies for the push to main everyone; I switched branches in my IDE to check something, and didn't realise when I pushed & committed from the CLI.