Nested struct support #4
base: master
Conversation
csharp/BufferedReader.cs
@@ -9,6 +9,8 @@ internal sealed class BufferedReader&lt;TPhysicalvalue&gt; where TPhysicalvalue : unmanaged
{
    public BufferedReader(ColumnReader reader, TPhysicalvalue[] values, short[]? defLevels, short[]? repLevels)
    {
        if (defLevels == null) throw new InvalidOperationException("definition levels not defined");
Why is `defLevels` now required here? If it actually is required then the parameter should probably be non-nullable, but I would think it still needs to be optional.
Ah, I see BufferedReader is only used for reading arrays now. So we would expect repLevels to always be present too, right? Also, would it be valid to have required arrays such that defLevels were not present?

If we are going to do a null check here then the private properties should probably be non-nullable, so that the `?? -1` isn't required below.
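For illustration, a sketch of what that could look like (the field names are my assumption, not necessarily the actual ones):

```csharp
// Since the constructor rejects null level arrays, the fields can be
// non-nullable, which removes the need for `?? -1` fallbacks later on.
private readonly short[] _defLevels;
private readonly short[] _repLevels;
```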
I had a play around with this to try to understand things better. In PyArrow you can round-trip a required list, but we don't currently support reading this in ParquetSharp and crash with `elementType is an array but schema does not match the expected layout`. But you do always have defLevels set, even for required lists, due to the middle repeated "list" schema node. (It would probably be nice to fix this later, but it's not high priority.)

And I see now that for nested fields without arrays, repLevels will be null, so that needs to stay optional.

For non-nullable nested fields defLevels may not be defined though, so I think this null check is actually incorrect. However, if I comment it out I get an "Invalid input stream" exception later, due to the `defn.DefLevel < nullDefinitionLevel` check in `MakeLeafReaderSingle`, where `defn.DefLevel == -1`. E.g. I can't currently read a file generated like this:
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([pa.field('struct', type=pa.struct([pa.field('el', pa.int32(), nullable=False)]), nullable=False)])
table = pa.Table.from_arrays([[(1, ), (2, ), (3, )]], schema=schema)
pq.write_table(table, 'test.parquet')
It looks like the fix is as simple as changing the `-1`s below to `0`, which makes more sense to me; if there are no definition levels required then everything must be implicitly defined at level 0 (and probably similarly for repLevels?). Would be good to add a unit test for this case too. A sketch of what I mean follows.
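A minimal sketch of the suggested change, assuming the levels are read with null-coalescing fallbacks (the surrounding names here are assumptions for illustration):

```csharp
// With no defLevels/repLevels arrays present, every value is implicitly
// defined and repeated at level 0, so fall back to 0 rather than -1.
var defLevel = _defLevels?[_levelIndex] ?? 0; // was `?? -1`
var repLevel = _repLevels?[_levelIndex] ?? 0; // was `?? -1`
```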
csharp/ColumnDescriptor.cs
// TODO: Skip if elementType is a reference type?
// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
elementType = elementType.BaseType != typeof(object) && elementType.BaseType != typeof(Array) ?
    typeof(Nullable<>).MakeGenericType(elementType) : elementType;
Would `elementType.BaseType.IsValueType ? ...` make more sense here?
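For illustration, a sketch of the simplified check. One caveat (my observation, not from the thread): for a struct, `BaseType` is `System.ValueType`, and `typeof(ValueType).IsValueType` is `false`, so the reliable form checks the element type itself:

```csharp
// Wrap value types in Nullable<>; leave reference types unchanged
// (Nullable<> can only be constructed over value types anyway).
elementType = elementType.IsValueType
    ? typeof(Nullable<>).MakeGenericType(elementType)
    : elementType;
```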
else if (node is Schema.GroupNode && node.Parent != null)
{
    // TODO: Skip if elementType is a reference type?
    elementType = typeof(Nested<>).MakeGenericType(elementType);
I don't think I follow why we'd skip the nesting here for reference types. Wouldn't we always have nesting for group nodes? (Except for LIST groups, but those should already be skipped above, and I guess MAP key-values, which we don't yet support.)
|| node.Parent.LogicalType.Type != LogicalTypeEnum.List
|| node.Parent.Repetition is not (Repetition.Optional or Repetition.Required))
{
    throw new Exception("Schema not according to Parquet spec");
The parquet docs do say, "A repeated field that is neither contained by a LIST- or MAP-annotated group nor annotated by LIST or MAP should be interpreted as a required list of required elements where the element type is the type of the field". But I don't think we currently support that anyway, so throwing here isn't a breaking change, right? A schema of that shape is sketched below.
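For reference, an illustrative example (mine, not from the PR) of such a schema in Parquet schema text form:

```
message example {
  // A bare repeated field, not wrapped in a LIST- or MAP-annotated group:
  // per the spec quote above, it should be read as a required list of
  // required int32 elements.
  repeated int32 values;
}
```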
@@ -25,6 +24,23 @@ public virtual void Dispose()
    {
        Source.Dispose();
    }

    protected static bool RepLevelsRequired(Type type)
I wonder if these should be `private protected` methods. The `LogicalColumnStream` class is public due to `LogicalColumnReader` etc. being public, but we don't actually expect users to implement their own classes deriving from this.
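A sketch of the suggested accessibility (the method body is hypothetical, just to make it concrete):

```csharp
// `private protected`: accessible only to derived classes within the same
// assembly, so LogicalColumnReader/Writer can call it while user code
// deriving from LogicalColumnStream in another assembly cannot.
private protected static bool RepLevelsRequired(Type type)
{
    return type.IsArray; // hypothetical body for illustration
}
```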
var columnWriter = (ColumnWriter<TPhysical>)Source;

var writer0 = MakeWriter(schemaNodes, elementType, (short)(repetitionLevel + 1), (short)(nullDefinitionLevel + 2), firstLeafRepLevel, false);
var writer = MakeWriter(schemaNodes, elementType, (short)(repetitionLevel + 1), (short)(nullDefinitionLevel + 2), repetitionLevel, false);
It took me a while to wrap my head around why we need a `writer0` and a `writer` with different `firstLeafRepLevel` values; it would be good to have a thorough comment explaining this part.
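For what it's worth, my reading of it (unverified), based on how Dremel-style repetition levels work; a comment roughly along these lines might be what's wanted:

```csharp
// A leaf value's repetition level records the depth at which the enclosing
// structure repeated. The first leaf written for a new outer element must
// therefore carry the caller's firstLeafRepLevel, while every subsequent
// leaf of this array repeats at this array's own repetitionLevel. writer0
// covers the first case, writer covers the rest.
```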
    return result.Length;
}

private Func<int, Array> MakeReader(Node[] schemaNodes, Type elementType, int repetitionLevel, int nullDefinitionLevel, bool wantSingleItem)
As mentioned on Slack, switching to a `definitionLevel` instead of `nullDefinitionLevel`, as was done on master, would probably simplify some things and seems more logical to me.
Can't find a report on why it's failing the code formatting check...
Hmm, maybe we need to change the verbosity to show that, or show the diff. You should be able to run the tool locally to fix the formatting though, e.g.:
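Assuming the check uses dotnet-format, an invocation along the lines of `dotnet format <path-to-solution>` from the repository root should apply the fixes in place (this command is my assumption, not taken from the CI config).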
Ah great, thanks Adam! That command is in the log, sorry I didn't spot it.
Took a look at maps: round-tripping one gives a parquet file containing 3 columns, but reading it back fails, which is because the key is required, hence there's no match in LogicalTypeFactory.cs's default type mapping.
Think it's nearly there with maps and required arrays too with the latest commit. The thinking is that even if you have to mess around with schema Nodes directly, and deal with data at a lower level than you otherwise might want, at least being able to read/write/round-trip data for these schemas supports more of the Parquet spec and increases confidence in the ParquetSharp array handling code. I.e. this doesn't try to do anything higher level.

Also, the Python code appears to be doing integration testing; I have added a map.parquet file here, but maybe this needs some more thinking about.
@@ -134,6 +134,11 @@ public virtual unsafe (Type physicalType, Type logicalType) GetSystemTypes(ColumnDescriptor descriptor)
    return (typeof(ByteArray), typeof(byte[]));
}

if (logicalType.Type == LogicalTypeEnum.String)
This is here because we don't return earlier in the function: the default mapping doesn't match the column descriptor's required value.
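The added branch is truncated in the diff above; presumably it completes roughly as follows (a guess for illustration, not the verbatim change):

```csharp
if (logicalType.Type == LogicalTypeEnum.String)
{
    // A string column is physically a ByteArray whether the column is
    // required or optional, so map it to string explicitly here.
    return (typeof(ByteArray), typeof(string));
}
```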
Think the only outstanding issue not covered by this thread is tests for how this new stuff plays with custom type handling.
I did some benchmarking, given that reading arrays appears to be slow and we'll probably need to address that soon, so I wanted to check this doesn't introduce any regressions. The benchmarks are: floats = reading a time series of floats, and bigarray = reading a time series of float arrays. The float array benchmark uses quite a big file (1GB) to make sure there's good signal to noise when profiling. Results were collected for:

- floats, netcoreapp3.1, latest master ParquetSharp
- floats, net5.0, latest master ParquetSharp
- bigarray, net5.0, latest master ParquetSharp (ParquetDotNet seems to use a ton more memory)
- floats, net5.0, this branch of ParquetSharp
- bigarray, net5.0, this branch of ParquetSharp

So in summary: netcoreapp3.1 -> net5.0 isn't a big difference, and at least as far as this test case goes, this branch doesn't slow us down, but we're slower than ParquetDotNet. I've committed to this branch.

Benchmark machine: Ryzen 5950X, 64GB memory, Samsung 1TB 970 Evo, as few LEDs as possible (set to white).
#5 is WIP (unit tests aren't passing yet) but it's down to 5s from 12.9s.