Nested struct support #4

Open · philjdf wants to merge 39 commits into master
Conversation

@philjdf (Owner) commented Dec 5, 2021

No description provided.

@@ -9,6 +9,8 @@ internal sealed class BufferedReader<TPhysicalvalue> where TPhysicalvalue : unma
{
public BufferedReader(ColumnReader reader, TPhysicalvalue[] values, short[]? defLevels, short[]? repLevels)
{
if (defLevels == null) throw new InvalidOperationException("definition levels not defined");

Why is defLevels now required here? If it actually is required then the parameter should probably be non-nullable, but I would think it still needs to be optional.

Ah, I see BufferedReader is only used for reading arrays now. So we would expect repLevels to always be present too, right? Also, would it be valid to have required arrays such that defLevels were not present?

If we are going to do a null check here then the private properties should probably be non-nullable so that the ?? -1 isn't required below.

@adamreeve Dec 7, 2021

I had a play around with this to try to understand things better. In PyArrow you can round-trip a required list, but we don't currently support reading this in ParquetSharp and crash with "elementType is an array but schema does not match the expected layout". But you do always have defLevels set, even for required lists, due to the middle repeated "list" schema node. (Would probably be nice to fix this later, but it's not high priority.)
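
A minimal sketch of that PyArrow round-trip (the required_list.parquet file name here is just illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

# A list field marked non-nullable becomes a required LIST group in the Parquet schema
schema = pa.schema([pa.field('values', pa.list_(pa.int32()), nullable=False)])
table = pa.Table.from_arrays([[[1, 2], [3], []]], schema=schema)
pq.write_table(table, 'required_list.parquet')
# PyArrow reads this back fine; ParquetSharp currently crashes on it
print(pq.read_table('required_list.parquet'))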

And I see now that for nested fields without arrays, repLevels will be null, so that needs to stay as optional.

For non-nullable nested fields defLevels may not be defined though, so I think this null check is actually incorrect. However, if I comment it out I get an "Invalid input stream" exception later, due to checking defn.DefLevel < nullDefinitionLevel in MakeLeafReaderSingle where defn.DefLevel == -1. E.g. I can't currently read a file generated like this:

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([pa.field('struct', type=pa.struct([pa.field('el', pa.int32(), nullable=False)]), nullable=False)])
table = pa.Table.from_arrays([[(1, ), (2, ), (3, )]], schema=schema)
pq.write_table(table, 'test.parquet')

It looks like the fix is as simple as changing the -1s below to 0, which makes more sense to me; if no definition levels are required then everything must be implicitly defined at level 0 (and probably similarly for repLevels?). It would be good to add a unit test for this case too; see the sketch below.
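
For such a unit test, the expected values can be sanity-checked against PyArrow (a sketch reading back the test.parquet written above):

import pyarrow.parquet as pq

table = pq.read_table('test.parquet')
# Expect three non-null structs: {'el': 1}, {'el': 2}, {'el': 3}
print(table.column('struct').to_pylist())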

csharp.test/TestLogicalTypeRoundtrip.cs (outdated, resolved)
// TODO: Skip if elementType is a reference type?
// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
elementType = elementType.BaseType != typeof(object) && elementType.BaseType != typeof(Array) ?
typeof(Nullable<>).MakeGenericType(elementType) : elementType;

Would elementType.BaseType.IsValueType ? ... make more sense here?

else if (node is Schema.GroupNode && node.Parent != null)
{
// TODO: Skip if elementType is a reference type?
elementType = typeof(Nested<>).MakeGenericType(elementType);

I don't think I follow why we'd skip the nesting here for reference types. Wouldn't we always have nesting for group nodes? (Except for LIST groups, but those should already be skipped above, and I guess MAP key-values, which we don't yet support.)

|| node.Parent.LogicalType.Type != LogicalTypeEnum.List
|| node.Parent.Repetition is not (Repetition.Optional or Repetition.Required))
{
throw new Exception("Schema not according to Parquet spec");

The Parquet docs do say, "A repeated field that is neither contained by a LIST- or MAP-annotated group nor annotated by LIST or MAP should be interpreted as a required list of required elements where the element type is the type of the field". But I don't think we currently support that, right? So throwing here isn't a breaking change.

@@ -25,6 +24,23 @@ public virtual void Dispose()
Source.Dispose();
}

protected static bool RepLevelsRequired(Type type)

I wonder if these should be private protected methods. The LogicalColumnStream class is public due to LogicalColumnReader etc being public, but we don't actually expect users to implement their own classes deriving from this.

var columnWriter = (ColumnWriter<TPhysical>)Source;

var writer0 = MakeWriter(schemaNodes, elementType, (short)(repetitionLevel + 1), (short)(nullDefinitionLevel + 2), firstLeafRepLevel, false);
var writer = MakeWriter(schemaNodes, elementType, (short)(repetitionLevel + 1), (short)(nullDefinitionLevel + 2), repetitionLevel, false);

It took me a while to wrap my head around why we need a writer0 and a writer with different firstLeafRepLevel values; it would be good to have a thorough comment explaining this part.

return result.Length;
}

private Func<int, Array> MakeReader(Node[] schemaNodes, Type elementType, int repetitionLevel, int nullDefinitionLevel, bool wantSingleItem)

As mentioned on Slack, switching to using a definitionLevel instead of nullDefinitionLevel, as was done on master, would probably simplify some things and seems more logical to me.

@philjdf (Owner Author) commented Dec 15, 2021

Can't find a report on why it's failing the code formatting check...

@adamreeve commented Dec 15, 2021

Hmm, maybe we need to change the verbosity to show that, or show the diff. You should be able to run the tool locally to fix the formatting though, e.g.:

dotnet tool restore
dotnet jb cleanupcode "csharp" "csharp.test" "csharp.benchmark" --profile="Built-in: Reformat Code" --settings="ParquetSharp.DotSettings"

@philjdf (Owner Author) commented Dec 15, 2021

Ah great, thanks Adam! That command is in the log, sorry I didn't spot it.

@philjdf (Owner Author) commented Dec 22, 2021

Took a look at maps:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
        'col1': pd.Series([
            [('key', 'aaaa'), ('value', '1111')],
            [('key', 'bbbb'), ('value', '2222')],
        ]),
        'col2': pd.Series(['foo', 'bar'])
    }
)

udt = pa.map_(pa.string(), pa.string())
schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])

table = pa.Table.from_pandas(df, schema)
pq.write_table(table, 'map.parquet')

which gives a Parquet file containing three columns: col1.key_value.key, col1.key_value.value, and col2. It would be nice if we could read/write these as string[], string[], and string respectively. Really we'd like to be able to combine the first two columns into a single Dictionary<string, string> column, but that's a higher-level concern (similar to how the individual columns for fields in a struct could be combined with some reflection on the struct), and definitely not something for this PR. But it would be nice to at least be able to read/write the individual columns. We currently fail with

System.ArgumentOutOfRangeException: 'unsupported logical type String with physical type ByteArray '

which is because the key is required, hence there's no match in LogicalTypeFactory.cs's DefaultPrimitiveMapping. I suspect there will be a similar issue trying to read/write required lists...
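
The required key shows up in the physical schema; here's a sketch inspecting the map.parquet file written above with PyArrow:

import pyarrow.parquet as pq

# The MAP-annotated group contains a repeated key_value group
# with a required key and an optional value
print(pq.ParquetFile('map.parquet').schema)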

@philjdf (Owner Author) commented Dec 29, 2021

I think it's nearly there with maps and required arrays too, as of the latest commit.

The thinking is that even if you have to mess around with schema Nodes directly, and deal with data at a lower level than you otherwise might want, at least being able to read/write/round-trip data for these schemas supports more of the Parquet spec and increases confidence in the ParquetSharp array handling code. That is, this doesn't try to do anything higher level, e.g. make the TKey[] and TValue[] columns appear as a single Dictionary<TKey, TValue> column.

Also, the Python code appears to be doing integration testing; I've added a map.parquet file here, but maybe this needs some more thought.

@@ -134,6 +134,11 @@ public virtual unsafe (Type physicalType, Type logicalType) GetSystemTypes(Colum
return (typeof(ByteArray), typeof(byte[]));
}

if (logicalType.Type == LogicalTypeEnum.String)
@philjdf (Owner Author):

This is here because we don't return earlier in the function: the default mapping doesn't match the column descriptor's required value.

@philjdf (Owner Author) commented Jan 5, 2022

I think the only outstanding issue not covered by this thread is tests for how this new stuff plays with custom type handling.

@philjdf (Owner Author) commented Feb 12, 2022

I did some benchmarking, given that reading arrays appears to be slow and we'll probably need to address that soon. I wanted to check that this branch doesn't introduce any regressions.

The benchmarks are floats (reading a time series of floats) and bigarray (reading a time series of float arrays). The float array benchmark uses quite a big file (1 GB) to make sure there's a good signal-to-noise ratio for profiling.

floats, netcoreapp3.1, latest master ParquetSharp:

| Method | Mean | Error | StdDev | Ratio | RatioSD | Size (Bytes) |
|---|---|---|---|---|---|---|
| ParquetSharp | 181.0 ms | 3.49 ms | 3.74 ms | 1.00 | 0.00 | |
| ParquetDotNet | 563.1 ms | 11.25 ms | 22.20 ms | 3.03 | 0.10 | |

floats, net5.0, latest master ParquetSharp:

| Method | Mean | Error | StdDev | Ratio | RatioSD | Size (Bytes) |
|---|---|---|---|---|---|---|
| ParquetSharp | 178.1 ms | 3.45 ms | 4.24 ms | 1.00 | 0.00 | |
| ParquetDotNet | 551.7 ms | 10.91 ms | 18.52 ms | 3.10 | 0.13 | |

bigarray, net5.0, latest master ParquetSharp:

| Method | Mean | Error | StdDev | Ratio | Size (Bytes) |
|---|---|---|---|---|---|
| ParquetSharp | 12.38 s | 0.043 s | 0.040 s | 1.00 | |
| ParquetDotNet | 10.89 s | 0.090 s | 0.084 s | 0.88 | |

(ParquetDotNet seems to use a ton more memory)

floats, net5.0, this branch of ParquetSharp:

| Method | Mean | Error | StdDev | Ratio | RatioSD | Size (Bytes) |
|---|---|---|---|---|---|---|
| ParquetSharp | 178.6 ms | 3.55 ms | 6.32 ms | 1.00 | 0.00 | |
| ParquetDotNet | 531.1 ms | 8.90 ms | 8.32 ms | 2.97 | 0.15 | |

bigarray, net5.0, this branch of ParquetSharp:

| Method | Mean | Error | StdDev | Ratio | Size (Bytes) |
|---|---|---|---|---|---|
| ParquetSharp | 12.90 s | 0.163 s | 0.152 s | 1.00 | |
| ParquetDotNet | 10.86 s | 0.061 s | 0.051 s | 0.84 | |

So in summary: netcoreapp3.1 -> net5.0 isn't a big difference, and at least as far as these test cases go, this branch doesn't slow us down, but we're slower than ParquetDotNet on the big array case.

I've committed to this branch:

  • Updated to the latest ParquetDotNet.
  • Added the float array benchmark.
  • Switched to net5.0 for the benchmark project.

Benchmark machine: Ryzen 5950X, 64GB memory, Samsung 1TB 970 Evo, as few LEDs as possible (set to white).

@philjdf (Owner Author) commented Feb 16, 2022

#5 is WIP (unit tests aren't passing yet), but it brings bigarray down to 5 s from 12.9 s.

philjdf marked this pull request as ready for review June 29, 2022 19:48