Nested struct support #4

Open · philjdf wants to merge 39 commits into master
Conversation

@philjdf (Owner) commented Dec 5, 2021

No description provided.

@@ -9,6 +9,8 @@ internal sealed class BufferedReader<TPhysicalvalue> where TPhysicalvalue : unma
{
public BufferedReader(ColumnReader reader, TPhysicalvalue[] values, short[]? defLevels, short[]? repLevels)
{
if (defLevels == null) throw new InvalidOperationException("definition levels not defined");

Why is defLevels now required here? If it actually is required then the parameter should probably be non-nullable, but I would think it still needs to be optional.

Ah, I see BufferedReader is only used for reading arrays now. So we would expect repLevels to always be present too, right? Also, would it be valid to have required arrays such that defLevels were not present?

If we are going to do a null check here then the private properties should probably be non-nullable so that the ?? -1 isn't required below.

@adamreeve Dec 7, 2021

I had a play around with this to try to understand things better. In PyArrow you can round-trip a required list, but we don't currently support reading this in ParquetSharp and crash with "elementType is an array but schema does not match the expected layout". But you do always have defLevels set, even for required lists, due to the middle repeated "list" schema node. (Would probably be nice to fix this later, but it's not high priority.)
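
A minimal sketch of that PyArrow round-trip (the required_list.parquet file name here is just illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

# A list field marked non-nullable becomes a required LIST group in the Parquet schema
schema = pa.schema([pa.field('values', pa.list_(pa.int32()), nullable=False)])
table = pa.Table.from_arrays([[[1, 2], [3], []]], schema=schema)
pq.write_table(table, 'required_list.parquet')
# PyArrow reads this back fine; ParquetSharp currently crashes on it
print(pq.read_table('required_list.parquet'))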

And I see now that for nested fields without arrays, repLevels will be null, so that needs to stay as optional.

For non-nullable nested fields defLevels may not be defined though, so I think this null check is actually incorrect. However, if I comment it out I get an "Invalid input stream" exception later, due to checking defn.DefLevel < nullDefinitionLevel in MakeLeafReaderSingle where defn.DefLevel == -1. E.g. I can't currently read a file generated like this:

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([pa.field('struct', type=pa.struct([pa.field('el', pa.int32(), nullable=False)]), nullable=False)])
table = pa.Table.from_arrays([[(1, ), (2, ), (3, )]], schema=schema)
pq.write_table(table, 'test.parquet')

It looks like the fix is as simple as changing the -1s below to 0, which makes more sense to me; if no definition levels are required then everything must be implicitly defined at level 0 (and probably similarly for repLevels?). It would be good to add a unit test for this case too; see the sketch below.
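
For such a unit test, the expected values can be sanity-checked against PyArrow (a sketch reading back the test.parquet written above):

import pyarrow.parquet as pq

table = pq.read_table('test.parquet')
# Expect three non-null structs: {'el': 1}, {'el': 2}, {'el': 3}
print(table.column('struct').to_pylist())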

csharp.test/TestLogicalTypeRoundtrip.cs (outdated, resolved)
// TODO: Skip if elementType is a reference type?
// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
elementType = elementType.BaseType != typeof(object) && elementType.BaseType != typeof(Array) ?
typeof(Nullable<>).MakeGenericType(elementType) : elementType;

Would elementType.BaseType.IsValueType ? ... make more sense here?

else if (node is Schema.GroupNode && node.Parent != null)
{
// TODO: Skip if elementType is a reference type?
elementType = typeof(Nested<>).MakeGenericType(elementType);

I don't think I follow why we'd skip the nesting here for reference types. Wouldn't we always have nesting for group nodes? (Except for LIST groups, but those should already be skipped above, and I guess MAP key-values, which we don't yet support.)

|| node.Parent.LogicalType.Type != LogicalTypeEnum.List
|| node.Parent.Repetition is not (Repetition.Optional or Repetition.Required))
{
throw new Exception("Schema not according to Parquet spec");

The Parquet docs do say, "A repeated field that is neither contained by a LIST- or MAP-annotated group nor annotated by LIST or MAP should be interpreted as a required list of required elements where the element type is the type of the field". But I don't think we currently support that, right? So throwing here isn't a breaking change.

@@ -25,6 +24,23 @@ public virtual void Dispose()
Source.Dispose();
}

protected static bool RepLevelsRequired(Type type)

I wonder if these should be private protected methods. The LogicalColumnStream class is public due to LogicalColumnReader etc being public, but we don't actually expect users to implement their own classes deriving from this.

var columnWriter = (ColumnWriter<TPhysical>)Source;

var writer0 = MakeWriter(schemaNodes, elementType, (short)(repetitionLevel + 1), (short)(nullDefinitionLevel + 2), firstLeafRepLevel, false);
var writer = MakeWriter(schemaNodes, elementType, (short)(repetitionLevel + 1), (short)(nullDefinitionLevel + 2), repetitionLevel, false);

It took me a while to wrap my head around why we need a writer0 and a writer with different firstLeafRepLevel values; it would be good to have a thorough comment explaining this part.

return result.Length;
}

private Func<int, Array> MakeReader(Node[] schemaNodes, Type elementType, int repetitionLevel, int nullDefinitionLevel, bool wantSingleItem)

As mentioned on Slack, switching to using a definitionLevel instead of nullDefinitionLevel, as was done on master, would probably simplify some things and seems more logical to me.

@philjdf (Owner Author) commented Dec 15, 2021

Can't find a report on why it's failing the code formatting check...

@adamreeve commented Dec 15, 2021

Hmm, maybe we need to change the verbosity to show that, or show the diff. You should be able to run the tool locally to fix the formatting though, e.g.:

dotnet tool restore
dotnet jb cleanupcode "csharp" "csharp.test" "csharp.benchmark" --profile="Built-in: Reformat Code" --settings="ParquetSharp.DotSettings"

@philjdf (Owner Author) commented Dec 15, 2021

Ah great, thanks Adam! That command is in the log, sorry I didn't spot it.

@philjdf (Owner Author) commented Dec 22, 2021

Took a look at maps:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
        'col1': pd.Series([
            [('key', 'aaaa'), ('value', '1111')],
            [('key', 'bbbb'), ('value', '2222')],
        ]),
        'col2': pd.Series(['foo', 'bar'])
    }
)

udt = pa.map_(pa.string(), pa.string())
schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])

table = pa.Table.from_pandas(df, schema)
pq.write_table(table, 'map.parquet')

which gives a Parquet file containing three columns: col1.key_value.key, col1.key_value.value, and col2. It would be nice if we could read/write these as string[], string[], and string respectively. Really we'd like to be able to combine the first two columns into a single Dictionary<string, string> column, but that's a higher-level concern (similar to how the individual columns for fields in a struct could be combined with some reflection on the struct), and definitely not something for this PR. But it would be nice to at least be able to read/write the individual columns. We currently fail with

System.ArgumentOutOfRangeException: 'unsupported logical type String with physical type ByteArray '

which is because the key is required, hence there's no match in LogicalTypeFactory.cs's DefaultPrimitiveMapping. I suspect there will be a similar issue trying to read/write required lists...
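
The required key shows up in the physical schema; here's a sketch inspecting the map.parquet file written above with PyArrow:

import pyarrow.parquet as pq

# The MAP-annotated group contains a repeated key_value group
# with a required key and an optional value
print(pq.ParquetFile('map.parquet').schema)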

@philjdf (Owner Author) commented Dec 29, 2021

I think it's nearly there with maps and required arrays too, as of the latest commit.

The thinking is that even if you have to mess around with schema Nodes directly, and deal with data at a lower level than you otherwise might want, at least being able to read/write/round-trip data for these schemas supports more of the Parquet spec and increases confidence in the ParquetSharp array handling code. That is, this doesn't try to do anything higher level, e.g. make the TKey[] and TValue[] columns appear as a single Dictionary<TKey, TValue> column.

Also, the Python code appears to be doing integration testing; I've added a map.parquet file here, but maybe this needs some more thought.

@@ -134,6 +134,11 @@ public virtual unsafe (Type physicalType, Type logicalType) GetSystemTypes(Colum
return (typeof(ByteArray), typeof(byte[]));
}

if (logicalType.Type == LogicalTypeEnum.String)
@philjdf (Owner Author):

This is here because we don't return earlier in the function: the default mapping doesn't match the column descriptor's required value.

@philjdf (Owner Author) commented Jan 5, 2022

I think the only outstanding issue not covered by this thread is tests for how this new stuff plays with custom type handling.

@philjdf (Owner Author) commented Feb 12, 2022

I did some benchmarking, given that reading arrays appears to be slow and we'll probably need to address that soon. I wanted to check that this branch doesn't introduce any regressions.

The benchmarks are floats (reading a time series of floats) and bigarray (reading a time series of float arrays). The float array benchmark uses quite a big file (1 GB) to make sure there's a good signal-to-noise ratio for profiling.

floats, netcoreapp3.1, latest master ParquetSharp:

| Method | Mean | Error | StdDev | Ratio | RatioSD | Size (Bytes) |
|---|---|---|---|---|---|---|
| ParquetSharp | 181.0 ms | 3.49 ms | 3.74 ms | 1.00 | 0.00 | |
| ParquetDotNet | 563.1 ms | 11.25 ms | 22.20 ms | 3.03 | 0.10 | |

floats, net5.0, latest master ParquetSharp:

| Method | Mean | Error | StdDev | Ratio | RatioSD | Size (Bytes) |
|---|---|---|---|---|---|---|
| ParquetSharp | 178.1 ms | 3.45 ms | 4.24 ms | 1.00 | 0.00 | |
| ParquetDotNet | 551.7 ms | 10.91 ms | 18.52 ms | 3.10 | 0.13 | |

bigarray, net5.0, latest master ParquetSharp:

| Method | Mean | Error | StdDev | Ratio | Size (Bytes) |
|---|---|---|---|---|---|
| ParquetSharp | 12.38 s | 0.043 s | 0.040 s | 1.00 | |
| ParquetDotNet | 10.89 s | 0.090 s | 0.084 s | 0.88 | |

(ParquetDotNet seems to use a ton more memory)

floats, net5.0, this branch of ParquetSharp:

| Method | Mean | Error | StdDev | Ratio | RatioSD | Size (Bytes) |
|---|---|---|---|---|---|---|
| ParquetSharp | 178.6 ms | 3.55 ms | 6.32 ms | 1.00 | 0.00 | |
| ParquetDotNet | 531.1 ms | 8.90 ms | 8.32 ms | 2.97 | 0.15 | |

bigarray, net5.0, this branch of ParquetSharp:

| Method | Mean | Error | StdDev | Ratio | Size (Bytes) |
|---|---|---|---|---|---|
| ParquetSharp | 12.90 s | 0.163 s | 0.152 s | 1.00 | |
| ParquetDotNet | 10.86 s | 0.061 s | 0.051 s | 0.84 | |

So in summary: netcoreapp3.1 -> net5.0 isn't a big difference, and at least as far as these test cases go, this branch doesn't slow us down, but we're slower than ParquetDotNet on the big array case.

I've committed to this branch:

  • Updated to the latest ParquetDotNet.
  • Added the float array benchmark.
  • Switched to net5.0 for the benchmark project.

Benchmark machine: Ryzen 5950X, 64GB memory, Samsung 1TB 970 Evo, as few LEDs as possible (set to white).

@philjdf (Owner Author) commented Feb 16, 2022

#5 is WIP (unit tests aren't passing yet), but it brings bigarray down to 5 s from 12.9 s.

philjdf marked this pull request as ready for review June 29, 2022 19:48