Documentation Enhancements (#383)
jescalada authored Oct 16, 2023
1 parent 4ec1e20 commit 69afa87
Showing 5 changed files with 71 additions and 21 deletions.
12 changes: 11 additions & 1 deletion docs/Arrow.md
@@ -18,6 +18,8 @@ In this example, we'll open a file using a path:
using var fileReader = new FileReader("data.parquet");
```

### Inspecting the schema

We can then inspect the Arrow schema that will be used when reading the file:

```csharp
@@ -28,6 +30,8 @@ foreach (var field in schema.FieldsList)
}
```

### Reading data

To read data from the file, we use the `GetRecordBatchReader` method,
which returns an `Apache.Arrow.IArrowArrayStream`.
By default, this will read data for all row groups in the file and all columns,
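
The rest of this example is collapsed in this view, but as a minimal sketch (assuming the `fileReader` opened earlier and an async context; variable names are illustrative), the stream can be consumed batch by batch:

```csharp
using var batchReader = fileReader.GetRecordBatchReader();

while (await batchReader.ReadNextRecordBatchAsync() is { } batch)
{
    using (batch)
    {
        Console.WriteLine($"Read a batch with {batch.Length} rows");
    }
}
```
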
@@ -138,7 +142,9 @@ using var writer = new FileWriter("data.parquet", schema);
```

Rather than specifying a file path, we could also write to a .NET `System.IO.Stream`
-or a subclass of `ParquetShap.IO.OutputStream`.
+or a subclass of `ParquetSharp.IO.OutputStream`.

### Writing data in batches

Now we're ready to write batches of data:
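
The construction of the record batch is collapsed in this view; as a hedged sketch, a batch might be built with the Apache.Arrow builders (assuming a single-column Int32 schema from earlier examples, with illustrative names):

```csharp
// Build a small record batch matching an assumed single-column schema, then write it
var ids = new Int32Array.Builder().AppendRange(new[] {1, 2, 3}).Build();
var recordBatch = new RecordBatch(schema, new IArrowArray[] {ids}, length: 3);

writer.WriteRecordBatch(recordBatch);
```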

@@ -158,6 +164,8 @@ if it contains more rows than the chunk size, which can be specified when writin
writer.WriteRecordBatch(recordBatch, chunkSize: 1024);
```

### Writing data one column at a time

Rather than writing record batches, you may also explicitly start Parquet row groups
and write data one column at a time, for more control over how data is written:
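
The original example is partially collapsed below. As a rough, non-authoritative sketch, this might look as follows, assuming `NewRowGroup` and `WriteColumnChunk` methods that mirror the Arrow C++ writer API (check the ParquetSharp.Arrow API reference for the exact signatures):

```csharp
// Assumptions: NewRowGroup starts a new row group, and WriteColumnChunk
// writes one Arrow array as the next column chunk in schema order.
var ids = new Int32Array.Builder().AppendRange(new[] {1, 2, 3}).Build();

writer.NewRowGroup();
writer.WriteColumnChunk(ids);
```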

@@ -172,6 +180,8 @@ for (var batchNumber = 0; batchNumber < 10; ++batchNumber)
}
```

### Closing the file

Finally, we should call the `Close` method when we have finished writing data,
which will write the Parquet file footer and close the file.
It is recommended to always explicitly call `Close`
29 changes: 20 additions & 9 deletions docs/Nested.md
@@ -31,12 +31,14 @@ Imagine we have the following JSON object we would like to store as Parquet:
}
```

### Building the schema

We will store this data with one object per logical row in the Parquet file.
We could use a schema with a `message` column and an `ids` column,
but when both `message` and `ids` are null we would not be able to determine whether
-this is because the top level object was null in the source data,
+this is because the top-level object was null in the source data,
or we had a non-null object with a null `message` and null `ids`.
-Instead we will represent this data in Parquet with a single
+Instead, we will represent this data in Parquet with a single
`objects` column.

In the Parquet schema, we have one top-level group node named `objects`,
@@ -49,7 +51,7 @@ The schema needs to be built from the bottom up, and can be defined as follows:
using var messageNode = new PrimitiveNode(
"message", Repetition.Optional, LogicalType.String(), PhysicalType.ByteArray);

-// Lists are defined with three nodes, an outer List annotated node,
+// Lists are defined with three nodes: an outer List annotated node,
// an inner repeated group named "list", and an inner "item" node for list elements.
using var itemNode = new PrimitiveNode(
"item", Repetition.Required, LogicalType.None(), PhysicalType.Int32);
@@ -67,6 +69,8 @@ using var schema = new GroupNode(
"schema", Repetition.Required, new Node[] {groupNode});
```

### Writing data

We can then create a `ParquetFileWriter` with this schema:

```csharp
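// (The original example is collapsed in this diff view.)
// A hedged sketch: build default writer properties and create the writer
// from the schema GroupNode defined above; the file name is illustrative.
using var writerProperties = new WriterPropertiesBuilder().Build();
using var fileWriter = new ParquetFileWriter("objects.parquet", schema, writerProperties);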
@@ -165,9 +169,9 @@ for (var i = 0; i < numRows; ++i)
}
```

-Reading data wrapped in the `Nested` type is optional and this type can be ommitted
-from the `TElement` parameter passed to the `LogicalReader<TElement>` method to read unwrapped values,
-for example:
+Reading data wrapped in the `Nested` type is optional. The `Nested` type can be omitted
+from the `TElement` parameter passed to the `LogicalReader<TElement>` method to read unwrapped values.
+For example:

```csharp
using var messagesReader = groupReader.Column(0).LogicalReader<string?>();
@@ -181,7 +185,7 @@ If using the non-generic `LogicalReader` method,
the `Nested` wrapper type is not used by default for simplicity and backwards compatibility,
but this behaviour can be changed by using the override that takes a `useNesting` parameter.

-## Maps
+## Working with .NET Dictionary data

The Map logical type in Parquet represents a map from keys to values,
and is a special case of nested data.
@@ -196,8 +200,9 @@ The first contains arrays of the map keys,
and the second contains arrays of the map values,
and the arrays corresponding to the same row must have the same length.
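
For example, two dictionary rows might be flattened into per-row key and value arrays as in this small illustrative sketch (names are hypothetical):

```csharp
using System.Collections.Generic;
using System.Linq;

// Two logical rows of map data
var row0 = new Dictionary<string, int> {{"a", 1}, {"b", 2}};
var row1 = new Dictionary<string, int> {{"c", 3}};

// Flattened into one keys column and one values column
string[][] keys = {row0.Keys.ToArray(), row1.Keys.ToArray()};    // {"a", "b"}, {"c"}
int[][] values = {row0.Values.ToArray(), row1.Values.ToArray()}; // {1, 2}, {3}
```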

-The following example shows how dotnet dictionary data might be written
-and then read from Parquet:
+### Writing dictionary data
+
+The following example shows how to write dictionary data to Parquet:

```csharp
// Start with a single column of dictionary data
@@ -231,7 +236,13 @@ using (var fileWriter = new ParquetFileWriter("map_data.parquet", schema, writer
valueWriter.WriteBatch(values);
fileWriter.Close();
}
```

### Reading dictionary data

We can read data from a Parquet file into a `Dictionary` array as follows:

```csharp
// Read back key and value columns from the file
string[][] readKeys;
int[][] readValues;
18 changes: 16 additions & 2 deletions docs/Reading.md
@@ -13,7 +13,10 @@ using var input = new ManagedRandomAccessFile(File.OpenRead("data.parquet"));
using var fileReader = new ParquetFileReader(input);
```

### Obtaining file metadata

The `FileMetaData` property of a `ParquetFileReader` exposes information about the Parquet file and its schema:

```csharp
int numColumns = fileReader.FileMetaData.NumColumns;
long numRows = fileReader.FileMetaData.NumRows;
@@ -22,11 +25,13 @@ IReadOnlyDictionary<string, string> metadata = fileReader.FileMetaData.KeyValueM

SchemaDescriptor schema = fileReader.FileMetaData.Schema;
for (int columnIndex = 0; columnIndex < schema.NumColumns; ++columnIndex) {
-ColumnDescriptor colum = schema.Column(columnIndex);
+ColumnDescriptor column = schema.Column(columnIndex);
string columnName = column.Name;
}
```

### Reading row groups

Parquet files store data in separate row groups, which all share the same schema,
so if you wish to read all data in a file, you generally want to loop over all of the row groups
and create a `RowGroupReader` for each one:
@@ -38,6 +43,8 @@ for (int rowGroup = 0; rowGroup < fileReader.FileMetaData.NumRowGroups; ++rowGro
}
```

### Reading columns directly

The `Column` method of `RowGroupReader` takes an integer column index and returns a `ColumnReader` object,
which can read primitive values from the column, as well as raw definition level and repetition level data.
Usually you will not want to use a `ColumnReader` directly, but instead call its `LogicalReader` method to
Expand All @@ -46,13 +53,16 @@ There are two variations of this `LogicalReader` method; the plain `LogicalReade
`LogicalColumnReader`, whereas the generic `LogicalReader<TElement>` method returns a typed `LogicalColumnReader<TElement>`,
which reads values of the specified element type.


If you know ahead of time the data types for the columns you will read, you can simply use the generic methods and
read values directly. For example, to read data from the first column which represents a timestamp:

```csharp
DateTime[] timestamps = rowGroupReader.Column(0).LogicalReader<DateTime>().ReadAll(numRows);
```

### Reading columns with unknown types

However, if you don't know ahead of time the types for each column, you can implement the
`ILogicalColumnReaderVisitor<TReturn>` interface to handle column data in a type-safe way, for example:
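
The full example is collapsed below; a hedged sketch of such a visitor (the `ColumnPrinter` name matches the call shown below, but the member signature is an assumption) might be:

```csharp
using System.Text;

sealed class ColumnPrinter : ILogicalColumnReaderVisitor<string>
{
    public string OnLogicalColumnReader<TElement>(LogicalColumnReader<TElement> columnReader)
    {
        // Enumerate the column's logical values and join them into one string
        var builder = new StringBuilder();
        foreach (var value in columnReader)
        {
            builder.Append(value?.ToString());
            builder.Append(' ');
        }
        return builder.ToString();
    }
}
```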

@@ -76,6 +86,9 @@ string columnValues = rowGroupReader.Column(0).LogicalReader().Apply(new ColumnP
There's a similar `IColumnReaderVisitor<TReturn>` interface for working with `ColumnReader` objects
and reading physical values in a type-safe way, but most users will want to work at the logical element level.


### Reading data in batches

The `LogicalColumnReader<TElement>` class provides multiple ways to read data.
It implements `IEnumerable<TElement>` which internally buffers batches of data and iterates over them,
but for more fine-grained control over reading behaviour, you can read into your own buffer. For example:
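
A minimal sketch of batched reading, assuming a `LogicalColumnReader<double>` named `logicalReader` obtained as shown above:

```csharp
// Read into a caller-owned buffer in batches
var buffer = new double[4096];
while (logicalReader.HasNext)
{
    int read = logicalReader.ReadBatch(buffer);
    // Process buffer[0..read] here
}
```
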
@@ -99,6 +112,7 @@ The .NET type used to represent read values can optionally be overridden by usin
For more details, see the [type factories documentation](TypeFactories.md).

## DateTimeKind when reading Timestamps

When reading a Timestamp into a DateTime, ParquetSharp sets the DateTimeKind based on the value of `IsAdjustedToUtc`.

If `IsAdjustedToUtc` is `true`, the DateTimeKind will be set to `DateTimeKind.Utc`; otherwise it will be set to `DateTimeKind.Unspecified`.
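
For instance (a small illustrative sketch, reusing the reader pattern from above):

```csharp
// A DateTime read from a column written with isAdjustedToUtc: true
DateTime[] timestamps = rowGroupReader.Column(0).LogicalReader<DateTime>().ReadAll(1);
Console.WriteLine(timestamps[0].Kind); // Utc; Unspecified if IsAdjustedToUtc was false
```
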
@@ -135,7 +149,7 @@ has the `Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem`
registry key enabled, and the application must have a manifest that specifies it is long path aware,
for example:

-```
+```xml
<application xmlns="urn:schemas-microsoft-com:asm.v3">
<windowsSettings xmlns:ws2="http://schemas.microsoft.com/SMI/2016/WindowsSettings">
<ws2:longPathAware>true</ws2:longPathAware>
14 changes: 8 additions & 6 deletions docs/TypeFactories.md
@@ -10,17 +10,17 @@ This means that:

The API at the core of this is encompassed by `LogicalTypeFactory`, `LogicalReadConverterFactory` and `LogicalWriteConverterFactory`.

-Whenever the user uses a custom type to read or write values to a Parquet file, a `LogicalRead/WriteConverterFactory` needs to be provided. This converter factory tells to the `LogicalColumnReader/Writer` how to convert the user custom type into a physical type that is understood by Parquet.
+Whenever the user uses a custom type to read or write values to a Parquet file, a `LogicalReadConverterFactory` or `LogicalWriteConverterFactory` needs to be provided. This converter factory tells the `LogicalColumnReader` or `LogicalColumnWriter` how to convert the user's custom type into a physical type that is understood by Parquet.

-On top of that, if the custom type is used for creating the schema (when writing), or if accessing a `LogicalColumnReader/Writer` without explicitly overriding the element type (e.g. `columnWriter.LogicalReaderOverride<CustomType>()`), then a `LogicalTypeFactory` is needed in order to establish the proper logical type mapping.
+On top of that, if the custom type is used for creating the schema (when writing), or if accessing a `LogicalColumnReader` or `LogicalColumnWriter` without explicitly overriding the element type (e.g. `columnWriter.LogicalReaderOverride<CustomType>()`), then a `LogicalTypeFactory` is needed in order to establish the proper logical type mapping.

-In other words, the `LogicalTypeFactory` is required if the user provides a `Column` class with a custom type (writer only, the factory is needed to know the physical parquet type) or gets the `LogicalColumnReader/Writer` via the non type-overriding methods (in which case the factory is needed to know the full type of the logical column reader/writer). The corresponding converter factory is always needed.
+In other words, the `LogicalTypeFactory` is required if the user provides a `Column` class with a custom type (writer only, the factory is needed to know the physical Parquet type) or gets the `LogicalColumnReader` or `LogicalColumnWriter` via the non type-overriding methods (in which case the factory is needed to know the full type of the logical column reader/writer). The corresponding converter factory is always needed.

## Examples

-One of the approaches for reading custom values can be described by the following code.
+One of the approaches for reading custom values can be described by the following code:

-```C#
+```csharp
using var fileReader = new ParquetFileReader(filename) { LogicalReadConverterFactory = new ReadConverterFactory() };
using var groupReader = fileReader.RowGroup(0);
using var columnReader = groupReader.Column(0).LogicalReaderOverride<VolumeInDollars>();
@@ -46,4 +46,6 @@ One of the approaches for reading custom values can be described by the followin
}
```
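
The supporting types are collapsed in this view; a hedged sketch of what they might look like (member signatures are assumptions based on the pattern above, not the definitive API — see the test file linked below for the real thing):

```csharp
using System;
using System.Runtime.InteropServices;

// A custom user type with the same layout as the physical float values
[StructLayout(LayoutKind.Sequential)]
public readonly struct VolumeInDollars
{
    public VolumeInDollars(float value) { Value = value; }
    public readonly float Value;
}

// Maps the physical float values onto VolumeInDollars when reading.
// Assumption: GetConverter has this signature and LogicalRead exposes
// a native converter for same-layout types.
public sealed class ReadConverterFactory : LogicalReadConverterFactory
{
    public override Delegate GetConverter<TLogical, TPhysical>(
        ColumnDescriptor columnDescriptor, ColumnChunkMetaData columnChunkMetaData)
    {
        if (typeof(TLogical) == typeof(VolumeInDollars))
        {
            return LogicalRead.GetNativeConverter<VolumeInDollars, float>();
        }
        return base.GetConverter<TLogical, TPhysical>(columnDescriptor, columnChunkMetaData);
    }
}
```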

-But do check [TestLogicalTypeFactory.cs](../csharp.test/TestLogicalTypeFactory.cs) for a more comprehensive set of examples, as there are many places that can be customized and optimized by the user.
+### Learn More
+
+Check [TestLogicalTypeFactory.cs](../csharp.test/TestLogicalTypeFactory.cs) for a more comprehensive set of examples, as there are many places that can be customized and optimized by the user.
19 changes: 16 additions & 3 deletions docs/Writing.md
@@ -6,9 +6,8 @@ The low-level ParquetSharp API provides the `ParquetFileWriter` class for writin

When writing a Parquet file, you must define the schema up-front, which specifies all of the columns
in the file along with their names and types.
-This schema can be defined using a graph of `ParquetSharp.Schema.Node` instances,
-starting from a root `GroupNode`,
-but ParquetSharp also provides a convenient higher level API for defining the schema as an array
+
+ParquetSharp provides a convenient higher level API for defining the schema as an array
of `Column` objects.
A `Column` can be constructed using only a name and a type parameter that is used to
determine the logical Parquet type to write:
@@ -24,10 +23,16 @@ var columns = new Column[]
using var file = new ParquetFileWriter("float_timeseries.parquet", columns);
```

The schema can also be defined using a graph of `ParquetSharp.Schema.Node` instances,
starting from a root `GroupNode`. For concrete examples, see [How to write a file with nested columns](Nested.md).

### Overriding logical types

For more control over how values are represented in the Parquet file,
you can pass a `LogicalType` instance as the `logicalTypeOverride` parameter of the `Column` constructor.

For example, you may wish to write times or timestamps with millisecond resolution rather than the default microsecond resolution:

```csharp
var timestampColumn = new Column<DateTime>(
"Timestamp", LogicalType.Timestamp(isAdjustedToUtc: true, timeUnit: TimeUnit.Millis));
@@ -37,10 +42,13 @@ var timeColumn = new Column<TimeSpan>(

When writing decimal values, you must provide a `logicalTypeOverride` to define the precision and scale type parameters.
Currently the precision must be 29.

```csharp
var decimalColumn = new Column<decimal>("Values", LogicalType.Decimal(precision: 29, scale: 3));
```

### Metadata

As well as defining the file schema, you may optionally provide key-value metadata that is stored in the file when creating
a `ParquetFileWriter`:
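
A hedged sketch (assuming the `columns` array from above, and a `keyValueMetadata` constructor parameter):

```csharp
var metadata = new Dictionary<string, string>
{
    {"created_by", "my_application"},  // illustrative content
};

// Assumption: the writer constructor accepts key-value metadata directly
using var file = new ParquetFileWriter(
    "float_timeseries.parquet", columns, keyValueMetadata: metadata);
```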

@@ -73,6 +81,7 @@ using (var stream = new FileStream("float_timeseries.parquet", FileMode.Create))

Parquet data is written in batches of column data known as row groups.
To begin writing data, you first create a new row group:

```csharp
using RowGroupWriter rowGroup = file.AppendRowGroup();
```
@@ -100,6 +109,8 @@ you may append another row group to the file and repeat the row group writing pr
The `NextColumn` method of `RowGroupWriter` returns a `ColumnWriter`, which writes physical values to the file,
and can write definition level and repetition level data to support nullable and array values.

### Using LogicalColumnWriter

Rather than working with a `ColumnWriter` directly, it's usually more convenient to create a `LogicalColumnWriter`
with the `ColumnWriter.LogicalWriter<TElement>` method.
This allows writing an array or `ReadOnlySpan` of `TElement` to the column data,
@@ -132,6 +143,8 @@ for (int columnIndex = 0; columnIndex < file.NumColumns; ++columnIndex)
}
```
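
As a complement to the loop above (partially collapsed here), a minimal sketch of writing a single column, assuming the `rowGroup` writer and a `timestamps` array from earlier examples:

```csharp
// Write a batch of values to the next column in the row group
using var timestampWriter = rowGroup.NextColumn().LogicalWriter<DateTime>();
timestampWriter.WriteBatch(timestamps);
```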

### Closing the ParquetFileWriter

Note that it's important to explicitly call `Close` on the `ParquetFileWriter` when writing is complete,
as otherwise any errors encountered when writing may be silently ignored:
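
As a minimal sketch:

```csharp
file.Close();
```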

Expand Down
