Restructure and expand documentation #257

Merged (4 commits, Mar 11, 2022)
86 changes: 31 additions & 55 deletions README.md
@@ -25,43 +25,17 @@ Supported platforms:
| **Pre-Release Nuget** | [![NuGet latest pre-release](https://img.shields.io/nuget/vpre/ParquetSharp.svg)](https://www.nuget.org/packages/ParquetSharp/absoluteLatest) |
| **CI Build** | [![CI Status](https://github.com/G-Research/ParquetSharp/actions/workflows/ci.yml/badge.svg?branch=master&event=push)](https://github.com/G-Research/ParquetSharp/actions/workflows/ci.yml?query=branch%3Amaster+event%3Apush) |

## Quickstart

The following examples show how to write and then read a Parquet file with three columns representing a timeseries of object-value pairs.
These use the low-level API, which is recommended and closely maps to the Apache Parquet C++ API.

Writing a Parquet file:

```csharp
var timestamps = new DateTime[] { /* ... */ };
var objectIds = new int[] { /* ... */ };
var values = new float[] { /* ... */ };

var columns = new Column[]
{
@@ -75,44 +49,46 @@ using var rowGroup = file.AppendRowGroup();

using (var timestampWriter = rowGroup.NextColumn().LogicalWriter<DateTime>())
{
    timestampWriter.WriteBatch(timestamps);
}

using (var objectIdWriter = rowGroup.NextColumn().LogicalWriter<int>())
{
    objectIdWriter.WriteBatch(objectIds);
}

using (var valueWriter = rowGroup.NextColumn().LogicalWriter<float>())
{
    valueWriter.WriteBatch(values);
}

file.Close();
```

Reading the file back:

```csharp
using var file = new ParquetFileReader("float_timeseries.parquet");

for (int rowGroup = 0; rowGroup < file.FileMetaData.NumRowGroups; ++rowGroup) {
    using var rowGroupReader = file.RowGroup(rowGroup);
    var groupNumRows = checked((int) rowGroupReader.MetaData.NumRows);

    var groupTimestamps = rowGroupReader.Column(0).LogicalReader<DateTime>().ReadAll(groupNumRows);
    var groupObjectIds = rowGroupReader.Column(1).LogicalReader<int>().ReadAll(groupNumRows);
    var groupValues = rowGroupReader.Column(2).LogicalReader<float>().ReadAll(groupNumRows);
}

file.Close();
```

## Documentation

For more detailed information on how to use ParquetSharp, see the following documentation:

* [Writing Parquet files](docs/Writing.md)
* [Reading Parquet files](docs/Reading.md)
* [Row-oriented API](docs/RowOriented.md) &mdash; a higher-level API that abstracts away the column-oriented nature of Parquet files
* [Custom types](docs/TypeFactories.md) &mdash; how to override the mapping between .NET and Parquet types
* [Use from PowerShell](docs/PowerShell.md)

## Rationale

28 changes: 0 additions & 28 deletions RowOriented.md

This file was deleted.

10 changes: 10 additions & 0 deletions docs/PowerShell.md
@@ -0,0 +1,10 @@
# ParquetSharp in PowerShell

It's possible to use ParquetSharp from PowerShell.
You can install ParquetSharp with the [NuGet command line interface](https://docs.microsoft.com/en-us/nuget/reference/nuget-exe-cli-reference),
then use `Add-Type` to load `ParquetSharp.dll`.
However, you must ensure that the appropriate `ParquetSharpNative.dll` for your architecture and OS can be loaded as required,
either by putting it somewhere in your `PATH` or in the same directory as `ParquetSharp.dll`.
For examples of how to use ParquetSharp from PowerShell,
see [these scripts from Apteco](https://github.com/Apteco/HelperScripts/tree/master/scripts/parquet).
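
As a minimal sketch of what this looks like (untested; the paths here are hypothetical, so adjust them to wherever the NuGet packages were unpacked):

```powershell
# ParquetSharpNative.dll must be loadable, e.g. placed in the same
# directory as ParquetSharp.dll or somewhere on PATH.
Add-Type -Path "C:\libs\ParquetSharp\ParquetSharp.dll"

# Open a Parquet file and print its row count.
$reader = New-Object ParquetSharp.ParquetFileReader -ArgumentList "data.parquet"
Write-Host "Rows: $($reader.FileMetaData.NumRows)"
$reader.Close()
```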

99 changes: 99 additions & 0 deletions docs/Reading.md
@@ -0,0 +1,99 @@
# Reading Parquet files

The low-level ParquetSharp API provides the `ParquetFileReader` class for reading Parquet files.
This is usually constructed from a file path, but may also be constructed from a `ManagedRandomAccessFile`,
which wraps a .NET `System.IO.Stream` that supports seeking.

```csharp
using var fileReader = new ParquetFileReader("data.parquet");
```
or
```csharp
using var input = new ManagedRandomAccessFile(File.OpenRead("data.parquet"));
using var fileReader = new ParquetFileReader(input);
```

The `FileMetaData` property of a `ParquetFileReader` exposes information about the Parquet file and its schema:
```csharp
int numColumns = fileReader.FileMetaData.NumColumns;
long numRows = fileReader.FileMetaData.NumRows;
int numRowGroups = fileReader.FileMetaData.NumRowGroups;
IReadOnlyDictionary<string, string> metadata = fileReader.FileMetaData.KeyValueMetadata;

SchemaDescriptor schema = fileReader.FileMetaData.Schema;
for (int columnIndex = 0; columnIndex < schema.NumColumns; ++columnIndex) {
    ColumnDescriptor column = schema.Column(columnIndex);
    string columnName = column.Name;
}
```

Parquet files store data in separate row groups, which all share the same schema,
so if you wish to read all data in a file, you generally want to loop over all of the row groups
and create a `RowGroupReader` for each one:

```csharp
for (int rowGroup = 0; rowGroup < fileReader.FileMetaData.NumRowGroups; ++rowGroup) {
    using var rowGroupReader = fileReader.RowGroup(rowGroup);
    long groupNumRows = rowGroupReader.MetaData.NumRows;
}
```

The `Column` method of `RowGroupReader` takes an integer column index and returns a `ColumnReader` object,
which can read primitive values from the column, as well as raw definition level and repetition level data.
Usually you will not want to use a `ColumnReader` directly, but instead call its `LogicalReader` method to
create a `LogicalColumnReader` that can read logical values.
There are two variations of this `LogicalReader` method; the plain `LogicalReader` method returns an abstract
`LogicalColumnReader`, whereas the generic `LogicalReader<TElement>` method returns a typed `LogicalColumnReader<TElement>`,
which reads values of the specified element type.

If you know ahead of time the data types for the columns you will read, you can simply use the generic methods and
read values directly. For example, to read data from the first column which represents a timestamp:

```csharp
int numRows = checked((int) rowGroupReader.MetaData.NumRows);
DateTime[] timestamps = rowGroupReader.Column(0).LogicalReader<DateTime>().ReadAll(numRows);
```

However, if you don't know ahead of time the types for each column, you can implement the
`ILogicalColumnReaderVisitor<TReturn>` interface to handle column data in a type-safe way, for example:

```csharp
sealed class ColumnPrinter : ILogicalColumnReaderVisitor<string>
{
    public string OnLogicalColumnReader<TElement>(LogicalColumnReader<TElement> columnReader)
    {
        var stringBuilder = new StringBuilder();
        foreach (var value in columnReader) {
            stringBuilder.Append(value?.ToString() ?? "null");
            stringBuilder.Append(",");
        }
        return stringBuilder.ToString();
    }
}

string columnValues = rowGroupReader.Column(0).LogicalReader().Apply(new ColumnPrinter());
```

There's a similar `IColumnReaderVisitor<TReturn>` interface for working with `ColumnReader` objects
and reading physical values in a type-safe way, but most users will want to work at the logical element level.
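
As an illustrative sketch (assuming a `rowGroupReader` as above, the analogous `Apply` method on `ColumnReader`, and the hypothetical `PhysicalTypeNamer` name), a physical-level visitor has the same shape:

```csharp
// Reports the .NET physical type used to store a column's values.
sealed class PhysicalTypeNamer : IColumnReaderVisitor<string>
{
    public string OnColumnReader<TValue>(ColumnReader<TValue> columnReader)
        where TValue : unmanaged
    {
        return typeof(TValue).Name;
    }
}

string physicalType = rowGroupReader.Column(0).Apply(new PhysicalTypeNamer());
```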

The `LogicalColumnReader<TElement>` class provides multiple ways to read data.
It implements `IEnumerable<TElement>` which internally buffers batches of data and iterates over them,
but for more fine-grained control over reading behaviour, you can read into your own buffer. For example:

```csharp
var buffer = new TElement[4096];

while (logicalColumnReader.HasNext)
{
    int numRead = logicalColumnReader.ReadBatch(buffer);

    for (int i = 0; i != numRead; ++i)
    {
        TElement value = buffer[i];
        // Use value
    }
}
```

The .NET type used to represent read values can optionally be overridden by using the `ColumnReader.LogicalReaderOverride<TElement>` method.
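For example, a minimal sketch that reads the first column as a custom `VolumeInDollars` struct (the one defined in the type factories documentation linked below), assuming a suitable read converter has been registered:

```csharp
using var customReader = rowGroupReader.Column(0).LogicalReaderOverride<VolumeInDollars>();
VolumeInDollars[] volumes = customReader.ReadAll(checked((int) rowGroupReader.MetaData.NumRows));
```
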
For more details, see the [type factories documentation](TypeFactories.md).
69 changes: 69 additions & 0 deletions docs/RowOriented.md
@@ -0,0 +1,69 @@
# Row-oriented API

The row-oriented API offers a convenient way to abstract the column-oriented nature of Parquet files
at the expense of memory, speed and flexibility.
It lets one write a whole row in a single call, often resulting in more readable code.

For example, writing a file with the row-oriented API and using a tuple to represent a row of values:
> **Contributor:** Might be worth saying something about the drawbacks?

```csharp
var rand = new Random();
var timestamps = new DateTime[] { /* ... */ };
var objectIds = new int[] { /* ... */ };
var values = timestamps.Select(t => objectIds.Select(o => (float) rand.NextDouble()).ToArray()).ToArray();
var columns = new[] {"Timestamp", "ObjectId", "Value"};

using var rowWriter = ParquetFile.CreateRowWriter<(DateTime, int, float)>("float_timeseries.parquet", columns);

for (int i = 0; i != timestamps.Length; ++i)
{
    for (int j = 0; j != objectIds.Length; ++j)
    {
        rowWriter.WriteRow((timestamps[i], objectIds[j], values[i][j]));
    }
}

// Write a new row group (pretend we have new timestamps, objectIds and values)
rowWriter.StartNewRowGroup();
for (int i = 0; i != timestamps.Length; ++i)
{
    for (int j = 0; j != objectIds.Length; ++j)
    {
        rowWriter.WriteRow((timestamps[i], objectIds[j], values[i][j]));
    }
}

rowWriter.Close();
```

Internally, ParquetSharp will build up a buffer of row values and then write each column when the file
is closed or a new row group is started.
This means all values in a row group must be stored in memory at once,
and the row values buffer must be resized and copied as it grows.
Therefore, it's recommended to use the lower-level column-oriented API if performance is a concern.

## Explicit column mapping

The row-oriented API lets you specify your own column mapping with the optional `MapToColumn` attribute, so that fields map to columns independently of their names and order.

```csharp
struct MyRow
{
    [MapToColumn("ColumnA")]
    public long MyKey;

    [MapToColumn("ColumnB")]
    public string MyValue;
}

using (var rowReader = ParquetFile.CreateRowReader<MyRow>("example.parquet"))
{
    for (int i = 0; i < rowReader.FileMetaData.NumRowGroups; ++i)
    {
        var values = rowReader.ReadRows(i);
        foreach (MyRow r in values)
        {
            Console.WriteLine(r.MyKey + "/" + r.MyValue);
        }
    }
}
```
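
The same attribute-based mapping applies when writing. A minimal sketch, reusing the `MyRow` struct above:

```csharp
using (var rowWriter = ParquetFile.CreateRowWriter<MyRow>("example.parquet"))
{
    // Column names are taken from the MapToColumn attributes ("ColumnA", "ColumnB").
    rowWriter.WriteRow(new MyRow { MyKey = 1, MyValue = "first" });
    rowWriter.WriteRow(new MyRow { MyKey = 2, MyValue = "second" });
    rowWriter.Close();
}
```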
14 changes: 7 additions & 7 deletions TypeFactories.md → docs/TypeFactories.md
@@ -1,12 +1,12 @@
# Type Factories

The ParquetSharp API exposes the logic that maps C# types (called "logical system types" by ParquetSharp, as per Parquet's LogicalType) to the actual Parquet physical types, as well as the converters associated with them.

This means that:
- a user can potentially read/write any type they want, as long as they provide a viable mapping,
- a user can override the default ParquetSharp mapping and change how existing C# types are handled.

## API

The API at the core of this is encompassed by `LogicalTypeFactory`, `LogicalReadConverterFactory` and `LogicalWriteConverterFactory`.

@@ -16,7 +16,7 @@ On top of that, if the custom type is used for creating the schema (when writing

In other words, the `LogicalTypeFactory` is required whenever the user provides a `Column` class with a custom type (writer only; the factory is needed to determine the physical Parquet type) or gets the `LogicalColumnReader/Writer` via the non-type-overriding methods (in which case the factory is needed to determine the full type of the logical column reader or writer). The corresponding converter factory is always needed.

## Examples

One of the approaches for reading custom values can be described by the following code.

@@ -26,16 +26,16 @@
using var columnReader = groupReader.Column(0).LogicalReaderOverride<VolumeInDollars>();

var values = columnReader.ReadAll(checked((int) groupReader.MetaData.NumRows));

/* ... */

[StructLayout(LayoutKind.Sequential)]
private readonly struct VolumeInDollars
{
    public VolumeInDollars(float value) { Value = value; }
    public readonly float Value;
}

private sealed class ReadConverterFactory : LogicalReadConverterFactory
{
    public override Delegate GetConverter<TLogical, TPhysical>(ColumnDescriptor columnDescriptor, ColumnChunkMetaData columnChunkMetaData)
@@ -46,4 +46,4 @@
}
```

But do check [TestLogicalTypeFactory.cs](../csharp.test/TestLogicalTypeFactory.cs) for a more comprehensive set of examples, as there are many places that can be customized and optimized by the user.