Documentation Enhancements (#383)
jescalada authored Oct 16, 2023
1 parent 4ec1e20 commit 69afa87
Showing 5 changed files with 71 additions and 21 deletions.
12 changes: 11 additions & 1 deletion docs/Arrow.md
@@ -18,6 +18,8 @@ In this example, we'll open a file using a path:
using var fileReader = new FileReader("data.parquet");
```

### Inspecting the schema

We can then inspect the Arrow schema that will be used when reading the file:

```csharp
@@ -28,6 +30,8 @@ foreach (var field in schema.FieldsList)
}
```

### Reading data

To read data from the file, we use the `GetRecordBatchReader` method,
which returns an `Apache.Arrow.IArrowArrayStream`.
By default, this will read data for all row groups in the file and all columns,
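
The rest of this example is collapsed in this view, but as a minimal sketch (assuming the `fileReader` opened earlier and an async context; variable names are illustrative), the stream can be consumed batch by batch:

```csharp
using var batchReader = fileReader.GetRecordBatchReader();

while (await batchReader.ReadNextRecordBatchAsync() is { } batch)
{
    using (batch)
    {
        Console.WriteLine($"Read a batch with {batch.Length} rows");
    }
}
```
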
@@ -138,7 +142,9 @@ using var writer = new FileWriter("data.parquet", schema);
```

Rather than specifying a file path, we could also write to a .NET `System.IO.Stream`
-or a subclass of `ParquetShap.IO.OutputStream`.
+or a subclass of `ParquetSharp.IO.OutputStream`.

### Writing data in batches

Now we're ready to write batches of data:
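
The construction of the record batch is collapsed in this view; as a hedged sketch, a batch might be built with the Apache.Arrow builders (assuming a single-column Int32 schema from earlier examples, with illustrative names):

```csharp
// Build a small record batch matching an assumed single-column schema, then write it
var ids = new Int32Array.Builder().AppendRange(new[] {1, 2, 3}).Build();
var recordBatch = new RecordBatch(schema, new IArrowArray[] {ids}, length: 3);

writer.WriteRecordBatch(recordBatch);
```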

@@ -158,6 +164,8 @@ if it contains more rows than the chunk size, which can be specified when writin
writer.WriteRecordBatch(recordBatch, chunkSize: 1024);
```

### Writing data one column at a time

Rather than writing record batches, you may also explicitly start Parquet row groups
and write data one column at a time, for more control over how data is written:
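
The original example is partially collapsed below. As a rough, non-authoritative sketch, this might look as follows, assuming `NewRowGroup` and `WriteColumnChunk` methods that mirror the Arrow C++ writer API (check the ParquetSharp.Arrow API reference for the exact signatures):

```csharp
// Assumptions: NewRowGroup starts a new row group, and WriteColumnChunk
// writes one Arrow array as the next column chunk in schema order.
var ids = new Int32Array.Builder().AppendRange(new[] {1, 2, 3}).Build();

writer.NewRowGroup();
writer.WriteColumnChunk(ids);
```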

@@ -172,6 +180,8 @@ for (var batchNumber = 0; batchNumber < 10; ++batchNumber)
}
```

### Closing the file

Finally, we should call the `Close` method when we have finished writing data,
which will write the Parquet file footer and close the file.
It is recommended to always explicitly call `Close`
29 changes: 20 additions & 9 deletions docs/Nested.md
@@ -31,12 +31,14 @@ Imagine we have the following JSON object we would like to store as Parquet:
}
```

### Building the schema

We will store this data with one object per logical row in the Parquet file.
We could use a schema with a `message` column and an `ids` column,
but when both `message` and `ids` are null we would not be able to determine whether
-this is because the top level object was null in the source data,
+this is because the top-level object was null in the source data,
or we had a non-null object with a null `message` and null `ids`.
-Instead we will represent this data in Parquet with a single
+Instead, we will represent this data in Parquet with a single
`objects` column.

In the Parquet schema, we have one top-level group node named `objects`,
@@ -49,7 +51,7 @@ The schema needs to be built from the bottom up, and can be defined as follows:
using var messageNode = new PrimitiveNode(
"message", Repetition.Optional, LogicalType.String(), PhysicalType.ByteArray);

-// Lists are defined with three nodes, an outer List annotated node,
+// Lists are defined with three nodes: an outer List annotated node,
// an inner repeated group named "list", and an inner "item" node for list elements.
using var itemNode = new PrimitiveNode(
"item", Repetition.Required, LogicalType.None(), PhysicalType.Int32);
@@ -67,6 +69,8 @@ using var schema = new GroupNode(
"schema", Repetition.Required, new Node[] {groupNode});
```

### Writing data

We can then create a `ParquetFileWriter` with this schema:

```csharp
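// (The original example is collapsed in this diff view.)
// A hedged sketch: build default writer properties and create the writer
// from the schema GroupNode defined above; the file name is illustrative.
using var writerProperties = new WriterPropertiesBuilder().Build();
using var fileWriter = new ParquetFileWriter("objects.parquet", schema, writerProperties);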
@@ -165,9 +169,9 @@ for (var i = 0; i < numRows; ++i)
}
```

-Reading data wrapped in the `Nested` type is optional and this type can be ommitted
-from the `TElement` parameter passed to the `LogicalReader<TElement>` method to read unwrapped values,
-for example:
+Reading data wrapped in the `Nested` type is optional. The `Nested` type can be omitted
+from the `TElement` parameter passed to the `LogicalReader<TElement>` method to read unwrapped values.
+For example:

```csharp
using var messagesReader = groupReader.Column(0).LogicalReader<string?>();
@@ -181,7 +185,7 @@ If using the non-generic `LogicalReader` method,
the `Nested` wrapper type is not used by default for simplicity and backwards compatibility,
but this behaviour can be changed by using the override that takes a `useNesting` parameter.

-## Maps
+## Working with .NET Dictionary data

The Map logical type in Parquet represents a map from keys to values,
and is a special case of nested data.
@@ -196,8 +200,9 @@ The first contains arrays of the map keys,
and the second contains arrays of the map values,
and the arrays corresponding to the same row must have the same length.
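
For example, two dictionary rows might be flattened into per-row key and value arrays as in this small illustrative sketch (names are hypothetical):

```csharp
using System.Collections.Generic;
using System.Linq;

// Two logical rows of map data
var row0 = new Dictionary<string, int> {{"a", 1}, {"b", 2}};
var row1 = new Dictionary<string, int> {{"c", 3}};

// Flattened into one keys column and one values column
string[][] keys = {row0.Keys.ToArray(), row1.Keys.ToArray()};    // {"a", "b"}, {"c"}
int[][] values = {row0.Values.ToArray(), row1.Values.ToArray()}; // {1, 2}, {3}
```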

-The following example shows how dotnet dictionary data might be written
-and then read from Parquet:
+### Writing dictionary data
+
+The following example shows how to write dictionary data to Parquet:

```csharp
// Start with a single column of dictionary data
@@ -231,7 +236,13 @@ using (var fileWriter = new ParquetFileWriter("map_data.parquet", schema, writer
valueWriter.WriteBatch(values);
fileWriter.Close();
}
```

### Reading dictionary data

We can read data from a Parquet file into a `Dictionary` array as follows:

```csharp
// Read back key and value columns from the file
string[][] readKeys;
int[][] readValues;
18 changes: 16 additions & 2 deletions docs/Reading.md
@@ -13,7 +13,10 @@ using var input = new ManagedRandomAccessFile(File.OpenRead("data.parquet"));
using var fileReader = new ParquetFileReader(input);
```

### Obtaining file metadata

The `FileMetaData` property of a `ParquetFileReader` exposes information about the Parquet file and its schema:

```csharp
int numColumns = fileReader.FileMetaData.NumColumns;
long numRows = fileReader.FileMetaData.NumRows;
@@ -22,11 +25,13 @@ IReadOnlyDictionary<string, string> metadata = fileReader.FileMetaData.KeyValueM

SchemaDescriptor schema = fileReader.FileMetaData.Schema;
for (int columnIndex = 0; columnIndex < schema.NumColumns; ++columnIndex) {
-ColumnDescriptor colum = schema.Column(columnIndex);
+ColumnDescriptor column = schema.Column(columnIndex);
string columnName = column.Name;
}
```

### Reading row groups

Parquet files store data in separate row groups, which all share the same schema,
so if you wish to read all data in a file, you generally want to loop over all of the row groups
and create a `RowGroupReader` for each one:
@@ -38,6 +43,8 @@ for (int rowGroup = 0; rowGroup < fileReader.FileMetaData.NumRowGroups; ++rowGro
}
```

### Reading columns directly

The `Column` method of `RowGroupReader` takes an integer column index and returns a `ColumnReader` object,
which can read primitive values from the column, as well as raw definition level and repetition level data.
Usually you will not want to use a `ColumnReader` directly, but instead call its `LogicalReader` method to
Expand All @@ -46,13 +53,16 @@ There are two variations of this `LogicalReader` method; the plain `LogicalReade
`LogicalColumnReader`, whereas the generic `LogicalReader<TElement>` method returns a typed `LogicalColumnReader<TElement>`,
which reads values of the specified element type.


If you know ahead of time the data types for the columns you will read, you can simply use the generic methods and
read values directly. For example, to read data from the first column which represents a timestamp:

```csharp
DateTime[] timestamps = rowGroupReader.Column(0).LogicalReader<DateTime>().ReadAll(numRows);
```

### Reading columns with unknown types

However, if you don't know ahead of time the types for each column, you can implement the
`ILogicalColumnReaderVisitor<TReturn>` interface to handle column data in a type-safe way, for example:
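
The full example is collapsed below; a hedged sketch of such a visitor (the `ColumnPrinter` name matches the call shown below, but the member signature is an assumption) might be:

```csharp
using System.Text;

sealed class ColumnPrinter : ILogicalColumnReaderVisitor<string>
{
    public string OnLogicalColumnReader<TElement>(LogicalColumnReader<TElement> columnReader)
    {
        // Enumerate the column's logical values and join them into one string
        var builder = new StringBuilder();
        foreach (var value in columnReader)
        {
            builder.Append(value?.ToString());
            builder.Append(' ');
        }
        return builder.ToString();
    }
}
```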

@@ -76,6 +86,9 @@ string columnValues = rowGroupReader.Column(0).LogicalReader().Apply(new ColumnP
There's a similar `IColumnReaderVisitor<TReturn>` interface for working with `ColumnReader` objects
and reading physical values in a type-safe way, but most users will want to work at the logical element level.


### Reading data in batches

The `LogicalColumnReader<TElement>` class provides multiple ways to read data.
It implements `IEnumerable<TElement>` which internally buffers batches of data and iterates over them,
but for more fine-grained control over reading behaviour, you can read into your own buffer. For example:
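
A minimal sketch of batched reading, assuming a `LogicalColumnReader<double>` named `logicalReader` obtained as shown above:

```csharp
// Read into a caller-owned buffer in batches
var buffer = new double[4096];
while (logicalReader.HasNext)
{
    int read = logicalReader.ReadBatch(buffer);
    // Process buffer[0..read] here
}
```
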
@@ -99,6 +112,7 @@ The .NET type used to represent read values can optionally be overridden by usin
For more details, see the [type factories documentation](TypeFactories.md).

## DateTimeKind when reading Timestamps

When reading a Timestamp into a DateTime, ParquetSharp sets the DateTimeKind based on the value of `IsAdjustedToUtc`.

If `IsAdjustedToUtc` is `true`, the DateTimeKind will be set to `DateTimeKind.Utc`; otherwise it will be set to `DateTimeKind.Unspecified`.
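
For instance (a small illustrative sketch, reusing the reader pattern from above):

```csharp
// A DateTime read from a column written with isAdjustedToUtc: true
DateTime[] timestamps = rowGroupReader.Column(0).LogicalReader<DateTime>().ReadAll(1);
Console.WriteLine(timestamps[0].Kind); // Utc; Unspecified if IsAdjustedToUtc was false
```
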
@@ -135,7 +149,7 @@ has the `Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem`
registry key enabled, and the application must have a manifest that specifies it is long path aware,
for example:

-```
+```xml
<application xmlns="urn:schemas-microsoft-com:asm.v3">
<windowsSettings xmlns:ws2="http://schemas.microsoft.com/SMI/2016/WindowsSettings">
<ws2:longPathAware>true</ws2:longPathAware>
14 changes: 8 additions & 6 deletions docs/TypeFactories.md
@@ -10,17 +10,17 @@ This means that:

The API at the core of this is encompassed by `LogicalTypeFactory`, `LogicalReadConverterFactory` and `LogicalWriteConverterFactory`.

-Whenever the user uses a custom type to read or write values to a Parquet file, a `LogicalRead/WriteConverterFactory` needs to be provided. This converter factory tells to the `LogicalColumnReader/Writer` how to convert the user custom type into a physical type that is understood by Parquet.
+Whenever the user uses a custom type to read or write values to a Parquet file, a `LogicalReadConverterFactory` or `LogicalWriteConverterFactory` needs to be provided. This converter factory tells the `LogicalColumnReader` or `LogicalColumnWriter` how to convert the user's custom type into a physical type that is understood by Parquet.

-On top of that, if the custom type is used for creating the schema (when writing), or if accessing a `LogicalColumnReader/Writer` without explicitly overriding the element type (e.g. `columnWriter.LogicalReaderOverride<CustomType>()`), then a `LogicalTypeFactory` is needed in order to establish the proper logical type mapping.
+On top of that, if the custom type is used for creating the schema (when writing), or if accessing a `LogicalColumnReader` or `LogicalColumnWriter` without explicitly overriding the element type (e.g. `columnWriter.LogicalReaderOverride<CustomType>()`), then a `LogicalTypeFactory` is needed in order to establish the proper logical type mapping.

-In other words, the `LogicalTypeFactory` is required if the user provides a `Column` class with a custom type (writer only, the factory is needed to know the physical parquet type) or gets the `LogicalColumnReader/Writer` via the non type-overriding methods (in which case the factory is needed to know the full type of the logical column reader/writer). The corresponding converter factory is always needed.
+In other words, the `LogicalTypeFactory` is required if the user provides a `Column` class with a custom type (writer only, the factory is needed to know the physical Parquet type) or gets the `LogicalColumnReader` or `LogicalColumnWriter` via the non type-overriding methods (in which case the factory is needed to know the full type of the logical column reader/writer). The corresponding converter factory is always needed.

## Examples

-One of the approaches for reading custom values can be described by the following code.
+One of the approaches for reading custom values can be described by the following code:

-```C#
+```csharp
using var fileReader = new ParquetFileReader(filename) { LogicalReadConverterFactory = new ReadConverterFactory() };
using var groupReader = fileReader.RowGroup(0);
using var columnReader = groupReader.Column(0).LogicalReaderOverride<VolumeInDollars>();
@@ -46,4 +46,6 @@ One of the approaches for reading custom values can be described by the followin
}
```
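
The supporting types are collapsed in this view; a hedged sketch of what they might look like (member signatures are assumptions based on the pattern above, not the definitive API — see the test file linked below for the real thing):

```csharp
using System;
using System.Runtime.InteropServices;

// A custom user type with the same layout as the physical float values
[StructLayout(LayoutKind.Sequential)]
public readonly struct VolumeInDollars
{
    public VolumeInDollars(float value) { Value = value; }
    public readonly float Value;
}

// Maps the physical float values onto VolumeInDollars when reading.
// Assumption: GetConverter has this signature and LogicalRead exposes
// a native converter for same-layout types.
public sealed class ReadConverterFactory : LogicalReadConverterFactory
{
    public override Delegate GetConverter<TLogical, TPhysical>(
        ColumnDescriptor columnDescriptor, ColumnChunkMetaData columnChunkMetaData)
    {
        if (typeof(TLogical) == typeof(VolumeInDollars))
        {
            return LogicalRead.GetNativeConverter<VolumeInDollars, float>();
        }
        return base.GetConverter<TLogical, TPhysical>(columnDescriptor, columnChunkMetaData);
    }
}
```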

-But do check [TestLogicalTypeFactory.cs](../csharp.test/TestLogicalTypeFactory.cs) for a more comprehensive set of examples, as there are many places that can be customized and optimized by the user.
+### Learn More
+
+Check [TestLogicalTypeFactory.cs](../csharp.test/TestLogicalTypeFactory.cs) for a more comprehensive set of examples, as there are many places that can be customized and optimized by the user.
19 changes: 16 additions & 3 deletions docs/Writing.md
@@ -6,9 +6,8 @@ The low-level ParquetSharp API provides the `ParquetFileWriter` class for writin

When writing a Parquet file, you must define the schema up-front, which specifies all of the columns
in the file along with their names and types.
-This schema can be defined using a graph of `ParquetSharp.Schema.Node` instances,
-starting from a root `GroupNode`,
-but ParquetSharp also provides a convenient higher level API for defining the schema as an array
+
+ParquetSharp provides a convenient higher level API for defining the schema as an array
of `Column` objects.
A `Column` can be constructed using only a name and a type parameter that is used to
determine the logical Parquet type to write:
@@ -24,10 +23,16 @@ var columns = new Column[]
using var file = new ParquetFileWriter("float_timeseries.parquet", columns);
```

The schema can also be defined using a graph of `ParquetSharp.Schema.Node` instances,
starting from a root `GroupNode`. For concrete examples, see [How to write a file with nested columns](Nested.md).

### Overriding logical types

For more control over how values are represented in the Parquet file,
you can pass a `LogicalType` instance as the `logicalTypeOverride` parameter of the `Column` constructor.

For example, you may wish to write times or timestamps with millisecond resolution rather than the default microsecond resolution:

```csharp
var timestampColumn = new Column<DateTime>(
"Timestamp", LogicalType.Timestamp(isAdjustedToUtc: true, timeUnit: TimeUnit.Millis));
@@ -37,10 +42,13 @@ var timeColumn = new Column<TimeSpan>(

When writing decimal values, you must provide a `logicalTypeOverride` to define the precision and scale type parameters.
Currently the precision must be 29.

```csharp
var decimalColumn = new Column<decimal>("Values", LogicalType.Decimal(precision: 29, scale: 3));
```

### Metadata

As well as defining the file schema, you may optionally provide key-value metadata that is stored in the file when creating
a `ParquetFileWriter`:
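
A hedged sketch (assuming the `columns` array from above, and a `keyValueMetadata` constructor parameter):

```csharp
var metadata = new Dictionary<string, string>
{
    {"created_by", "my_application"},  // illustrative content
};

// Assumption: the writer constructor accepts key-value metadata directly
using var file = new ParquetFileWriter(
    "float_timeseries.parquet", columns, keyValueMetadata: metadata);
```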

@@ -73,6 +81,7 @@ using (var stream = new FileStream("float_timeseries.parquet", FileMode.Create))

Parquet data is written in batches of column data known as row groups.
To begin writing data, you first create a new row group:

```csharp
using RowGroupWriter rowGroup = file.AppendRowGroup();
```
@@ -100,6 +109,8 @@ you may append another row group to the file and repeat the row group writing pr
The `NextColumn` method of `RowGroupWriter` returns a `ColumnWriter`, which writes physical values to the file,
and can write definition level and repetition level data to support nullable and array values.

### Using LogicalColumnWriter

Rather than working with a `ColumnWriter` directly, it's usually more convenient to create a `LogicalColumnWriter`
with the `ColumnWriter.LogicalWriter<TElement>` method.
This allows writing an array or `ReadOnlySpan` of `TElement` to the column data,
@@ -132,6 +143,8 @@ for (int columnIndex = 0; columnIndex < file.NumColumns; ++columnIndex)
}
```
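
As a complement to the loop above (partially collapsed here), a minimal sketch of writing a single column, assuming the `rowGroup` writer and a `timestamps` array from earlier examples:

```csharp
// Write a batch of values to the next column in the row group
using var timestampWriter = rowGroup.NextColumn().LogicalWriter<DateTime>();
timestampWriter.WriteBatch(timestamps);
```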

### Closing the ParquetFileWriter

Note that it's important to explicitly call `Close` on the `ParquetFileWriter` when writing is complete,
as otherwise any errors encountered when writing may be silently ignored:
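
As a minimal sketch:

```csharp
file.Close();
```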

Expand Down
