Restructure and expand documentation #257

Merged (4 commits, Mar 11, 2022)
86 changes: 31 additions & 55 deletions README.md
@@ -25,43 +25,17 @@ Supported platforms:
| **Pre-Release Nuget** | [![NuGet latest pre-release](https://img.shields.io/nuget/vpre/ParquetSharp.svg)](https://www.nuget.org/packages/ParquetSharp/absoluteLatest) |
| **CI Build** | [![CI Status](https://github.com/G-Research/ParquetSharp/actions/workflows/ci.yml/badge.svg?branch=master&event=push)](https://github.com/G-Research/ParquetSharp/actions/workflows/ci.yml?query=branch%3Amaster+event%3Apush) |

## Quickstart

The following examples show how to write and then read a Parquet file with three columns representing a timeseries of object-value pairs.
These use the low-level API, which is recommended and closely maps to the Apache Parquet C++ API.

Writing a Parquet file:

```csharp
var timestamps = new DateTime[] { /* ... */ };
var objectIds = new int[] { /* ... */ };
var values = new float[] { /* ... */ };

var columns = new Column[]
{
@@ -75,44 +49,46 @@ using var rowGroup = file.AppendRowGroup();

using (var timestampWriter = rowGroup.NextColumn().LogicalWriter<DateTime>())
{
    timestampWriter.WriteBatch(timestamps);
}

using (var objectIdWriter = rowGroup.NextColumn().LogicalWriter<int>())
{
    objectIdWriter.WriteBatch(objectIds);
}

using (var valueWriter = rowGroup.NextColumn().LogicalWriter<float>())
{
    valueWriter.WriteBatch(values);
}

file.Close();
```

Reading the file back:

```csharp
using var file = new ParquetFileReader("float_timeseries.parquet");

for (int rowGroup = 0; rowGroup < file.FileMetaData.NumRowGroups; ++rowGroup) {
    using var rowGroupReader = file.RowGroup(rowGroup);
    var groupNumRows = checked((int) rowGroupReader.MetaData.NumRows);

    var groupTimestamps = rowGroupReader.Column(0).LogicalReader<DateTime>().ReadAll(groupNumRows);
    var groupObjectIds = rowGroupReader.Column(1).LogicalReader<int>().ReadAll(groupNumRows);
    var groupValues = rowGroupReader.Column(2).LogicalReader<float>().ReadAll(groupNumRows);
}

file.Close();
```

## Documentation

For more detailed information on how to use ParquetSharp, see the following documentation:

* [Writing Parquet files](docs/Writing.md)
* [Reading Parquet files](docs/Reading.md)
* [Row-oriented API](docs/RowOriented.md) &mdash; a higher-level API that abstracts away the column-oriented nature of Parquet files
* [Custom types](docs/TypeFactories.md) &mdash; how to override the mapping between .NET and Parquet types
* [Use from PowerShell](docs/PowerShell.md)

## Rationale

28 changes: 0 additions & 28 deletions RowOriented.md

This file was deleted.

10 changes: 10 additions & 0 deletions docs/PowerShell.md
@@ -0,0 +1,10 @@
# ParquetSharp in PowerShell

It's possible to use ParquetSharp from PowerShell.
You can install ParquetSharp with the [NuGet command line interface](https://docs.microsoft.com/en-us/nuget/reference/nuget-exe-cli-reference),
then use `Add-Type` to load `ParquetSharp.dll`.
However, you must ensure that the appropriate `ParquetSharpNative.dll` for your architecture and OS can be loaded as required,
either by putting it somewhere in your `PATH` or in the same directory as `ParquetSharp.dll`.
For examples of how to use ParquetSharp from PowerShell,
see [these scripts from Apteco](https://github.com/Apteco/HelperScripts/tree/master/scripts/parquet).
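
As a minimal sketch of what this looks like (untested; the paths here are hypothetical, so adjust them to wherever the NuGet packages were unpacked):

```powershell
# ParquetSharpNative.dll must be loadable, e.g. placed in the same
# directory as ParquetSharp.dll or somewhere on PATH.
Add-Type -Path "C:\libs\ParquetSharp\ParquetSharp.dll"

# Open a Parquet file and print its row count.
$reader = New-Object ParquetSharp.ParquetFileReader -ArgumentList "data.parquet"
Write-Host "Rows: $($reader.FileMetaData.NumRows)"
$reader.Close()
```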

99 changes: 99 additions & 0 deletions docs/Reading.md
@@ -0,0 +1,99 @@
# Reading Parquet files

The low-level ParquetSharp API provides the `ParquetFileReader` class for reading Parquet files.
This is usually constructed from a file path, but may also be constructed from a `ManagedRandomAccessFile`,
which wraps a .NET `System.IO.Stream` that supports seeking.

```csharp
using var fileReader = new ParquetFileReader("data.parquet");
```
or
```csharp
using var input = new ManagedRandomAccessFile(File.OpenRead("data.parquet"));
using var fileReader = new ParquetFileReader(input);
```

The `FileMetaData` property of a `ParquetFileReader` exposes information about the Parquet file and its schema:
```csharp
int numColumns = fileReader.FileMetaData.NumColumns;
long numRows = fileReader.FileMetaData.NumRows;
int numRowGroups = fileReader.FileMetaData.NumRowGroups;
IReadOnlyDictionary<string, string> metadata = fileReader.FileMetaData.KeyValueMetadata;

SchemaDescriptor schema = fileReader.FileMetaData.Schema;
for (int columnIndex = 0; columnIndex < schema.NumColumns; ++columnIndex) {
    ColumnDescriptor column = schema.Column(columnIndex);
    string columnName = column.Name;
}
```

Parquet files store data in separate row groups, which all share the same schema,
so if you wish to read all data in a file, you generally want to loop over all of the row groups
and create a `RowGroupReader` for each one:

```csharp
for (int rowGroup = 0; rowGroup < fileReader.FileMetaData.NumRowGroups; ++rowGroup) {
    using var rowGroupReader = fileReader.RowGroup(rowGroup);
    long groupNumRows = rowGroupReader.MetaData.NumRows;
}
```

The `Column` method of `RowGroupReader` takes an integer column index and returns a `ColumnReader` object,
which can read primitive values from the column, as well as raw definition level and repetition level data.
Usually you will not want to use a `ColumnReader` directly, but instead call its `LogicalReader` method to
create a `LogicalColumnReader` that can read logical values.
There are two variations of this `LogicalReader` method; the plain `LogicalReader` method returns an abstract
`LogicalColumnReader`, whereas the generic `LogicalReader<TElement>` method returns a typed `LogicalColumnReader<TElement>`,
which reads values of the specified element type.

If you know ahead of time the data types for the columns you will read, you can simply use the generic methods and
read values directly. For example, to read data from the first column which represents a timestamp:

```csharp
int numRows = checked((int) rowGroupReader.MetaData.NumRows);
DateTime[] timestamps = rowGroupReader.Column(0).LogicalReader<DateTime>().ReadAll(numRows);
```

However, if you don't know ahead of time the types for each column, you can implement the
`ILogicalColumnReaderVisitor<TReturn>` interface to handle column data in a type-safe way, for example:

```csharp
sealed class ColumnPrinter : ILogicalColumnReaderVisitor<string>
{
    public string OnLogicalColumnReader<TElement>(LogicalColumnReader<TElement> columnReader)
    {
        var stringBuilder = new StringBuilder();
        foreach (var value in columnReader) {
            stringBuilder.Append(value?.ToString() ?? "null");
            stringBuilder.Append(",");
        }
        return stringBuilder.ToString();
    }
}

string columnValues = rowGroupReader.Column(0).LogicalReader().Apply(new ColumnPrinter());
```

There's a similar `IColumnReaderVisitor<TReturn>` interface for working with `ColumnReader` objects
and reading physical values in a type-safe way, but most users will want to work at the logical element level.
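
As an illustrative sketch (assuming a `rowGroupReader` as above, the analogous `Apply` method on `ColumnReader`, and the hypothetical `PhysicalTypeNamer` name), a physical-level visitor has the same shape:

```csharp
// Reports the .NET physical type used to store a column's values.
sealed class PhysicalTypeNamer : IColumnReaderVisitor<string>
{
    public string OnColumnReader<TValue>(ColumnReader<TValue> columnReader)
        where TValue : unmanaged
    {
        return typeof(TValue).Name;
    }
}

string physicalType = rowGroupReader.Column(0).Apply(new PhysicalTypeNamer());
```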

The `LogicalColumnReader<TElement>` class provides multiple ways to read data.
It implements `IEnumerable<TElement>` which internally buffers batches of data and iterates over them,
but for more fine-grained control over reading behaviour, you can read into your own buffer. For example:

```csharp
var buffer = new TElement[4096];

while (logicalColumnReader.HasNext)
{
    int numRead = logicalColumnReader.ReadBatch(buffer);

    for (int i = 0; i != numRead; ++i)
    {
        TElement value = buffer[i];
        // Use value
    }
}
```

The .NET type used to represent read values can optionally be overridden by using the `ColumnReader.LogicalReaderOverride<TElement>` method.
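For example, a minimal sketch that reads the first column as a custom `VolumeInDollars` struct (the one defined in the type factories documentation linked below), assuming a suitable read converter has been registered:

```csharp
using var customReader = rowGroupReader.Column(0).LogicalReaderOverride<VolumeInDollars>();
VolumeInDollars[] volumes = customReader.ReadAll(checked((int) rowGroupReader.MetaData.NumRows));
```
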
For more details, see the [type factories documentation](TypeFactories.md).
69 changes: 69 additions & 0 deletions docs/RowOriented.md
@@ -0,0 +1,69 @@
# Row-oriented API

The row-oriented API offers a convenient way to abstract the column-oriented nature of Parquet files
at the expense of memory, speed and flexibility.
It lets one write a whole row in a single call, often resulting in more readable code.

For example, writing a file with the row-oriented API and using a tuple to represent a row of values:
> **Contributor:** Might be worth saying something about the drawbacks?

```csharp
var rand = new Random();
var timestamps = new DateTime[] { /* ... */ };
var objectIds = new int[] { /* ... */ };
var values = timestamps.Select(t => objectIds.Select(o => (float) rand.NextDouble()).ToArray()).ToArray();
var columns = new[] {"Timestamp", "ObjectId", "Value"};

using var rowWriter = ParquetFile.CreateRowWriter<(DateTime, int, float)>("float_timeseries.parquet", columns);

for (int i = 0; i != timestamps.Length; ++i)
{
    for (int j = 0; j != objectIds.Length; ++j)
    {
        rowWriter.WriteRow((timestamps[i], objectIds[j], values[i][j]));
    }
}

// Write a new row group (pretend we have new timestamps, objectIds and values)
rowWriter.StartNewRowGroup();
for (int i = 0; i != timestamps.Length; ++i)
{
    for (int j = 0; j != objectIds.Length; ++j)
    {
        rowWriter.WriteRow((timestamps[i], objectIds[j], values[i][j]));
    }
}

rowWriter.Close();
```

Internally, ParquetSharp will build up a buffer of row values and then write each column when the file
is closed or a new row group is started.
This means all values in a row group must be stored in memory at once,
and the row values buffer must be resized and copied as it grows.
Therefore, it's recommended to use the lower-level column-oriented API if performance is a concern.

## Explicit column mapping

The row-oriented API lets you specify your own column mapping with the optional `MapToColumn` attribute, so that fields map to columns independently of their names and order.

```csharp
struct MyRow
{
    [MapToColumn("ColumnA")]
    public long MyKey;

    [MapToColumn("ColumnB")]
    public string MyValue;
}

using (var rowReader = ParquetFile.CreateRowReader<MyRow>("example.parquet"))
{
    for (int i = 0; i < rowReader.FileMetaData.NumRowGroups; ++i)
    {
        var values = rowReader.ReadRows(i);
        foreach (MyRow r in values)
        {
            Console.WriteLine(r.MyKey + "/" + r.MyValue);
        }
    }
}
```
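
The same attribute-based mapping applies when writing. A minimal sketch, reusing the `MyRow` struct above:

```csharp
using (var rowWriter = ParquetFile.CreateRowWriter<MyRow>("example.parquet"))
{
    // Column names are taken from the MapToColumn attributes ("ColumnA", "ColumnB").
    rowWriter.WriteRow(new MyRow { MyKey = 1, MyValue = "first" });
    rowWriter.WriteRow(new MyRow { MyKey = 2, MyValue = "second" });
    rowWriter.Close();
}
```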
14 changes: 7 additions & 7 deletions TypeFactories.md → docs/TypeFactories.md
@@ -1,12 +1,12 @@
# Type Factories

The ParquetSharp API exposes the logic that maps C# types (called "logical system types" by ParquetSharp, as per Parquet's LogicalType) to the actual Parquet physical types, as well as the converters associated with them.

This means that:
- a user can potentially read/write any type they want, as long as they provide a viable mapping,
- a user can override the default ParquetSharp mapping and change how existing C# types are handled.

## API

The API at the core of this is encompassed by `LogicalTypeFactory`, `LogicalReadConverterFactory` and `LogicalWriteConverterFactory`.

@@ -16,7 +16,7 @@ On top of that, if the custom type is used for creating the schema (when writing

In other words, the `LogicalTypeFactory` is required whenever the user provides a `Column` class with a custom type (writer only; the factory is needed to determine the physical Parquet type) or gets the `LogicalColumnReader/Writer` via the non-type-overriding methods (in which case the factory is needed to determine the full type of the logical column reader or writer). The corresponding converter factory is always needed.

## Examples

One of the approaches for reading custom values can be described by the following code.

@@ -26,16 +26,16 @@
using var columnReader = groupReader.Column(0).LogicalReaderOverride<VolumeInDollars>();

var values = columnReader.ReadAll(checked((int) groupReader.MetaData.NumRows));

/* ... */

[StructLayout(LayoutKind.Sequential)]
private readonly struct VolumeInDollars
{
    public VolumeInDollars(float value) { Value = value; }
    public readonly float Value;
}

private sealed class ReadConverterFactory : LogicalReadConverterFactory
{
    public override Delegate GetConverter<TLogical, TPhysical>(ColumnDescriptor columnDescriptor, ColumnChunkMetaData columnChunkMetaData)
@@ -46,4 +46,4 @@
}
```

But do check [TestLogicalTypeFactory.cs](../csharp.test/TestLogicalTypeFactory.cs) for a more comprehensive set of examples, as there are many places that can be customized and optimized by the user.