Restructure and expand documentation #257
Merged
@@ -0,0 +1,10 @@
# ParquetSharp in PowerShell

It's possible to use ParquetSharp from PowerShell.
You can install ParquetSharp with the [NuGet command line interface](https://docs.microsoft.com/en-us/nuget/reference/nuget-exe-cli-reference),
then use `Add-Type` to load `ParquetSharp.dll`.
However, you must ensure that the appropriate `ParquetSharpNative.dll` for your architecture and OS can be loaded as required,
either by putting it somewhere in your `PATH` or in the same directory as `ParquetSharp.dll`.
For examples of how to use ParquetSharp from PowerShell,
see [these scripts from Apteco](https://github.com/Apteco/HelperScripts/tree/master/scripts/parquet).

@@ -0,0 +1,99 @@
# Reading Parquet files

The low-level ParquetSharp API provides the `ParquetFileReader` class for reading Parquet files.
This is usually constructed from a file path, but may also be constructed from a `ManagedRandomAccessFile`,
which wraps a .NET `System.IO.Stream` that supports seeking.

```csharp
using var fileReader = new ParquetFileReader("data.parquet");
```
or
```csharp
using var input = new ManagedRandomAccessFile(File.OpenRead("data.parquet"));
using var fileReader = new ParquetFileReader(input);
```
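
As a minimal sketch of the stream-based constructor, the round trip below writes a small file into a `MemoryStream` and reads it back through `ManagedRandomAccessFile`, without touching disk. The class name `InMemoryExample`, its method names and the `Id` column are made up for illustration. The writer calls (`ParquetFileWriter`, `ManagedOutputStream`) are not covered in this section and are used here only to produce some bytes to read.

```csharp
using System.IO;
using ParquetSharp;
using ParquetSharp.IO;

public static class InMemoryExample
{
    // Writes a single Int32 column named "Id" (a hypothetical name) to a byte array.
    public static byte[] WriteToBytes(int[] ids)
    {
        using var stream = new MemoryStream();
        using var output = new ManagedOutputStream(stream);
        using var fileWriter = new ParquetFileWriter(output, new Column[] { new Column<int>("Id") });
        using var rowGroup = fileWriter.AppendRowGroup();
        using (var idWriter = rowGroup.NextColumn().LogicalWriter<int>())
        {
            idWriter.WriteBatch(ids);
        }
        // Close explicitly so the footer is written before the bytes are copied out.
        fileWriter.Close();
        return stream.ToArray();
    }

    // Opens the bytes through a seekable stream and returns the total row count.
    public static long CountRows(byte[] parquetBytes)
    {
        using var input = new ManagedRandomAccessFile(new MemoryStream(parquetBytes));
        using var fileReader = new ParquetFileReader(input);
        return fileReader.FileMetaData.NumRows;
    }
}
```

For instance, `InMemoryExample.CountRows(InMemoryExample.WriteToBytes(new[] { 1, 2, 3 }))` reads back the row count of the file that was just written.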

The `FileMetaData` property of a `ParquetFileReader` exposes information about the Parquet file and its schema:
```csharp
int numColumns = fileReader.FileMetaData.NumColumns;
long numRows = fileReader.FileMetaData.NumRows;
int numRowGroups = fileReader.FileMetaData.NumRowGroups;
IReadOnlyDictionary<string, string> metadata = fileReader.FileMetaData.KeyValueMetadata;

SchemaDescriptor schema = fileReader.FileMetaData.Schema;
for (int columnIndex = 0; columnIndex < schema.NumColumns; ++columnIndex) {
    ColumnDescriptor column = schema.Column(columnIndex);
    string columnName = column.Name;
}
```
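
The schema loop above can be wrapped into a small helper that lists each column together with its physical storage type. This is only a sketch: `SchemaExample` and `DescribeColumns` are hypothetical names, and the output format is an arbitrary choice.

```csharp
using System.Collections.Generic;
using ParquetSharp;

public static class SchemaExample
{
    // Returns one "name: physical type" line per leaf column in the file.
    public static List<string> DescribeColumns(string path)
    {
        using var fileReader = new ParquetFileReader(path);
        SchemaDescriptor schema = fileReader.FileMetaData.Schema;

        var lines = new List<string>();
        for (int columnIndex = 0; columnIndex < schema.NumColumns; ++columnIndex)
        {
            ColumnDescriptor column = schema.Column(columnIndex);
            lines.Add($"{column.Name}: {column.PhysicalType}");
        }
        return lines;
    }
}
```

Inspecting the physical types this way can help when deciding which element type to request from a column reader later on.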

Parquet files store data in separate row groups, which all share the same schema,
so if you wish to read all data in a file, you generally want to loop over all of the row groups
and create a `RowGroupReader` for each one:

```csharp
for (int rowGroup = 0; rowGroup < fileReader.FileMetaData.NumRowGroups; ++rowGroup) {
    using var rowGroupReader = fileReader.RowGroup(rowGroup);
    long groupNumRows = rowGroupReader.MetaData.NumRows;
}
```

The `Column` method of `RowGroupReader` takes an integer column index and returns a `ColumnReader` object,
which can read primitive values from the column, as well as raw definition level and repetition level data.
Usually you will not want to use a `ColumnReader` directly, but instead call its `LogicalReader` method to
create a `LogicalColumnReader` that can read logical values.
There are two variations of this `LogicalReader` method; the plain `LogicalReader` method returns an abstract
`LogicalColumnReader`, whereas the generic `LogicalReader<TElement>` method returns a typed `LogicalColumnReader<TElement>`,
which reads values of the specified element type.

If you know ahead of time the data types for the columns you will read, you can simply use the generic methods and
read values directly. For example, to read data from the first column, which holds timestamps:

```csharp
// ReadAll takes an int row count, so convert the row group's long row count with overflow checking
int numRowsInGroup = checked((int) rowGroupReader.MetaData.NumRows);
DateTime[] timestamps = rowGroupReader.Column(0).LogicalReader<DateTime>().ReadAll(numRowsInGroup);
```
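
Putting the row group loop and the typed reader together, a whole column can be collected across every row group. The following is a sketch rather than a recommended pattern: `ReadColumnExample` and `ReadColumn` are hypothetical names, and the caller is assumed to pick a `T` that matches the column's logical type (for example `int?` rather than `int` for a nullable column).

```csharp
using System.Collections.Generic;
using ParquetSharp;

public static class ReadColumnExample
{
    // Reads every value of one column, visiting the row groups in file order.
    public static List<T> ReadColumn<T>(string path, int columnIndex)
    {
        using var fileReader = new ParquetFileReader(path);
        var values = new List<T>();

        for (int rowGroup = 0; rowGroup < fileReader.FileMetaData.NumRowGroups; ++rowGroup)
        {
            using var rowGroupReader = fileReader.RowGroup(rowGroup);
            // Row counts are reported as long, but ReadAll expects an int.
            int numRowsInGroup = checked((int) rowGroupReader.MetaData.NumRows);

            using var columnReader = rowGroupReader.Column(columnIndex).LogicalReader<T>();
            values.AddRange(columnReader.ReadAll(numRowsInGroup));
        }
        return values;
    }
}
```

Because each row group is read in full before moving on, peak memory scales with the largest row group plus the accumulated result list.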

However, if you don't know ahead of time the types for each column, you can implement the
`ILogicalColumnReaderVisitor<TReturn>` interface to handle column data in a type-safe way, for example:

```csharp
sealed class ColumnPrinter : ILogicalColumnReaderVisitor<string>
{
    public string OnLogicalColumnReader<TElement>(LogicalColumnReader<TElement> columnReader)
    {
        var stringBuilder = new StringBuilder();
        foreach (var value in columnReader) {
            stringBuilder.Append(value?.ToString() ?? "null");
            stringBuilder.Append(",");
        }
        return stringBuilder.ToString();
    }
}

string columnValues = rowGroupReader.Column(0).LogicalReader().Apply(new ColumnPrinter());
```
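
The same visitor pattern works for any per-column computation where the element type is not known up front. As an illustrative sketch, the visitor below counts null values, and a helper applies it to every column of every row group. `NullCounter`, `NullCountExample` and `CountNulls` are hypothetical names introduced for this example.

```csharp
using System.Collections.Generic;
using ParquetSharp;

// Counts null values in a column, whatever its element type turns out to be.
sealed class NullCounter : ILogicalColumnReaderVisitor<long>
{
    public long OnLogicalColumnReader<TElement>(LogicalColumnReader<TElement> columnReader)
    {
        long nulls = 0;
        foreach (TElement value in columnReader)
        {
            if (value == null) ++nulls;
        }
        return nulls;
    }
}

public static class NullCountExample
{
    // Returns the number of nulls per column name, summed over all row groups.
    public static Dictionary<string, long> CountNulls(string path)
    {
        using var fileReader = new ParquetFileReader(path);
        SchemaDescriptor schema = fileReader.FileMetaData.Schema;
        var visitor = new NullCounter();
        var counts = new Dictionary<string, long>();

        for (int rowGroup = 0; rowGroup < fileReader.FileMetaData.NumRowGroups; ++rowGroup)
        {
            using var rowGroupReader = fileReader.RowGroup(rowGroup);
            for (int columnIndex = 0; columnIndex < schema.NumColumns; ++columnIndex)
            {
                string name = schema.Column(columnIndex).Name;
                using var reader = rowGroupReader.Column(columnIndex).LogicalReader();
                counts.TryGetValue(name, out long soFar);
                counts[name] = soFar + reader.Apply(visitor);
            }
        }
        return counts;
    }
}
```

For non-nullable value types such as `int`, the `value == null` comparison is always false, so required columns simply report zero.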

There's a similar `IColumnReaderVisitor<TReturn>` interface for working with `ColumnReader` objects
and reading physical values in a type-safe way, but most users will want to work at the logical element level.

The `LogicalColumnReader<TElement>` class provides multiple ways to read data.
It implements `IEnumerable<TElement>`, buffering batches of data internally as you iterate over it,
but for more fine-grained control over reading behaviour, you can read into your own buffer. For example:

```csharp
var buffer = new TElement[4096];

while (logicalColumnReader.HasNext)
{
    int numRead = logicalColumnReader.ReadBatch(buffer);

    for (int i = 0; i != numRead; ++i)
    {
        TElement value = buffer[i];
        // Use value
    }
}
```
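
To make the batch loop concrete, here is a sketch that sums a column of `double` values while reusing one fixed-size buffer, so memory use does not grow with the size of the row group. `BatchSumExample` and `SumColumn` are hypothetical names, the buffer size of 1024 is arbitrary, and the column is assumed to be a required (non-nullable) double column.

```csharp
using ParquetSharp;

public static class BatchSumExample
{
    // Sums a required double column across all row groups using a reusable buffer.
    public static double SumColumn(string path, int columnIndex)
    {
        using var fileReader = new ParquetFileReader(path);
        var buffer = new double[1024];
        double total = 0;

        for (int rowGroup = 0; rowGroup < fileReader.FileMetaData.NumRowGroups; ++rowGroup)
        {
            using var rowGroupReader = fileReader.RowGroup(rowGroup);
            using var reader = rowGroupReader.Column(columnIndex).LogicalReader<double>();

            while (reader.HasNext)
            {
                // ReadBatch fills at most buffer.Length values and returns how many were read.
                int numRead = reader.ReadBatch(buffer);
                for (int i = 0; i < numRead; ++i)
                {
                    total += buffer[i];
                }
            }
        }
        return total;
    }
}
```

Only the first `numRead` entries of the buffer are valid after each call; the rest may hold values left over from an earlier batch.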

The .NET type used to represent read values can optionally be overridden by using the `ColumnReader.LogicalReaderOverride<TElement>` method.
For more details, see the [type factories documentation](TypeFactories.md).

@@ -0,0 +1,69 @@
# Row-oriented API

The row-oriented API offers a convenient way to abstract the column-oriented nature of Parquet files
at the expense of memory, speed and flexibility.
It lets one write a whole row in a single call, often resulting in more readable code.

For example, the following writes a file with the row-oriented API, using a tuple to represent each row of values:

```csharp
var timestamps = new DateTime[] { /* ... */ };
var objectIds = new int[] { /* ... */ };
// Random placeholder values, one per (timestamp, objectId) pair
var rand = new Random();
var values = timestamps.Select(t => objectIds.Select(o => (float) rand.NextDouble()).ToArray()).ToArray();
var columns = new[] {"Timestamp", "ObjectId", "Value"};

using var rowWriter = ParquetFile.CreateRowWriter<(DateTime, int, float)>("float_timeseries.parquet", columns);

for (int i = 0; i != timestamps.Length; ++i)
{
    for (int j = 0; j != objectIds.Length; ++j)
    {
        rowWriter.WriteRow((timestamps[i], objectIds[j], values[i][j]));
    }
}

// Write a new row group (pretend we have new timestamps, objectIds and values)
rowWriter.StartNewRowGroup();
for (int i = 0; i != timestamps.Length; ++i)
{
    for (int j = 0; j != objectIds.Length; ++j)
    {
        rowWriter.WriteRow((timestamps[i], objectIds[j], values[i][j]));
    }
}

rowWriter.Close();
```

Internally, ParquetSharp will build up a buffer of row values and then write each column when the file
is closed or a new row group is started.
This means all values in a row group must be stored in memory at once,
and the row values buffer must be resized and copied as it grows.
Therefore, it's recommended to use the lower-level column-oriented API if performance is a concern.
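
For comparison, a rough sketch of the same three-column layout written with the column-oriented API is shown below. It assumes the caller has already flattened the data into one array per column with one entry per row; `ColumnWriterExample` and `Write` are hypothetical names, and only a single row group is written.

```csharp
using System;
using ParquetSharp;

public static class ColumnWriterExample
{
    // Writes one row group, passing each column's values in a single batch.
    public static void Write(string path, DateTime[] timestamps, int[] objectIds, float[] values)
    {
        var columns = new Column[]
        {
            new Column<DateTime>("Timestamp"),
            new Column<int>("ObjectId"),
            new Column<float>("Value")
        };

        using var fileWriter = new ParquetFileWriter(path, columns);
        using var rowGroup = fileWriter.AppendRowGroup();

        // Columns must be written in schema order, one after another.
        using (var timestampWriter = rowGroup.NextColumn().LogicalWriter<DateTime>())
        {
            timestampWriter.WriteBatch(timestamps);
        }
        using (var objectIdWriter = rowGroup.NextColumn().LogicalWriter<int>())
        {
            objectIdWriter.WriteBatch(objectIds);
        }
        using (var valueWriter = rowGroup.NextColumn().LogicalWriter<float>())
        {
            valueWriter.WriteBatch(values);
        }

        fileWriter.Close();
    }
}
```

Here the caller owns the column arrays, so ParquetSharp does not need to grow an intermediate row buffer before writing.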

## Explicit column mapping

The row-oriented API allows for specifying your own name-independent/order-independent column mapping using the optional `MapToColumn` attribute.

```csharp
struct MyRow
{
    [MapToColumn("ColumnA")]
    public long MyKey;

    [MapToColumn("ColumnB")]
    public string MyValue;
}

using (var rowReader = ParquetFile.CreateRowReader<MyRow>("example.parquet"))
{
    for (int i = 0; i < rowReader.FileMetaData.NumRowGroups; ++i)
    {
        var values = rowReader.ReadRows(i);
        foreach (MyRow r in values)
        {
            Console.WriteLine(r.MyKey + "/" + r.MyValue);
        }
    }
}
```
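
To illustrate the order-independence, the sketch below writes a file whose columns are stored as `ColumnB` then `ColumnA`, and reads it back into a struct that declares its fields in the opposite order. `MappedRow`, `ColumnMappingExample` and `RoundTrip` are hypothetical names, and the file is written from tuples purely to control the on-disk column order.

```csharp
using ParquetSharp.RowOriented;

public struct MappedRow
{
    [MapToColumn("ColumnA")]
    public long MyKey;

    [MapToColumn("ColumnB")]
    public string MyValue;
}

public static class ColumnMappingExample
{
    // Writes columns in the order (ColumnB, ColumnA), then reads them back by name.
    public static MappedRow[] RoundTrip(string path)
    {
        using (var rowWriter = ParquetFile.CreateRowWriter<(string, long)>(path, new[] { "ColumnB", "ColumnA" }))
        {
            rowWriter.WriteRow(("first", 1L));
            rowWriter.WriteRow(("second", 2L));
            rowWriter.Close();
        }

        using var rowReader = ParquetFile.CreateRowReader<MappedRow>(path);
        // All rows were written to a single row group, so index 0 holds everything.
        return rowReader.ReadRows(0);
    }
}
```

Each field is filled from the column named in its attribute, so `MyKey` receives the `ColumnA` values even though that column comes second in the file.
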
Might be worth saying something about the drawbacks?