Docs update #53

Merged
merged 3 commits into from
Mar 16, 2022
35 changes: 35 additions & 0 deletions DEVELOP.md
@@ -0,0 +1,35 @@
# Development

- Install [wasm-pack](https://rustwasm.github.io/wasm-pack/)
- Compile: `wasm-pack build`, or change targets, e.g. `wasm-pack build --target nodejs`
- Publish: `wasm-pack publish`

### macOS

Some steps may need extra configuration on macOS. The default `clang` shipped with Macs (as of March 2022) does not support compiling to WebAssembly out of the box. To build ZSTD, you may need to install a newer LLVM via Homebrew and update your paths so that the correct executables are found.

```
brew install llvm
export PATH="/usr/local/opt/llvm/bin/:$PATH"
export CC=/usr/local/opt/llvm/bin/clang
export AR=/usr/local/opt/llvm/bin/llvm-ar
```

See [this description](https://github.com/kylebarron/parquet-wasm/pull/2#issue-1159174043) and its references for more info.

## Publishing

`wasm-pack` supports [three different targets](https://rustwasm.github.io/docs/wasm-pack/commands/build.html#target):

- `bundler` (used with bundlers like Webpack)
- `nodejs` (used with Node, supports `require`)
- `web` (used as an ES module directly from the web)

There are good reasons to distribute as any of these... so why not distribute as all three? `wasm-pack` doesn't support this directly, but the build script in `scripts/build.sh` calls `wasm-pack` three times and merges the outputs. This means that bundler users can use the default entry point, Node users can import `parquet-wasm/node`, and ES module users can import `parquet-wasm/web`.

To publish:

```
bash ./scripts/build.sh
wasm-pack publish
```
208 changes: 147 additions & 61 deletions README.md
@@ -1,10 +1,10 @@
# `parquet-wasm`

WebAssembly bindings to read and write the Parquet format to Apache Arrow.
WebAssembly bindings to read and write the [Apache Parquet](https://parquet.apache.org/) format to and from [Apache Arrow](https://arrow.apache.org/).

This is designed to be used alongside a JavaScript [Arrow](https://arrow.apache.org/) implementation, such as the canonical [JS Arrow library](https://arrow.apache.org/docs/js/) or potentially [`arrow-wasm`](https://github.com/domoritz/arrow-wasm).
This is designed to be used alongside a JavaScript Arrow implementation, such as the canonical [JS Arrow library](https://arrow.apache.org/docs/js/).

Including all compression codecs, the generated brotli-encoded WASM bundle is 881KB.
Including all compression codecs, the brotli-encoded WASM bundle is 881KB.

## Install

@@ -18,61 +18,133 @@ npm install parquet-wasm

## API

### `readParquet`
### Two APIs?

These bindings expose _two_ APIs to users because there are _two separate implementations_ of Parquet and Arrow in Rust.

- [`parquet`](https://crates.io/crates/parquet) and [`arrow`](https://crates.io/crates/arrow): These are the "official" Rust implementations of Arrow and Parquet. These projects started earlier and may be more feature complete.
- [`parquet2`](https://crates.io/crates/parquet2) and [`arrow2`](https://crates.io/crates/arrow2): These are safer (in terms of memory access) and claim to be faster, though I haven't written my own benchmarks yet.

Since these parallel projects exist, why not give the user the choice of which to use? In general, the reading API is identical in both; the write options, however, differ between the two projects.

### Choice of bundles

Presumably no one wants to use both `parquet` and `parquet2` at once, so the default bundles split them into separate entry points to keep bundle size as small as possible. The following table describes the six available bundles:

| Entry point | Rust crates used | Description |
| ---------------------------- | ----------------------- | ------------------------------------------------------- |
| `parquet-wasm` | `parquet` and `arrow` | "Bundler" build, to be used in bundlers such as Webpack |
| `parquet-wasm/node` | `parquet` and `arrow` | Node build, to be used with `require` in NodeJS |
| `parquet-wasm/web` | `parquet` and `arrow` | ESM, to be used directly from the Web as an ES Module |
| | | |
| `parquet-wasm/parquet_wasm2` | `parquet2` and `arrow2` | "Bundler" build, to be used in bundlers such as Webpack |
| `parquet-wasm/node2` | `parquet2` and `arrow2` | Node build, to be used with `require` in NodeJS |
| `parquet-wasm/web2` | `parquet2` and `arrow2` | ESM, to be used directly from the Web as an ES Module |

Note that when using the `/web` and `/web2` bundles, the default export must be awaited. See [here](https://rustwasm.github.io/docs/wasm-bindgen/examples/without-a-bundler.html) for an example.
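
As a rough illustration of the table above, the mapping from environment and Rust implementation to entry point can be written as a small lookup. The helper names `entryPoint` and `needsInit`, and the use of `arrow1`/`arrow2` as lookup keys, are assumptions made for this sketch; none of them are exported by `parquet-wasm`.

```js
// Hypothetical lookup table mirroring the six entry points listed above.
const ENTRY_POINTS = {
  arrow1: {
    bundler: "parquet-wasm",
    node: "parquet-wasm/node",
    web: "parquet-wasm/web",
  },
  arrow2: {
    bundler: "parquet-wasm/parquet_wasm2",
    node: "parquet-wasm/node2",
    web: "parquet-wasm/web2",
  },
};

// Returns the import specifier for an environment ("bundler", "node", "web")
// and an implementation ("arrow1" or "arrow2").
function entryPoint(environment, implementation = "arrow1") {
  const byEnvironment = ENTRY_POINTS[implementation];
  const known =
    byEnvironment !== undefined &&
    Object.prototype.hasOwnProperty.call(byEnvironment, environment);
  if (!known) {
    throw new Error(`Unknown combination: ${implementation}/${environment}`);
  }
  return byEnvironment[environment];
}

// Only the web builds need their default export awaited before use.
function needsInit(environment) {
  return environment === "web";
}

console.log(entryPoint("node", "arrow2"), needsInit("node"));
```

A lookup like this keeps the choice of bundle in one place, which may help if an application has to run in more than one environment.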

### `parquet` API

This implementation uses the [`arrow`](https://crates.io/crates/arrow) and [`parquet`](https://crates.io/crates/parquet) Rust crates.

#### `readParquet`

`readParquet(parquet_file: Uint8Array): Uint8Array`

Takes as input a `Uint8Array` containing bytes from a loaded Parquet file. Returns a `Uint8Array` with data in [Arrow IPC **Stream** format](https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format) [^0]. To parse this into an Arrow table, use `arrow.tableFromIPC` in the JS bindings on the result from `readParquet`.
Takes as input a `Uint8Array` containing bytes from a loaded Parquet file. Returns a `Uint8Array` with data in [Arrow IPC **Stream** format](https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format). To parse this into an Arrow table, pass the result of `readParquet` to `arrow.tableFromIPC` in the JS bindings.

[^0]: I originally decoded Parquet files to the Arrow IPC File format, but Arrow JS occasionally produced bugs such as `Error: Expected to read 1901288 metadata bytes, but only read 644` when parsing using `arrow.tableFromIPC`. When testing the same buffer in Pyarrow, `pa.ipc.open_file` succeeded but `pa.ipc.open_stream` failed, leading me to believe that the Arrow JS implementation has some bugs to decide when `arrow.tableFromIPC` should internally use the `RecordBatchStreamReader` vs the `RecordBatchFileReader`.
#### `writeParquet`

### `writeParquet`
`writeParquet(arrow_file: Uint8Array, writer_properties: WriterProperties): Uint8Array`

`writeParquet(arrow_file: Uint8Array): Uint8Array`
Takes as input a `Uint8Array` containing bytes in [Arrow IPC **Stream** format](https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format). If you have an Arrow table, call `arrow.tableToIPC(table, 'stream')` and pass the result to `writeParquet`.

Takes as input a `Uint8Array` containing bytes in [Arrow IPC **File** format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format) [^1]. If you have an Arrow table, call `arrow.tableToIPC(table, 'file')` and pass the result to `writeParquet`.
The second argument must be an instance of `WriterProperties`, which can be created by calling `new WriterPropertiesBuilder().build()`.

[^1]: I'm not great at Rust and the IPC File format seemed easier to parse in Rust than the IPC Stream format :slightly_smiling_face:.
#### `WriterPropertiesBuilder`

For the initial release, `writeParquet` is hard-coded to use Snappy compression and Plain encoding. In the future these should be made configurable.
A class for building the configuration used when writing a Parquet file.

### `setPanicHook`
For example, to create a writing configuration with Snappy compression:

`setPanicHook(): void`
```js
import {
  WriterPropertiesBuilder,
  Compression,
  writeParquet,
} from "parquet-wasm";

const writerProperties = new WriterPropertiesBuilder()
  .setCompression(Compression.SNAPPY)
  .build();
// Placeholder input: pass real Arrow IPC Stream bytes here.
writeParquet(new Uint8Array(), writerProperties);
```

Sets [`console_error_panic_hook`](https://github.com/rustwasm/console_error_panic_hook) in Rust, which provides better debugging of panics by having more informative `console.error` messages. Initialize this first if you're getting errors such as `RuntimeError: Unreachable executed`.
### `parquet2` API

#### `readParquet2`

## Using
`readParquet2(parquet_file: Uint8Array): Uint8Array`

`parquet-wasm` is distributed with three bindings for use in different environments.
Takes as input a `Uint8Array` containing bytes from a loaded Parquet file. Returns a `Uint8Array` with data in [Arrow IPC **Stream** format](https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format). To parse this into an Arrow table, pass the result of `readParquet2` to `arrow.tableFromIPC` in the JS bindings.

- Default, to be used in bundlers such as Webpack: `import * as parquet from 'parquet-wasm'`
- Node, to be used with `require` in NodeJS: `const parquet = require('parquet-wasm/node');`
- ESM, to be used directly from the Web as an ES Module: `import * as parquet from 'parquet-wasm/web';`
#### `writeParquet2`

`writeParquet2(arrow_file: Uint8Array, writer_properties: WriterProperties): Uint8Array`

Takes as input a `Uint8Array` containing bytes in [Arrow IPC **File** format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format) [^1]. If you have an Arrow table, call `arrow.tableToIPC(table, 'file')` and pass the result to `writeParquet2`.

[^1]: I'm not great at Rust and the IPC File format seemed easier to parse in Rust than the IPC Stream format :slightly_smiling_face:. Hopefully this function will standardize on the Stream format in the future.

For example, to create a writing configuration with Snappy compression:

```js
import {
  WriterPropertiesBuilder,
  Compression,
  writeParquet2,
} from "parquet-wasm";

const writerProperties = new WriterPropertiesBuilder()
  .setCompression(Compression.SNAPPY)
  .build();
// Placeholder input: pass real Arrow IPC File bytes here.
writeParquet2(new Uint8Array(), writerProperties);
```

### Utilities

#### `setPanicHook`

`setPanicHook(): void`

Sets [`console_error_panic_hook`](https://github.com/rustwasm/console_error_panic_hook) in Rust, which provides better debugging of panics by having more informative `console.error` messages. Initialize this first if you're getting errors such as `RuntimeError: Unreachable executed`.

The WASM bundle must be compiled with the `console_error_panic_hook` feature for this function to exist.
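
Because `setPanicHook` is only present in some builds, calling code may want to guard the call. A minimal sketch, assuming `parquetModule` is whatever object you imported `parquet-wasm` as; the helper name `enablePanicHook` is hypothetical, and the two stand-in objects below only imitate builds with and without the function:

```js
// Calls setPanicHook when the build exposes it; reports whether it did.
function enablePanicHook(parquetModule) {
  if (typeof parquetModule.setPanicHook === "function") {
    parquetModule.setPanicHook();
    return true;
  }
  return false;
}

// Stand-in objects for demonstration; in real code pass the imported module.
const buildWithHook = { setPanicHook: () => {} };
const buildWithoutHook = {};
console.log(enablePanicHook(buildWithHook), enablePanicHook(buildWithoutHook));
```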

## Example

```js
import {tableFromArrays, tableFromIPC, tableToIPC} from 'apache-arrow';
import {readParquet, writeParquet} from "parquet-wasm";
import { tableFromArrays, tableFromIPC, tableToIPC } from "apache-arrow";
import { readParquet, writeParquet } from "parquet-wasm";

// Create Arrow Table in JS
const LENGTH = 2000;
const rainAmounts = Float32Array.from(
{ length: LENGTH },
() => Number((Math.random() * 20).toFixed(1)));
const rainAmounts = Float32Array.from({ length: LENGTH }, () =>
Number((Math.random() * 20).toFixed(1))
);

const rainDates = Array.from(
{ length: LENGTH },
(_, i) => new Date(Date.now() - 1000 * 60 * 60 * 24 * i));
{ length: LENGTH },
(_, i) => new Date(Date.now() - 1000 * 60 * 60 * 24 * i)
);

const rainfall = tableFromArrays({
precipitation: rainAmounts,
date: rainDates
precipitation: rainAmounts,
date: rainDates,
});

// Write Arrow Table to Parquet
const parquetBuffer = writeParquet(tableToIPC(rainfall, 'file'));
const parquetBuffer = writeParquet(tableToIPC(rainfall, "stream"));

// Read Parquet buffer back to Arrow Table
const table = tableFromIPC(readParquet(parquetBuffer));
@@ -88,52 +160,66 @@ The Parquet specification permits several compression codecs. This library curre
- [x] Snappy
- [x] Gzip
- [x] Brotli
- [x] ZSTD
- [ ] LZ4
- [ ] ZSTD. Will be supported using the next versions of the upstream packages `parquet` and `parquet2`.
- [ ] LZ4. Work is progressing but no support yet.

LZ4 compression appears not to work yet. When trying to parse a file with LZ4 compression I get an error: `Uncaught (in promise) External format error: underlying IO error: WrongMagicNumber`.
## Custom builds

## Future work
In some cases, you may know ahead of time that your Parquet files will only include a single compression codec, say Snappy, or even no compression at all. In these cases, you may want to create a custom build of `parquet-wasm` to keep bundle size at a minimum. If you install the Rust toolchain and `wasm-pack` (see [DEVELOP.md](DEVELOP.md)), you can create a custom build with only the compression codecs you require.

- [ ] Tests :smile:
- [ ] User-specified column-specific encodings when writing
- [ ] User-specified compression codec when writing
### Example custom builds

## Development

- Install [wasm-pack](https://rustwasm.github.io/wasm-pack/)
- Compile: `wasm-pack build`, or change targets, e.g. `wasm-pack build --target nodejs`
- Publish `wasm-pack publish`.

### MacOS

Some steps may need a specific configuration if run on MacOS. Specifically, the default `clang` shipped with Macs (as of March 2022) doesn't have WebAssembly compilation supported out of the box. To build ZSTD, you may need to install a later version via Homebrew and update your paths to find the correct executables.
Reader-only bundle with Snappy compression using the `arrow` and `parquet` crates:

```
brew install llvm
export PATH="/usr/local/opt/llvm/bin/:$PATH"
export CC=/usr/local/opt/llvm/bin/clang
export AR=/usr/local/opt/llvm/bin/llvm-ar
wasm-pack build --no-default-features --features arrow1 --features parquet/snap --features reader
```

See [this description](https://github.com/kylebarron/parquet-wasm/pull/2#issue-1159174043) and its references for more info.
Writer-only bundle with no compression support using the `arrow2` and `parquet2` crates, targeting Node:

### Publishing
```
wasm-pack build --target nodejs --no-default-features --features arrow2 --features writer
```

`wasm-pack` supports [three different targets](https://rustwasm.github.io/docs/wasm-pack/commands/build.html#target):
Debug bundle with reader and writer support, targeting Node, using `arrow` and `parquet` crates with all their supported compressions, with `console_error_panic_hook` enabled:

- `bundler` (used with bundlers like Webpack)
- `nodejs` (used with Node, supports `require`)
- `web` (used as an ES module directly from the web)
```bash
wasm-pack build --dev --target nodejs \
--no-default-features --features arrow1 \
--features reader --features writer \
--features parquet_supported_compressions \
--features console_error_panic_hook
# Or, since the default features already include several of these, a shorter version:
wasm-pack build --dev --target nodejs --features console_error_panic_hook
```

There are good reasons to distribute as any of these... so why not distribute as all three? `wasm-pack` doesn't support this directly but the build script in `scripts/build.sh` calls `wasm-pack` three times and merges the outputs. This means that bundler users can use the default, Node users can use `parquet-wasm/node` and ES Modules users can use `parquet-wasm/web` in their imports.
Refer to the [`wasm-pack` documentation](https://rustwasm.github.io/docs/wasm-pack/commands/build.html) for more info on flags such as `--release`, `--dev`, and `--target`, and to the [Cargo documentation](https://doc.rust-lang.org/cargo/reference/features.html) for more info on how to use features.

### Available features

- `arrow1`: Use the `arrow` and `parquet` crates
- `arrow2`: Use the `arrow2` and `parquet2` crates
- `reader`: Activate read support.
- `writer`: Activate write support.
- `parquet_supported_compressions`: Activate all supported compressions for the `parquet` crate
- `parquet2_supported_compressions`: Activate all supported compressions for the `parquet2` crate
- `parquet` compression features. Should only be activated when `arrow1` is activated.
  - `parquet/brotli`: Activate Brotli compression in the `parquet` crate.
  - `parquet/flate2`: Activate Gzip compression in the `parquet` crate.
  - `parquet/snap`: Activate Snappy compression in the `parquet` crate.
  - ~~`parquet/lz4`~~: ~~Activate LZ4 compression in the `parquet` crate.~~ WASM-compatible version not yet implemented in the `parquet` crate.
  - ~~`parquet/zstd`~~: ~~Activate ZSTD compression in the `parquet` crate.~~ ZSTD should work in parquet's next release, pending https://github.com/apache/arrow-rs/pull/1414
- `parquet2` compression features. Should only be activated when `arrow2` is activated.
  - `parquet2/brotli`: Activate Brotli compression in the `parquet2` crate.
  - `parquet2/gzip`: Activate Gzip compression in the `parquet2` crate.
  - `parquet2/snappy`: Activate Snappy compression in the `parquet2` crate.
  - ~~`parquet2/lz4`~~: ~~Activate LZ4 compression in the `parquet2` crate.~~ WASM-compatible version not yet implemented, pending https://github.com/jorgecarleitao/parquet2/pull/91
  - ~~`parquet2/zstd`~~: ~~Activate ZSTD compression in the `parquet2` crate.~~ ZSTD should work in parquet2's next release.
- `console_error_panic_hook`: Expose the `setPanicHook` function for better error messages for Rust panics.
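
To make the feature rules above concrete, here is a small sketch that assembles a `wasm-pack build` command line from a list of features. The helper `wasmPackBuildCommand` is hypothetical and is not part of `parquet-wasm` or `wasm-pack`; it only uses the flags and feature names documented in this section, and it always passes `--no-default-features` on the assumption that a custom build wants to opt in to each feature explicitly.

```js
// Hypothetical helper: builds a `wasm-pack build` command string and rejects
// compression features that do not match the selected crates.
function wasmPackBuildCommand({ target, dev = false, features = [] } = {}) {
  for (const feature of features) {
    if (feature.startsWith("parquet/") && !features.includes("arrow1")) {
      throw new Error(`${feature} requires the arrow1 feature`);
    }
    if (feature.startsWith("parquet2/") && !features.includes("arrow2")) {
      throw new Error(`${feature} requires the arrow2 feature`);
    }
  }

  const args = ["wasm-pack", "build"];
  if (dev) args.push("--dev");
  if (target) args.push("--target", target);
  // Start from an empty feature set, then opt in feature by feature.
  args.push("--no-default-features");
  for (const feature of features) args.push("--features", feature);
  return args.join(" ");
}

// Reader-only bundle with Snappy support, as in the first example above.
console.log(
  wasmPackBuildCommand({ features: ["arrow1", "parquet/snap", "reader"] })
);
```

Generating the command this way is optional; typing the flags by hand works just as well. The check simply catches the case where a `parquet/...` or `parquet2/...` feature is requested without the matching `arrow1` or `arrow2` feature.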

To publish:
## Future work

```
bash ./scripts/build.sh
wasm-pack publish
```
- [ ] More tests :smile:

## Acknowledgements

1 change: 1 addition & 0 deletions package.json
@@ -1,5 +1,6 @@
{
"scripts": {
"build": "bash ./scripts/build.sh",
"test": "ts-node node_modules/tape/bin/tape ./tests/js/index.ts"
},
"devDependencies": {
2 changes: 2 additions & 0 deletions src/arrow2/mod.rs
@@ -5,4 +5,6 @@ pub mod wasm;

#[cfg(feature = "writer")]
pub mod writer;

#[cfg(feature = "writer")]
pub mod writer_properties;
1 change: 1 addition & 0 deletions src/common/mod.rs
@@ -1 +1,2 @@
#[cfg(feature = "writer")]
pub mod writer_properties;
7 changes: 0 additions & 7 deletions src/lib.rs
@@ -9,15 +9,8 @@ mod arrow2;
mod common;
mod utils;

use wasm_bindgen::prelude::*;

// When the `wee_alloc` feature is enabled, use `wee_alloc` as the global
// allocator.
/*#[cfg(feature = "wee_alloc")]
#[global_allocator]
static ALLOC: wee_alloc::WeeAlloc = wee_alloc::WeeAlloc::INIT;*/

#[wasm_bindgen(js_name = setPanicHook)]
pub fn set_panic_hook() {
crate::utils::set_panic_hook();
}
3 changes: 2 additions & 1 deletion src/utils.rs
@@ -1,14 +1,15 @@
use js_sys::Uint8Array;
use wasm_bindgen::prelude::*;

#[cfg(feature = "console_error_panic_hook")]
#[wasm_bindgen(js_name = setPanicHook)]
pub fn set_panic_hook() {
// When the `console_error_panic_hook` feature is enabled, we can call the
// `set_panic_hook` function at least once during initialization, and then
// we will get better error messages if our code ever panics.
//
// For more details see
// https://github.com/rustwasm/console_error_panic_hook#readme
#[cfg(feature = "console_error_panic_hook")]
console_error_panic_hook::set_once();
}
