Use custom thrift decoder to improve speed of parsing parquet metadata #5854

alamb · 2024-06-07T12:26:28Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Part of #5853

Parsing the parquet metadata takes substantial time, and most of that time is spent decoding the thrift format (@XiangpengHao is quantifying this in #5770).

Describe the solution you'd like
Improve the thrift decoder speed

Describe alternatives you've considered
@jhorstmann reports on #5775 that he made a prototype of this:

          I had an attack of "not invented here" syndrome the last few days 😅 and worked on an alternative code generator for thrift, that would allow me to more easily try out some changes to the generated code. The repo can be found at <https://github.com/jhorstmann/compact-thrift/> and the output for `parquet.thrift` at <https://github.com/jhorstmann/compact-thrift/blob/main/src/main/rust/tests/parquet.rs>.

The current output is still doing allocations for string and binary, but running the benchmarks from https://github.com/tustvold/arrow-rs/tree/thrift-bench shows some nice improvements. This is the comparison with current arrow-rs code, so both versions should be doing the same amount of allocations:

decode metadata      time:   [32.592 ms 32.645 ms 32.702 ms]

decode metadata new  time:   [17.440 ms 17.476 ms 17.532 ms]

So incidentally very close to that 2x improvement.

The main difference in the code should be avoiding most of the abstractions from TInputProtocol and avoiding stack moves by directly writing into default-initialized structs instead of moving from local variables.

Originally posted by @jhorstmann in #5775 (comment)
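To illustrate the last point in the quoted comment, here is a minimal sketch of the two decoding styles. The `Location` struct, its field ids, and the pre-tokenized `(field_id, value)` input are all hypothetical stand-ins, not taken from arrow-rs or compact-thrift; a real decoder would read these from the wire.

```rust
/// Hypothetical target struct (not the real parquet type).
#[derive(Debug, Default, PartialEq)]
struct Location {
    offset: i64,
    size: i64,
    first_row: i64,
}

/// Style A: collect each field in a local, then move everything into the struct.
fn decode_via_locals(fields: &[(i16, i64)]) -> Location {
    let mut offset = None;
    let mut size = None;
    let mut first_row = None;
    for &(id, value) in fields {
        match id {
            1 => offset = Some(value),
            2 => size = Some(value),
            3 => first_row = Some(value),
            _ => {} // unknown field ids are ignored in this sketch
        }
    }
    Location {
        offset: offset.unwrap_or_default(),
        size: size.unwrap_or_default(),
        first_row: first_row.unwrap_or_default(),
    }
}

/// Style B: start from a default-initialized struct and write fields in place.
fn decode_in_place(fields: &[(i16, i64)]) -> Location {
    let mut out = Location::default();
    for &(id, value) in fields {
        match id {
            1 => out.offset = value,
            2 => out.size = value,
            3 => out.first_row = value,
            _ => {} // unknown field ids are ignored in this sketch
        }
    }
    out
}
```

Both functions produce the same value. Whether style B is actually faster depends on struct size and on what the optimizer does with the locals, so it needs to be measured rather than assumed.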

Additional context


alamb · 2024-06-07T12:29:09Z

FWIW another possibility is to hand-write a thrift decoder for the parquet metadata rather than relying on a code generator. That would likely result in the fastest decode time, but would also be the hardest to maintain.
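For a sense of what a hand-written decoder has to cover, here is a rough sketch of the low-level primitives of the Thrift compact protocol: unsigned varints, zigzag decoding, and the one-byte field header. The function names are made up for this example and are not part of arrow-rs or the `thrift` crate.

```rust
/// Reads an unsigned LEB128 varint from the front of `buf`, advancing it.
/// Returns `None` on truncated input or if the encoding is longer than 64 bits.
fn read_varint(buf: &mut &[u8]) -> Option<u64> {
    let mut value = 0u64;
    let mut shift = 0u32;
    loop {
        let (&byte, rest) = buf.split_first()?;
        *buf = rest;
        if shift >= 64 {
            return None;
        }
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
        shift += 7;
    }
}

/// Maps a zigzag-encoded unsigned value back to a signed integer.
fn zigzag_decode(n: u64) -> i64 {
    ((n >> 1) as i64) ^ -((n & 1) as i64)
}

/// Reads a compact-protocol field header and returns `(field_id, type_code)`.
/// A `type_code` of 0 marks the end of the struct.
fn read_field_header(buf: &mut &[u8], last_id: i16) -> Option<(i16, u8)> {
    let (&byte, rest) = buf.split_first()?;
    *buf = rest;
    let type_code = byte & 0x0f;
    if type_code == 0 {
        return Some((0, 0));
    }
    let delta = i16::from(byte >> 4);
    let id = if delta != 0 {
        // Short form: the high nibble is the delta from the previous field id.
        last_id.checked_add(delta)?
    } else {
        // Long form: the field id follows as a zigzag varint.
        i16::try_from(zigzag_decode(read_varint(buf)?)).ok()?
    };
    Some((id, type_code))
}
```

A full decoder would add strings, lists, nested structs, and skipping of unknown fields on top of these, which is where most of the maintenance cost mentioned above would come from.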

jhorstmann · 2024-06-07T13:30:25Z

Thanks @alamb for creating this tracking issue. I've slowly continued working on my code at jhorstmann/compact-thrift and benchmarks are looking good. So good, in fact, that after adapting it on top of #5777 the performance hotspot shifts to the conversion functions from generated thrift types to internal types.

I would love to get some feedback on the code, and whether there would be a preference to integrate the parquet definitions into the arrow-rs repo, or publish them separately.

The generated and runtime code is also structured in a way that it would not be too crazy to write bindings to custom types by hand.

Direct links to the generated code and to the runtime library.

XiangpengHao · 2024-06-07T14:06:20Z

FWIW, by simply moving this field to the heap (i.e., Option<Statistics> -> Option<Box<Statistics>>), we can get a 30% performance improvement (as will be shown in the blog, #5770).

arrow-rs/parquet/src/format.rs, line 3407 in 087f34b:

pub statistics: Option<Statistics>,

The Option<Statistics> occupies 136 bytes even if the file does not have stats at all (i.e., the field is None); this not only slows down decoding (due to poor memory locality) but also causes high memory consumption when decoding metadata (parquet-rs consumes 10MB memory per MB of metadata).

I think this example motivates custom parquet type definitions and, thus, a custom thrift decoder.
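The size difference can be checked with `std::mem::size_of`. The sketch below uses a stand-in `Statistics` struct whose field list is an assumption modeled on the thrift definition; it is not copied from `format.rs`, so the exact inline size it reports may differ from the 136 bytes quoted above.

```rust
use std::mem::size_of;

/// Stand-in for the generated struct; the field list is an assumption.
#[allow(dead_code)]
struct Statistics {
    max: Option<Vec<u8>>,
    min: Option<Vec<u8>>,
    null_count: Option<i64>,
    distinct_count: Option<i64>,
    max_value: Option<Vec<u8>>,
    min_value: Option<Vec<u8>>,
    is_max_value_exact: Option<bool>,
    is_min_value_exact: Option<bool>,
}

/// Returns `(inline_size, boxed_size)` in bytes for the two field layouts.
fn statistics_field_sizes() -> (usize, usize) {
    (
        // Stored inline: every parent struct pays for the full payload,
        // even when the value is `None`.
        size_of::<Option<Statistics>>(),
        // Stored behind a Box: `None` is represented as a null pointer,
        // so the field is exactly one pointer wide.
        size_of::<Option<Box<Statistics>>>(),
    )
}
```

The trade-off is one extra heap allocation per column chunk that does carry statistics, in exchange for a much smaller parent struct when it does not.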

alamb · 2024-06-10T18:00:42Z

> FWIW, by simply moving this field to the heap (i.e., Option<Statistics> -> Option<Box<Statistics>>), we can get a 30% performance improvement (as will be shown in the blog, #5770).

This code is in #5856 for anyone who is curious

alamb · 2024-06-10T19:03:33Z

> I would love to get some feedback on the code, and whether there would be a preference to integrate the parquet definitions into the arrow-rs repo, or publish them separately.

Hi @jhorstmann -- I had a look at https://github.com/jhorstmann/compact-thrift/tree/main (very cool)

Some initial reactions:
Writing a code generator in Kotlin is a neat idea, but I think it might make the barrier to contribution high (now people need to know Rust and Kotlin, plus the associated toolchains, etc.)

Also, I keep thinking: if we are going to have a parquet-rs specific implementation, why use a code generator at all? Maybe we could simply hand-code a decoder that uses the runtime library directly.

Given how infrequently the parquet spec changes, a hand-rolled parser might be reasonable (though I will admit that the generated format.rs is substantial 🤔 ). We can probably ensure compatibility with round-trip testing against the generated Rust code 🤔

jhorstmann · 2024-06-11T08:23:47Z

> Writing a code generator in Kotlin is a neat idea, but I think it might make the barrier to contribution high (now people need to know Rust and Kotlin, plus the associated toolchains, etc.)

I agree, in the context of arrow-rs this is probably a bigger barrier to contribution than the existing C++ based thrift code generator.

Maybe the amount of code could be reduced and made easier to change by hand with the use of some macros. The trickiest part of the code generation, and the most difficult to replicate in a macro, might be deciding which structs require lifetime annotations.

alamb · 2024-06-13T21:50:57Z

The more I think about this the more I am convinced that the fastest thing to do would be to decode directly from thrift --> the parquet-rs structs. Perhaps we could follow the tape decoding model of the csv or json parsers in this repo 🤔

Decoding to intermediate thrift structures which are then thrown away seems like an obvious source of improvement

jhorstmann · 2024-06-17T21:23:25Z

It occurred to me that the thrift definitions consist entirely of valid rust tokens, and so should be parseable using declarative macros. The result of that experiment can be seen in #5909, the complete macro can be found at https://github.com/jhorstmann/compact-thrift/blob/main/src/main/rust/runtime/src/macros.rs
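As a rough illustration of the idea (not the macro from #5909 or compact-thrift), a declarative macro can match thrift-style field declarations token by token. The `thrift_struct!` macro and the `Example` struct below are hypothetical, and the sketch only handles types whose thrift name happens to be a valid Rust type name, such as `i32` and `i64`.

```rust
/// Hypothetical macro: turns a thrift-style struct definition into a Rust struct.
macro_rules! thrift_struct {
    // Entry point: `struct Name { <id>: <required|optional> <type> <name> [;] ... }`
    (struct $name:ident {
        $($id:literal : $req:ident $ty:ident $field:ident $(;)?)*
    }) => {
        #[derive(Debug, Default, PartialEq)]
        struct $name {
            $( $field: thrift_struct!(@rust_type $req $ty), )*
        }

        impl $name {
            /// Thrift field ids paired with field names, in declaration order.
            const FIELD_IDS: &'static [(i16, &'static str)] =
                &[ $( ($id, stringify!($field)), )* ];
        }
    };
    // Internal rules: map the thrift requiredness keyword to a Rust type.
    (@rust_type required $ty:ident) => { $ty };
    (@rust_type optional $ty:ident) => { Option<$ty> };
}

// The body below is written in thrift syntax, yet it is a valid Rust token stream.
thrift_struct! {
    struct Example {
        1: required i64 offset
        2: optional i32 length
    }
}
```

Types such as `string`, `binary`, or `list<...>` would need extra rules that map them to Rust types, and this sketch does nothing about the lifetime-annotation problem mentioned earlier in the thread.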

alamb · 2024-06-18T10:34:52Z

> It occurred to me that the thrift definitions consist entirely of valid rust tokens, and so should be parseable using declarative macros.

That is really (really) cool @jhorstmann

Maybe we could even use the declarative macros to create a parser that avoids intermediates, by providing callbacks rather than building structs 🤔
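A callback-based decoder along those lines could look roughly like the sketch below. The `RowGroupVisitor` trait and `visit_row_group` function are hypothetical, the field ids 2 and 3 are assumed to correspond to `total_byte_size` and `num_rows` of `RowGroup` in `parquet.thrift`, and the input is a pre-tokenized list of `(field_id, value)` pairs standing in for the real wire decoder.

```rust
/// Hypothetical callback interface: one method per field the caller may care about.
trait RowGroupVisitor {
    fn total_byte_size(&mut self, value: i64);
    fn num_rows(&mut self, value: i64);
}

/// Drives the callbacks for one row group without building an intermediate struct.
fn visit_row_group<V: RowGroupVisitor>(fields: &[(i16, i64)], visitor: &mut V) {
    for &(id, value) in fields {
        match id {
            2 => visitor.total_byte_size(value),
            3 => visitor.num_rows(value),
            _ => {} // fields without a callback are skipped
        }
    }
}

/// Example consumer that only wants the total row count.
struct RowCounter {
    total_rows: i64,
}

impl RowGroupVisitor for RowCounter {
    fn total_byte_size(&mut self, _value: i64) {
        // Not needed by this consumer, so nothing is stored.
    }

    fn num_rows(&mut self, value: i64) {
        self.total_rows += value;
    }
}
```

With this shape, a consumer that only needs a few fields never pays for allocating or converting the rest.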

alamb added the enhancement Any new improvement worthy of a entry in the changelog label Jun 7, 2024

alamb mentioned this issue Jun 7, 2024

[EPIC] A collection of items to improve speed of parquet metadata encoding #5853

Open

4 tasks

alamb added the parquet Changes to the parquet crate label Jun 7, 2024

alamb mentioned this issue Jun 7, 2024

Reduce Allocations When Reading Parquet Metadata #5775

Open

XiangpengHao mentioned this issue Jun 7, 2024

Selective decoding of a subset (e.g. columns or row groups) of parquet metadata #5855

Open

jhorstmann mentioned this issue Jun 11, 2024

Benchmarks for custom parquet format #5869

Closed

jhorstmann mentioned this issue Jun 17, 2024

Generate thrift definitions using macro from compact-thrift-rs #5909

Closed

etseidl mentioned this issue Jun 28, 2024

Add size statistics to ParquetMetaData introduced in PARQUET-2261 #5486

Closed

alamb mentioned this issue Jul 26, 2024

[DISCUSSION] Parquet Metadata Improvements #6129

Open
