
Report / blog on parquet metadata sizes for "large" (1000+) numbers of columns #5770

Closed
Tracked by #5853
alamb opened this issue May 16, 2024 · 38 comments
Labels: arrow (Changes to the arrow crate), documentation (Improvements or additions to documentation), enhancement (Any new improvement worthy of an entry in the changelog), parquet (Changes to the parquet crate)

Comments

@alamb
Contributor

alamb commented May 16, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
There are several proposals for remedying perceived issues with parquet that generally propose new formats, for example Lance V2 and Nimble.

One of the technical challenges raised about Parquet is that the metadata is encoded such that the entire footer must be read and decoded prior to reading any data.

As the number of columns increases, the argument goes, the size of the parquet metadata grows beyond the ~8MB sweet spot for a single object store request, as well as requiring substantial CPU to decode.

However, my theory is that the reason that parquet metadata is typically so large for schemas with many columns is the embedded min/max statistical values for columns / pages

Describe the solution you'd like
I would like to gather data on parquet footer metadata size as a function of:

  1. The number of columns
  2. The number of row groups
  3. Whether statistics are enabled or disabled

And then report this in a blog with some sort of conclusion about how well parquet can handle large schemas

Bonus points if we can also measure in memory size (though this will of course vary from implementation to implementation)
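For concreteness, here is a minimal sketch of how such a wide, all-float test schema might be generated with arrow-rs (the column names, row counts, and helper name are arbitrary choices, not part of any existing benchmark):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Float32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

/// Build a batch with `num_columns` Float32 columns and `num_rows` rows.
/// Column names ("col_0", "col_1", ...) are placeholders.
fn wide_batch(num_columns: usize, num_rows: usize) -> RecordBatch {
    let fields: Vec<Field> = (0..num_columns)
        .map(|i| Field::new(format!("col_{i}"), DataType::Float32, false))
        .collect();
    let schema = Arc::new(Schema::new(fields));
    let columns: Vec<ArrayRef> = (0..num_columns)
        .map(|_| Arc::new(Float32Array::from(vec![1.0_f32; num_rows])) as ArrayRef)
        .collect();
    RecordBatch::try_new(schema, columns).expect("schema matches columns")
}
```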

Describe alternatives you've considered

Additional context
Related discussion with @wesm on twitter: https://twitter.com/wesmckinn/status/1790884370603024826

Cited issue apache/arrow#39676

@alamb alamb added the enhancement label May 16, 2024
@alamb
Contributor Author

alamb commented May 16, 2024

I have hopes that I will find time to work on this next week but am not sure

If anyone knows of prior art in this area (measuring the actual metadata size) it would be great if they linked it

@alamb
Contributor Author

alamb commented May 16, 2024

One idea would be to create datasets with 1000 columns, 2000 columns, 5000 columns and 10000 columns, with say 10 row groups of 1M rows each

Measure the size of the metadata with

  1. default writer settings
  2. "minimal" settings (no statistics, disable everything possible)

@westonpace
Member

westonpace commented May 16, 2024

10K columns by 10 row groups by 1M rows is 100B values (400GB with int32). I don't think anyone has data like that (this is presumptuous, I am probably wrong).

My experience has been either:

  • The files they make are much smaller (and thus contain too few rows or undersized row groups), e.g. financial data where each ticker is a column.
  • The columns are very sparse (and thus there is a need for better sparse encoding), e.g. feature stores.

@westonpace
Member

Also, I did do some experimentation here (what feels like a long time ago now but was at least a year ago). Two things I observed:

  • The in-memory representation of the parquet metadata seemed way too high, orders of magnitude greater than the on-disk size of the metadata. I assume this was specific to the C++ impl and there was just something I was missing but it was pretty significant even with relatively reasonably-sized files (e.g. 50MiB+ per file).
  • The parquet-c++ library (and, to a lesser degree, arrow and arrow datasets) were not written with large #'s of columns in mind. So even if the format itself could work, the implementation ends up being O(N^2) in a lot of places where it could presumably do a lot better.

I realize this second point is kind of what you are trying to show. I'd be interested in measurements of the first point, at least enough to prove I was missing something (or this problem is unique to parquet-c++)

@mapleFU
Member

mapleFU commented May 16, 2024

I don't have specific data right now. In my use case, row groups are usually limited by size rather than row count, and we disable statistics for most of these columns. We only write the page index without the column index in this case

@alamb
Contributor Author

alamb commented May 16, 2024

In my use case, row groups are usually limited by size rather than row count, and we disable statistics for most of these columns. We only write the page index without the column index in this case

Yes, this is what I would expect best practice to be for large schemas (especially if the actual data is sparse as @westonpace says). In general I would expect the value for query engines of min/max statistics in parquet metadata to be quite low for columns where the data isn't sorted or clustered in some way.

Thus, if the default parquet writer settings always write this metadata for all columns, maybe we need to adjust the defaults and documentation, or something along those lines
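For what it's worth, the Rust writer already allows this to be configured per column. A minimal sketch (the column names "timestamp" and "host" are invented for illustration):

```rust
use parquet::file::properties::{EnabledStatistics, WriterProperties};
use parquet::schema::types::ColumnPath;

/// Disable statistics by default and opt back in only for the columns
/// that queries actually filter on.
fn selective_stats_props() -> WriterProperties {
    WriterProperties::builder()
        // default for every column: no statistics at all
        .set_statistics_enabled(EnabledStatistics::None)
        // page-level statistics for the primary filter column
        .set_column_statistics_enabled(ColumnPath::from("timestamp"), EnabledStatistics::Page)
        // chunk-level statistics only for a secondary filter column
        .set_column_statistics_enabled(ColumnPath::from("host"), EnabledStatistics::Chunk)
        .build()
}
```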

@tustvold
Contributor

It may also be worth pointing out that support was recently added to truncate column statistics (page indexes have always supported this), and Spark was actually already doing this for column statistics, with dubious correctness properties. The net result, though, is that even if you do write column statistics for all columns, they need not be very large
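A sketch of what that looks like with the Rust writer properties (assuming a recent parquet release that exposes both truncation settings; the 64-byte cap is an arbitrary choice):

```rust
use parquet::file::properties::WriterProperties;

/// Cap min/max statistics at 64 bytes per value, both for the column chunk
/// statistics in the footer and for the page-level column index.
fn truncated_stats_props() -> WriterProperties {
    WriterProperties::builder()
        .set_statistics_truncate_length(Some(64))
        .set_column_index_truncate_length(Some(64))
        .build()
}
```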

@tustvold
Contributor

I've also filed #5775 to track something I suggested on the mailing list, which would likely drastically reduce the overhead of metadata parsing.

@tustvold
Contributor

tustvold commented May 16, 2024

So I created a toy benchmark (https://github.com/tustvold/arrow-rs/tree/thrift-bench) of a 10,000 column parquet file, with f32 columns, no statistics and 10 row groups.

Default Config

The file metadata for this comes to a chunky 11MB, and it takes the #5777 thrift decoder around 45ms to parse this thrift payload.

If we drop the column count to 1000, the metadata drops to 1MB and parses in ~3.7ms.

If we also drop the row group count to 1, the metadata drops to 100KB and parses in 460 us.

So at a very rough level the cost is ~110 bytes and ~450ns per column chunk

Drop ColumnMetadata

Dropping the ColumnMetadata from ColumnChunk drops the size down to 3.6MB and the parsing time down to 23ms

Drop ColumnChunk

Dropping the columns drops the size down to 109KB and the parsing time down to 718 us.

At this point we have effectively dropped everything apart from the schema.

Arrow Schema IPC

Now for comparison, I encoded a similar schema using arrow IPC to a flatbuffer. This came to a still pretty chunky 500KB. Validating the offsets in this flatbuffer takes ~1ms, which is longer than decoding the equivalent data from thrift.
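For reference, one way to measure the size of an IPC-encoded schema with arrow-rs (an approximation; this is not necessarily the exact encoding used for the number above, and `schema_to_bytes` may be deprecated in newer releases):

```rust
use arrow::datatypes::{DataType, Field, Schema};
use arrow::ipc::writer::{IpcDataGenerator, IpcWriteOptions};

/// Size in bytes of a `num_columns`-wide float schema encoded as an
/// Arrow IPC (flatbuffer) schema message.
fn ipc_schema_size(num_columns: usize) -> usize {
    let fields: Vec<Field> = (0..num_columns)
        .map(|i| Field::new(format!("col_{i}"), DataType::Float32, false))
        .collect();
    let schema = Schema::new(fields);
    let encoded = IpcDataGenerator::default()
        .schema_to_bytes(&schema, &IpcWriteOptions::default());
    encoded.ipc_message.len()
}
```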

Thoughts

  • There isn't anything wrong with using thrift; it is competitive with, if not faster than, flatbuffers (when doing full offset validation)
  • The cost of encoding ColumnChunk dominates and is O(num_row_groups * num_columns)
  • Having a single row group drops the latency down to 2ms even with 10,000 columns. Even if each column only has a single 1MB page, the file will be >10GB even with a single row group

Phrasing the above differently, assuming single page ColumnChunks of 1MB, we can parse the metadata for a column chunk in 450ns. This means for a very large 10GB file we can still parse the metadata within single-digit milliseconds.
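For anyone wanting to reproduce the decode timings, a rough sketch against the parquet crate (not the benchmark linked above; `decode_metadata` may be deprecated in newer releases in favor of `ParquetMetaDataReader`):

```rust
use std::time::{Duration, Instant};

use parquet::file::footer::decode_metadata;

/// Time how long it takes to decode the thrift footer of an in-memory parquet
/// file. `file` holds the complete file bytes (e.g. from the writer sketch above).
fn time_metadata_decode(file: &[u8]) -> Duration {
    let n = file.len();
    // the last 8 bytes are a 4-byte little-endian metadata length plus "PAR1"
    let meta_len = u32::from_le_bytes(file[n - 8..n - 4].try_into().unwrap()) as usize;
    let meta = &file[n - 8 - meta_len..n - 8];

    let start = Instant::now();
    let parsed = decode_metadata(meta).unwrap();
    let elapsed = start.elapsed();

    assert!(parsed.num_row_groups() > 0); // keep the decode from being optimized away
    elapsed
}
```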

@emkornfield
Contributor

emkornfield commented May 17, 2024

Nice work. I do think #5770 (comment) is probably the key: statistics would make things considerably worse, so adjusting defaults across implementations probably makes sense. I'm not sure what Java is set to, but I think C++ will always write all statistics.

@marcin-krystianc

10K columns by 10 row groups by 1M rows is 100B values (400GB with int32). I don't think anyone has data like that (this is presumptuous, I am probably wrong).

My experience has been either:

  • The files they make are much smaller (and thus contain too few rows or undersized row groups), e.g. financial data where each ticker is a column.
  • The columns are very sparse (and thus there is a need for better sparse encoding), e.g. feature stores.

Hi, we use parquet files with 100 row groups and 50k columns (and this is after the dataset has been split into many individual parquet files). What is worse, our use case is reading individual row groups and only a subset of columns.
That makes the cost of reading the entire metadata footer even higher than the cost of reading the actual data (because we read the entire footer and then read only a tiny subset of the actual data).

To deal with the problem we've implemented a tool that stores index information in a separate file, which allows reading only the necessary subset of metadata (https://github.com/G-Research/PalletJack).

@tustvold
Contributor

we use parquet files with 100 row groups and 50k columns (and this is after the dataset has been split into many individual parquet files). What is worse, our use case is reading individual row groups and only a subset of columns.

How big are these files? At 1MB per page (the recommended size) this would come to 5TB?!

For extremely sparse data, such as feature stores, I can't help wondering if MapArray or similar would be a better way to encode this. You would lose the ability to do projection pushdown in the traditional sense, but maybe this is ok?
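A minimal sketch of what the MapArray idea could look like with arrow-rs's `MapBuilder` (the feature names and values are invented):

```rust
use arrow::array::{Float32Builder, MapArray, MapBuilder, StringBuilder};

/// One row per sample; each row stores only the features that are present,
/// as feature-name -> value entries in a map.
fn sparse_features() -> MapArray {
    let mut builder = MapBuilder::new(None, StringBuilder::new(), Float32Builder::new());

    // row 0: only two of the (potentially tens of thousands of) features are set
    builder.keys().append_value("feature_17");
    builder.values().append_value(0.25);
    builder.keys().append_value("feature_903");
    builder.values().append_value(1.5);
    builder.append(true).unwrap();

    // row 1: no features present for this sample
    builder.append(true).unwrap();

    builder.finish()
}
```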

@marcin-krystianc

How big are these files? At 1MB per page (the recommended size) this would come to 5TB?!

I think the file size depends on the number of rows per row group. With 100 row groups, 50k columns and a single row per row group the file is about 700MB (no compression, default page size, no stats).
We use files with tens of thousands of rows per row group, and the file sizes are about 1TB.

@tustvold
Contributor

 the file sizes are about 1TB

Err... Is this a reasonable size for a single parquet file? I'm more accustomed to seeing parquet files on the order of 100MB to single digit GB, with a separate catalog combining multiple files together for query

@thinkharderdev
Contributor

the file sizes are about 1TB

Err... Is this a reasonable size for a single parquet file? I'm more accustomed to seeing parquet files on the order of 100MB to single digit GB, with a separate catalog combining multiple files together for query

1TB does seem to be quite large, but using smaller files requires reading more metadata (with the associated IO). In our system we write files in the range of 50-300MB, and reading metadata was so expensive on large queries (we measured it as taking ~30% of total query processing time in some cases) that we built an entire separate system similar to https://github.com/G-Research/PalletJack to deal with it.

@tustvold
Contributor

tustvold commented May 17, 2024

that we built an entire separate system similar

My reading of https://github.com/G-Research/PalletJack is it is filling a similar role to a catalog like Hive, Deltalake or iceberg. It makes sense, at least to me, that applications would want to build additional metadata structures over the top of collections of parquet files, that are then optimised for their particular read/write workloads?

@thinkharderdev
Contributor

that we built an entire separate system similar

My reading of https://github.com/G-Research/PalletJack is it is filling a similar role to a catalog like Hive, Deltalake or iceberg. It makes sense to me that applications would want to build additional metadata structures over the top of collections of parquet files that are optimised for their particular read/write workloads, and that by design these would not be a part of the storage format itself?

I'm not sure that's right. A catalog can do many things, and certainly some of those things don't belong in the storage format (eg grouping individual parquet files together into some logical group relevant to the query or something), but if the catalog is just directly duplicating data from the parquet footer because reading the footer is too expensive, that seems like something that should be addressed in the storage format itself.

@tustvold
Contributor

tustvold commented May 17, 2024

directly duplicating data from the parquet footer because reading the footer is too expensive

But data locality is extremely important? If you have to scan a load of files only to ascertain they're not of interest, that will be wasteful regardless of how optimal the storage format is? Most catalogs collocate file statistics from across multiple files so that the number of files can be quickly and cheaply whittled down. Only then does it consult those files that haven't been eliminated and perform more granular push down to the row group and page level using the statistics embedded in those files.

Or at least that's the theory...

@thinkharderdev
Contributor

directly duplicating data from the parquet footer because reading the footer is too expensive

But data locality is extremely important? If you have to scan a load of files only to ascertain they're not of interest, that will be wasteful regardless of how optimal the storage format is? Most catalogs collocate file statistics from across multiple files so that the number of files can be quickly and cheaply whittled down. Only then does it consult those files that haven't been eliminated and perform more granular push down to the row group and page level using the statistics embedded in those files.

Or at least that's the theory...

Right, but there are two different levels here. You can store statistics that allow you to prune based only on metadata, but what's left you just have to go and scan. And the scanning requires reading the column chunk metadata. So if the catalog tells you that you need to scan 100k files with 10k columns each and you are only projecting 5 columns (which pretty well describes what we do for every query), then reading the entire parquet footer for each file to get the 5 column chunks is going to really add up.

How much of that cost is related to thrift decoding specifically? I think not much, and the real issue is just IO. But if you could read a file-specific metadata header that allows you to prune IO by reading only the column chunk metadata you actually need (whether it is encoded using thrift or flatbuffers or protobuf or XYZ) then I think that could definitely help a lot.

Then again, when dealing with object storage doing a bunch of random access reads can be counter-productive so ¯\_(ツ)_/¯

@tustvold
Contributor

tustvold commented May 17, 2024

Then again, when dealing with object storage doing a bunch of random access reads can be counter-productive so ¯\_(ツ)_/¯

This 1000%; even with the new SSD-backed stores, latencies are still on the order of milliseconds

So if the catalog tells you that you need to scan 100k files with 10k columns each and you are only projecting 5 columns (which pretty well describes what we do for every query), then reading the entire parquet footer for each file to get the 5 column chunks is going to really add up

Yeah at that point your only option really is to alter the write path to distribute the data in such a way that queries can eliminate files, no metadata chicanery is going to save you if you have to open 100K files 😅

@thinkharderdev
Contributor

Then again, when dealing with object storage doing a bunch of random access reads can be counter-productive so ¯\_(ツ)_/¯

This 1000%; even with the new SSD-backed stores, latencies are still on the order of milliseconds

So if the catalog tells you that you need to scan 100k files with 10k columns each and you are only projecting 5 columns (which pretty well describes what we do for every query), then reading the entire parquet footer for each file to get the 5 column chunks is going to really add up

Yeah at that point your only option really is to alter the write path to distribute the data in such a way that queries can eliminate files, no metadata chicanery is going to save you if you have to open 100K files 😅

Nimble (the new Meta storage format) has an interesting approach to this. From what I understand they store most of the data that would go in the parquet footer inline in the column chunk itself. So the footer just needs to describe how the actual column chunks are laid out (along with the schema) and since you need to fetch the column chunk anyway you can avoid reading all the column-chunk specific stuff for columns you don't care about. But they also don't support predicate pushdown, and so once you add that in you end up with more IOPS to prune data pages (I think)

@mapleFU
Member

mapleFU commented May 17, 2024

So the footer just needs to describe how the actual column chunks are laid out (along with the schema) and since you need to fetch the column chunk anyway you can avoid reading all the column-chunk specific stuff for columns you don't care about

This sounds like the "Drop ColumnMetadata" case in this section: #5770 (comment). Could this also be enabled in standard parquet then?

@thinkharderdev
Contributor

So the footer just needs to describe how the actual column chunks are laid out (along with the schema) and since you need to fetch the column chunk anyway you can avoid reading all the column-chunk specific stuff for columns you don't care about

This sounds like the "Drop ColumnMetadata" case in this section: #5770 (comment). Could this also be enabled in standard parquet then?

I don't think so because you need the column chunk metadata to actually read the column.

@mapleFU
Member

mapleFU commented May 17, 2024

I don't think so because you need the column chunk metadata to actually read the column.

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L863-L867 ColumnChunk is necessary, but its ColumnMetaData is optional. This might make IO estimation a bit tricky, but it would keep the ColumnMetaData out of the footer. And you're right that we need the column chunk metadata to actually read the column.

@thinkharderdev
Contributor

I don't think so because you need the column chunk metadata to actually read the column.

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L863-L867 ColumnChunk is necessary, but its ColumnMetaData is optional. This might make IO estimation a bit tricky, but it would keep the ColumnMetaData out of the footer. And you're right that we need the column chunk metadata to actually read the column.

Yeah, you're right and I think I just misunderstood the original comment (it gets confusing between the Rust types in parquet and the thrift structs defined by the parquet thrift IDL :)). But I guess you could keep the ColumnChunk to the bare minimum of just an offset and file path, and then do the decoding purely based on the data page headers.

@tustvold
Contributor

I created a thread on the parquet mailing list to get some other perspectives on this topic, as I think it is fundamental to the expectations of parquet's metadata - https://lists.apache.org/thread/lv5gtfsbqhs7nrxclrq97nt50msxdst2

@marcin-krystianc

the file sizes are about 1TB

Err... Is this a reasonable size for a single parquet file? I'm more accustomed to seeing parquet files on the order of 100MB to single digit GB, with a separate catalog combining multiple files together for query

If the size of the entire dataset is tens of TBs then you need to use larger files or you end up having a lot of files (many thousands), which is hard to manage. Also, thinking about the future of ML, it is clear to me that a requirement for even larger datasets is not unrealistic.

@marcin-krystianc

that we built an entire separate system similar

My reading of https://github.com/G-Research/PalletJack is it is filling a similar role to a catalog like Hive, Deltalake or iceberg. It makes sense, at least to me, that applications would want to build additional metadata structures over the top of collections of parquet files, that are then optimised for their particular read/write workloads?

PalletJack operates at a lower level than catalogs like Hive, Deltalake or Iceberg. It is designed for use with individual parquet files, for the specific use case of "To be able to decode/parse only a minimum amount of parquet metadata to be able to read the requested sample of data from a parquet file".

@tustvold
Contributor

tustvold commented May 20, 2024

Catalogs are designed to easily scale to PBs across potentially millions of files; if scalability is the goal, an approach tied to single files will inevitably hit a scaling wall

To be able to decode/parse only a minimum amount of parquet metadata to be able to read the requested sample of data from a parquet file

This is what catalogs are designed to provide, but they can do so across file boundaries, significantly improving performance and scalability

@alamb
Contributor Author

alamb commented May 28, 2024

@XiangpengHao has kindly agreed to do some more analysis and a writeup

@XiangpengHao
Contributor

I'm happy to experiment with this, please assign me!

@alamb
Contributor Author

alamb commented May 28, 2024

@XiangpengHao and @tustvold and @wiedld spoke a little about this, and here was my understanding of the plan:

We will run experiments (perhaps based on @tustvold 's tool from #5770 (comment))

Specifically create these scenario:

  1. Create schemas with 1000 columns, 5k columns, 10k columns, 20k columns (and maybe 100k columns)
  2. Make all columns floats (to model a machine learning use case)
  3. Write a parquet file with 10M rows, using 1M-row row groups
  4. Try three different writer settings (sketched at the end of this comment): 1. default (use the Rust writer defaults), 2. turn off all statistics, etc., to minimize the size of the metadata, 3. maximum statistics (turn on full statistics, including page level statistics, don't truncate statistics lengths, etc.)

Then, for each scenario, measure:

  1. Size of the metadata footer (in bytes)
  2. Time to decode (time to read the ParquetMetaData from the file)

This should result in a table like the following, for each of "metadata size in bytes" and "decode performance":

| Writer Properties | 1k columns | 2k columns | 5k columns | 10k columns | 20k columns |
|-------------------|------------|------------|------------|-------------|-------------|
| Default Properties | X | X | X | X | X |
| Minimum Statistics | X | X | X | X | X |
| Maximum Statistics | X | X | X | X | X |

Other potential experiments to run:

  • Add string/binary columns to the schema (maybe 10% of the columns) -- I expect the metadata to be much larger
  • Try other writer settings (e.g. show the effect of metadata truncation)
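A sketch of the three writer settings, assuming current Rust writer property names (the exact knobs, especially the truncation settings, may differ between releases):

```rust
use parquet::file::properties::{EnabledStatistics, WriterProperties};

/// Scenario 1: the writer defaults.
fn default_props() -> WriterProperties {
    WriterProperties::builder().build()
}

/// Scenario 2: minimize metadata -- no statistics at all.
fn minimal_props() -> WriterProperties {
    WriterProperties::builder()
        .set_statistics_enabled(EnabledStatistics::None)
        .build()
}

/// Scenario 3: maximum statistics -- page-level statistics and no truncation.
fn maximum_props() -> WriterProperties {
    WriterProperties::builder()
        .set_statistics_enabled(EnabledStatistics::Page)
        .set_statistics_truncate_length(None)
        .set_column_index_truncate_length(None)
        .build()
}
```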

@alamb
Contributor Author

alamb commented Jun 10, 2024

Is anyone willing to review a draft blog post on this subject? Perhaps @jhorstmann ?

If so please email me at alamb@influxdata.com

@alamb
Contributor Author

alamb commented Jun 19, 2024

@alamb alamb closed this as completed Jun 19, 2024
@alamb alamb added the documentation label Jul 2, 2024
@alamb
Contributor Author

alamb commented Jul 2, 2024

label_issue.py automatically added labels {'documentation'} from #5863

@alamb alamb added the parquet label Jul 2, 2024
@alamb
Contributor Author

alamb commented Jul 2, 2024

label_issue.py automatically added labels {'parquet'} from #5863

@alamb alamb added the arrow label Jul 2, 2024
@alamb
Contributor Author

alamb commented Jul 2, 2024

label_issue.py automatically added labels {'arrow'} from #5798

@adriangb
Contributor

I'll point out that we've had good results by thinking about what sorts of queries we want to run on our data and only storing stats for those columns. It's a trivial optimization, similar to not creating an index for every column in an OLTP database, but I don't see it discussed much / a lot of high-level writers don't give you that choice.
