Add parquet-layout binary #3269

Merged: 4 commits, Dec 5, 2022

@tustvold tustvold commented Dec 4, 2022

Which issue does this PR close?

Closes #.

Rationale for this change

When debugging an issue, the first port of call is frequently working out the physical layout of the data in the parquet file: which indexes are present, which encodings are used, how large the pages are, and so on.

I have been unable to find such a tool, so I quickly wrote one up to replace the ad-hoc code I keep having to write 😅

parquet-testing/data/nested_lists.snappy.parquet
{
  "row_groups": [
    {
      "columns": [
        {
          "path": "a.list.element.list.element.list.element",
          "has_offset_index": false,
          "has_column_index": false,
          "has_bloom_filter": false,
          "pages": [
            {
              "compression": "snappy",
              "encoding": "plain_dictionary",
              "page_type": "dictionary",
              "compressed_bytes": 30,
              "uncompressed_bytes": 30,
              "header_bytes": 13,
              "num_values": 6
            },
            {
              "compression": "snappy",
              "encoding": "plain_dictionary",
              "page_type": "data_page_v1",
              "compressed_bytes": 34,
              "uncompressed_bytes": 33,
              "header_bytes": 27,
              "num_values": 18
            }
          ]
        },
        {
          "path": "b",
          "has_offset_index": false,
          "has_column_index": false,
          "has_bloom_filter": false,
          "pages": [
            {
              "compression": "snappy",
              "encoding": "plain_dictionary",
              "page_type": "dictionary",
              "compressed_bytes": 6,
              "uncompressed_bytes": 4,
              "header_bytes": 13,
              "num_values": 1
            },
            {
              "compression": "snappy",
              "encoding": "plain_dictionary",
              "page_type": "data_page_v1",
              "compressed_bytes": 4,
              "uncompressed_bytes": 2,
              "header_bytes": 33,
              "num_values": 3
            }
          ]
        }
      ]
    }
  ]
}
parquet-testing/data/data_index_bloom_encoding_stats.parquet
{
  "row_groups": [
    {
      "columns": [
        {
          "path": "String",
          "has_offset_index": true,
          "has_column_index": true,
          "has_bloom_filter": true,
          "pages": [
            {
              "compression": "gzip",
              "encoding": "plain",
              "page_type": "data_page_v1",
              "compressed_bytes": 127,
              "uncompressed_bytes": 138,
              "header_bytes": 25,
              "num_values": 14
            }
          ]
        }
      ]
    }
  ]
}
parquet-testing/data/alltypes_dictionary.parquet
{
  "row_groups": [
    {
      "columns": [
        {
          "path": "id",
          "has_offset_index": false,
          "has_column_index": false,
          "has_bloom_filter": false,
          "pages": [
            {
              "compression": null,
              "encoding": "plain_dictionary",
              "page_type": "dictionary",
              "compressed_bytes": 8,
              "uncompressed_bytes": 8,
              "header_bytes": 13,
              "num_values": 2
            },
            {
              "compression": null,
              "encoding": "plain_dictionary",
              "page_type": "data_page_v1",
              "compressed_bytes": 9,
              "uncompressed_bytes": 9,
              "header_bytes": 17,
              "num_values": 2
            }
          ]
        },
        {
          "path": "bool_col",
          "has_offset_index": false,
          "has_column_index": false,
          "has_bloom_filter": false,
          "pages": [
            {
              "compression": null,
              "encoding": "plain",
              "page_type": "data_page_v1",
              "compressed_bytes": 7,
              "uncompressed_bytes": 7,
              "header_bytes": 17,
              "num_values": 2
            }
          ]
        },
        {
          "path": "tinyint_col",
          "has_offset_index": false,
          "has_column_index": false,
          "has_bloom_filter": false,
          "pages": [
            {
              "compression": null,
              "encoding": "plain_dictionary",
              "page_type": "dictionary",
              "compressed_bytes": 8,
              "uncompressed_bytes": 8,
              "header_bytes": 13,
              "num_values": 2
            },
            {
              "compression": null,
              "encoding": "plain_dictionary",
              "page_type": "data_page_v1",
              "compressed_bytes": 9,
              "uncompressed_bytes": 9,
              "header_bytes": 17,
              "num_values": 2
            }
          ]
        },
        {
          "path": "smallint_col",
          "has_offset_index": false,
          "has_column_index": false,
          "has_bloom_filter": false,
          "pages": [
            {
              "compression": null,
              "encoding": "plain_dictionary",
              "page_type": "dictionary",
              "compressed_bytes": 8,
              "uncompressed_bytes": 8,
              "header_bytes": 13,
              "num_values": 2
            },
            {
              "compression": null,
              "encoding": "plain_dictionary",
              "page_type": "data_page_v1",
              "compressed_bytes": 9,
              "uncompressed_bytes": 9,
              "header_bytes": 17,
              "num_values": 2
            }
          ]
        },
        {
          "path": "int_col",
          "has_offset_index": false,
          "has_column_index": false,
          "has_bloom_filter": false,
          "pages": [
            {
              "compression": null,
              "encoding": "plain_dictionary",
              "page_type": "dictionary",
              "compressed_bytes": 8,
              "uncompressed_bytes": 8,
              "header_bytes": 13,
              "num_values": 2
            },
            {
              "compression": null,
              "encoding": "plain_dictionary",
              "page_type": "data_page_v1",
              "compressed_bytes": 9,
              "uncompressed_bytes": 9,
              "header_bytes": 17,
              "num_values": 2
            }
          ]
        },
        {
          "path": "bigint_col",
          "has_offset_index": false,
          "has_column_index": false,
          "has_bloom_filter": false,
          "pages": [
            {
              "compression": null,
              "encoding": "plain_dictionary",
              "page_type": "dictionary",
              "compressed_bytes": 16,
              "uncompressed_bytes": 16,
              "header_bytes": 13,
              "num_values": 2
            },
            {
              "compression": null,
              "encoding": "plain_dictionary",
              "page_type": "data_page_v1",
              "compressed_bytes": 9,
              "uncompressed_bytes": 9,
              "header_bytes": 17,
              "num_values": 2
            }
          ]
        },
        {
          "path": "float_col",
          "has_offset_index": false,
          "has_column_index": false,
          "has_bloom_filter": false,
          "pages": [
            {
              "compression": null,
              "encoding": "plain_dictionary",
              "page_type": "dictionary",
              "compressed_bytes": 8,
              "uncompressed_bytes": 8,
              "header_bytes": 13,
              "num_values": 2
            },
            {
              "compression": null,
              "encoding": "plain_dictionary",
              "page_type": "data_page_v1",
              "compressed_bytes": 9,
              "uncompressed_bytes": 9,
              "header_bytes": 17,
              "num_values": 2
            }
          ]
        },
        {
          "path": "double_col",
          "has_offset_index": false,
          "has_column_index": false,
          "has_bloom_filter": false,
          "pages": [
            {
              "compression": null,
              "encoding": "plain_dictionary",
              "page_type": "dictionary",
              "compressed_bytes": 16,
              "uncompressed_bytes": 16,
              "header_bytes": 13,
              "num_values": 2
            },
            {
              "compression": null,
              "encoding": "plain_dictionary",
              "page_type": "data_page_v1",
              "compressed_bytes": 9,
              "uncompressed_bytes": 9,
              "header_bytes": 17,
              "num_values": 2
            }
          ]
        },
        {
          "path": "date_string_col",
          "has_offset_index": false,
          "has_column_index": false,
          "has_bloom_filter": false,
          "pages": [
            {
              "compression": null,
              "encoding": "plain_dictionary",
              "page_type": "dictionary",
              "compressed_bytes": 12,
              "uncompressed_bytes": 12,
              "header_bytes": 13,
              "num_values": 1
            },
            {
              "compression": null,
              "encoding": "plain_dictionary",
              "page_type": "data_page_v1",
              "compressed_bytes": 9,
              "uncompressed_bytes": 9,
              "header_bytes": 17,
              "num_values": 2
            }
          ]
        },
        {
          "path": "string_col",
          "has_offset_index": false,
          "has_column_index": false,
          "has_bloom_filter": false,
          "pages": [
            {
              "compression": null,
              "encoding": "plain_dictionary",
              "page_type": "dictionary",
              "compressed_bytes": 10,
              "uncompressed_bytes": 10,
              "header_bytes": 13,
              "num_values": 2
            },
            {
              "compression": null,
              "encoding": "plain_dictionary",
              "page_type": "data_page_v1",
              "compressed_bytes": 9,
              "uncompressed_bytes": 9,
              "header_bytes": 17,
              "num_values": 2
            }
          ]
        },
        {
          "path": "timestamp_col",
          "has_offset_index": false,
          "has_column_index": false,
          "has_bloom_filter": false,
          "pages": [
            {
              "compression": null,
              "encoding": "plain_dictionary",
              "page_type": "dictionary",
              "compressed_bytes": 24,
              "uncompressed_bytes": 24,
              "header_bytes": 13,
              "num_values": 2
            },
            {
              "compression": null,
              "encoding": "plain_dictionary",
              "page_type": "data_page_v1",
              "compressed_bytes": 9,
              "uncompressed_bytes": 9,
              "header_bytes": 17,
              "num_values": 2
            }
          ]
        }
      ]
    }
  ]
}
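
The output above is a small tree of row groups, columns, and pages. As a minimal sketch of how such a layout could be held in memory and summarised, the following totals the on-disk bytes of one column chunk. The struct and function names are illustrative assumptions rather than the types used in this PR, only a subset of the JSON keys is modelled, and it assumes `compressed_bytes` excludes the page header (as the sizes above suggest):

```rust
/// One page entry, mirroring a subset of the keys in the JSON output.
/// Hypothetical type for illustration; not taken from the PR.
#[derive(Debug, Clone, PartialEq)]
pub struct PageSummary {
    pub page_type: &'static str,
    pub compressed_bytes: u64,
    pub uncompressed_bytes: u64,
    pub header_bytes: u64,
    pub num_values: u64,
}

/// One column chunk entry (hypothetical), identified by its dotted path.
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnSummary {
    pub path: &'static str,
    pub pages: Vec<PageSummary>,
}

/// Total bytes the column chunk occupies on disk: every page header plus
/// every compressed page body.
pub fn on_disk_bytes(column: &ColumnSummary) -> u64 {
    column
        .pages
        .iter()
        .map(|p| p.header_bytes + p.compressed_bytes)
        .sum()
}

/// Column "b" of nested_lists.snappy.parquet, copied from the output above.
pub fn column_b() -> ColumnSummary {
    ColumnSummary {
        path: "b",
        pages: vec![
            PageSummary {
                page_type: "dictionary",
                compressed_bytes: 6,
                uncompressed_bytes: 4,
                header_bytes: 13,
                num_values: 1,
            },
            PageSummary {
                page_type: "data_page_v1",
                compressed_bytes: 4,
                uncompressed_bytes: 2,
                header_bytes: 33,
                num_values: 3,
            },
        ],
    }
}
```

Calling `on_disk_bytes(&column_b())` adds 13 + 6 for the dictionary page and 33 + 4 for the data page. For a file this small, most of the column chunk is page headers, which is exactly the kind of detail the layout output makes visible.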

What changes are included in this PR?

Are there any user-facing changes?

@tustvold tustvold requested a review from alamb December 4, 2022 21:08
@github-actions github-actions bot added the parquet Changes to the parquet crate label Dec 4, 2022
alamb commented Dec 5, 2022

I will review this tomorrow -- the other one I know of is https://github.com/manojkarthick/pqrs which shows promise

// specific language governing permissions and limitations
// under the License.

//! Binary that prints the physical layout of a parquet file

👍

As a general musing, I think some of these CLI helpers are quite nice, and maybe eventually we could make them more discoverable and nicer for people who are not working with the parquet source code.

I didn't find anything other than https://github.com/apache/arrow-rs/tree/master/parquet/src/bin for documentation

Perhaps they could move to pqrs or something similar 🤔


let end = start + column.compressed_size() as u64;
while start != end {
let (header_len, header) = read_page_header(reader, start)?;

this is neat
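
The loop in this hunk walks a column chunk page by page: read a header at `start`, skip over the header and the compressed page body to reach the next header, and stop once `start` reaches the end of the chunk. A self-contained sketch of that traversal pattern follows. It uses a made-up fixed 4-byte little-endian length prefix in place of Parquet's real Thrift-encoded page header, and `read_toy_header`, `walk_pages`, and `build_chunk` are hypothetical names, so treat it as an illustration of the traversal only:

```rust
/// Toy header: a 4-byte little-endian length of the page body that follows.
/// This is an assumption for illustration; real Parquet page headers are
/// Thrift compact-encoded and variable length.
const TOY_HEADER_LEN: usize = 4;

/// Reads the toy header at `start`, returning (header_len, body_len).
fn read_toy_header(chunk: &[u8], start: usize) -> Result<(usize, usize), String> {
    let end = start.checked_add(TOY_HEADER_LEN).ok_or("offset overflow")?;
    let bytes = chunk
        .get(start..end)
        .ok_or_else(|| format!("truncated header at offset {start}"))?;
    let body_len = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]) as usize;
    Ok((TOY_HEADER_LEN, body_len))
}

/// Walks every page in `chunk`, returning (offset, header_len, body_len)
/// for each page in order.
pub fn walk_pages(chunk: &[u8]) -> Result<Vec<(usize, usize, usize)>, String> {
    let mut pages = Vec::new();
    let mut start = 0;
    let end = chunk.len();
    while start < end {
        let (header_len, body_len) = read_toy_header(chunk, start)?;
        let next = start + header_len + body_len;
        // A body that runs past the chunk means the lengths are inconsistent.
        if next > end {
            return Err(format!("page at offset {start} overruns the chunk"));
        }
        pages.push((start, header_len, body_len));
        start = next;
    }
    Ok(pages)
}

/// Builds a toy chunk from page bodies, for experimenting with `walk_pages`.
pub fn build_chunk(bodies: &[&[u8]]) -> Vec<u8> {
    let mut chunk = Vec::new();
    for body in bodies {
        chunk.extend_from_slice(&(body.len() as u32).to_le_bytes());
        chunk.extend_from_slice(body);
    }
    chunk
}
```

The sketch loops on `start < end` with an explicit overrun check so that an inconsistent length surfaces as an error; that is a choice made for this example, not a description of the code under review.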

num_values: data_page.num_values,
})
} else if let Some(data_page) = header.data_page_header_v2 {
let is_compressed = data_page.is_compressed.unwrap_or(true);

is it really compressed by default? I expected unwrap_or(false)

Contributor Author


If the column is compressed and the page header doesn't specify, the default is compressed. I think the flag was a later extension, which is why it is optional.
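
To make the default concrete: in the Parquet format the `is_compressed` field of a v2 data page header is optional and is treated as true when absent, so a page body is stored uncompressed only if the column has no codec or the header explicitly says `false`. A minimal sketch of that rule, using a hypothetical `effective_compression` helper and plain strings for codec names (not the actual types in this PR):

```rust
/// Returns the codec actually applied to a v2 data page body, or `None` if
/// the body is stored uncompressed.
///
/// `column_codec` is the codec declared for the column chunk (`None` means
/// uncompressed); `is_compressed` is the optional flag from the page header.
/// Hypothetical helper for illustration only.
pub fn effective_compression<'a>(
    column_codec: Option<&'a str>,
    is_compressed: Option<bool>,
) -> Option<&'a str> {
    // A missing flag means "compressed", hence `unwrap_or(true)`.
    let is_compressed = is_compressed.unwrap_or(true);
    // Keep the column codec only when the page says it is compressed.
    column_codec.filter(|_| is_compressed)
}
```

With `unwrap_or(false)` instead, a snappy column whose writer omitted the flag would be reported as uncompressed, which would not match how readers are expected to interpret a missing flag.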

@tustvold tustvold merged commit 94d597e into apache:master Dec 5, 2022
ursabot commented Dec 5, 2022

Benchmark runs are scheduled for baseline = b155461 and contender = 94d597e. 94d597e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
