
[C++][Parquet] distinct_count broken (always 0) in column chunk statistics #27644

Closed

asfimport opened this issue Feb 26, 2021 · 7 comments

@asfimport
Collaborator

The distinct_count attribute of the column chunk metadata statistics is broken: it always shows 0. This seems to be the case for all column types; I checked with int64 as well as dictionary-encoded string columns:

import pyarrow as pa
import pyarrow.parquet as pq

# A small table with a dictionary-encoded string column.
table = pa.Table.from_pydict({
    'foo': pa.array(['ABC', 'DEF']).dictionary_encode()
})

pq.write_table(table, 'test_row_group_statistics.parquet', version='2.0', data_page_version='2.0')

# Inspect the column chunk statistics of the first row group.
pq_file = pq.ParquetFile('test_row_group_statistics.parquet')
print(pq_file.metadata.row_group(0).column(0).statistics)

Output:


<pyarrow._parquet.Statistics object at 0x0000020A1699D770>
  has_min_max: True
  min: ABC
  max: DEF
  null_count: 0
  distinct_count: 0
  num_values: 2
  physical_type: BYTE_ARRAY
  logical_type: String
  converted_type (legacy): UTF8

Reporter: ARF / @ARF1

Note: This issue was originally created as ARROW-11793. Please see the migration documentation for further details.

@asfimport
Collaborator Author

David Li / @lidavidm:
From a quick grep, it seems it's simply never set by the writer in the first place. Possibly a Parquet file not written by Arrow C++ would set this.
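
This is easy to confirm from Python on any file of interest (an editorial sketch, not from the thread; the file path is a placeholder, and Statistics.has_distinct_count requires a reasonably recent pyarrow):

import pyarrow.parquet as pq

# Editorial sketch: check whether the distinct_count field is actually
# set in the file's column chunk metadata. 'some_file.parquet' is a
# placeholder path.
stats = pq.ParquetFile('some_file.parquet').metadata.row_group(0).column(0).statistics
print(stats.has_distinct_count, stats.distinct_count)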

@asfimport
Collaborator Author

David Rauschenbach:
I think there's more to this. When looking at the TPC-H file customer_0_0.parquet, parquet-tools shows null counts, yet Arrow reports HasNullCount() == FALSE.

$ parquet-tools inspect tpch/customer_0_0.parquet --detail
...
ColumnChunk
    file_offset = 184618
    meta_data = ColumnMetaData
        type = 6
        encodings = list
            3
            0
            4
        path_in_schema = list
            c_address
        num_values = 7500
        total_uncompressed_size = 207320
        total_compressed_size = 207320
        data_page_offset = 184618
        statistics = Statistics
            max = b'zyWvi,SGc,tXTls'
            min = b'   5L06W67,Mw8G'
            null_count = 469
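
For cross-checking, the same column chunk can be inspected from pyarrow (an editorial sketch; the file path comes from the dump above, but the lookup loop is an assumption, not part of the original report):

import pyarrow.parquet as pq

# Editorial sketch: compare what pyarrow reports for the c_address
# column chunk against the parquet-tools dump above.
md = pq.ParquetFile('tpch/customer_0_0.parquet').metadata
for i in range(md.row_group(0).num_columns):
    col = md.row_group(0).column(i)
    if col.path_in_schema == 'c_address':
        print(col.statistics.has_null_count, col.statistics.null_count)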

@mapleFU
Member

mapleFU commented Jul 11, 2023

@pitrou @wgtmac It seems this issue could be closed now that #35989 is merged?

@pitrou
Member

pitrou commented Jul 11, 2023

@mapleFU It seems solved, yes.

@pitrou closed this as completed Jul 11, 2023
@marcin-krystianc

marcin-krystianc commented May 28, 2024

Hi, I've tried this with a recent pyarrow (v16.1.0), and I think it is still broken.
My test code:

import unittest
import tempfile
import os

import pyarrow as pa
import pyarrow.parquet as pq

def get_table():
    pa_arrays = [[1.0, 2.0], [1, 2]]
    column_names = ["c0", "c1"]
    # Create a PyArrow Table from the arrays
    return pa.Table.from_arrays(pa_arrays, names=column_names)

class TestStatistics(unittest.TestCase):

    def test_inmemory_index_data(self):
        with tempfile.TemporaryDirectory(ignore_cleanup_errors=True) as tmpdirname:
            path = os.path.join(tmpdirname, "my.parquet")
            table = get_table()

            pq.write_table(table, path, write_statistics=True)

            pr = pq.ParquetReader()
            pr.open(path)
            float_column = pr.metadata.row_group(0).column(0)
            int_column = pr.metadata.row_group(0).column(1)

            print(pa.__version__)
            print(int_column.statistics)
            print(float_column.statistics)
            self.assertEqual(int_column.physical_type, 'INT64')
            self.assertEqual(int_column.statistics.has_min_max, True)
            self.assertEqual(int_column.statistics.has_distinct_count, True)

            self.assertEqual(float_column.physical_type, 'DOUBLE')
            self.assertEqual(float_column.statistics.has_min_max, True)
            self.assertEqual(float_column.statistics.has_distinct_count, True)

if __name__ == '__main__':
    unittest.main()

Output:

16.1.0
<pyarrow._parquet.Statistics object at 0x7ff97eedd530>
  has_min_max: True
  min: 1
  max: 2
  null_count: 0
  distinct_count: None
  num_values: 2
  physical_type: INT64
  logical_type: None
  converted_type (legacy): NONE
<pyarrow._parquet.Statistics object at 0x7ff97fa3b920>
  has_min_max: True
  min: 1.0
  max: 2.0
  null_count: 0
  distinct_count: None
  num_values: 2
  physical_type: DOUBLE
  logical_type: None
  converted_type (legacy): NONE

Can you clarify whether this is expected to work? Is there any option that can be set to enable distinct count calculation?

@mapleFU
Member

mapleFU commented May 28, 2024

@marcin-krystianc That fix only addressed the problem of distinct_count always reading as 0; None now means the file simply doesn't carry a distinct_count. Those are different semantics.

Besides, it's a bit tricky to maintain the distinct count:

  1. For page-level statistics, maintaining a distinct count across pages might be hard.
  2. For column-chunk statistics, only a dictionary-encoded column that never falls back to plain encoding can carry an exact distinct_count.

The current implementation doesn't compute a distinct count at all; the sketch below illustrates what readers see instead.
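
A minimal sketch of the None-vs-0 distinction, assuming a recent pyarrow (the file name is a placeholder):

import pyarrow as pa
import pyarrow.parquet as pq

# Minimal sketch: Arrow C++ doesn't write distinct_count, so
# has_distinct_count is False and pyarrow reports None rather than a
# misleading 0.
table = pa.Table.from_pydict({'foo': pa.array(['ABC', 'DEF']).dictionary_encode()})
pq.write_table(table, 'distinct_count_demo.parquet')
stats = pq.ParquetFile('distinct_count_demo.parquet').metadata.row_group(0).column(0).statistics
print(stats.has_distinct_count)  # False: the field is not set in the file
print(stats.distinct_count)      # None, not 0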

@mapleFU
Member

mapleFU commented May 28, 2024

We can track #36505 for further development.
