
[C++][Parquet] distinct_count broken (always 0) in column chunk statistics #27644

Closed

asfimport opened this issue Feb 26, 2021 · 7 comments

@asfimport
Collaborator

The distinct_count attribute of the column chunk metadata statistics is broken: it always shows 0. This seems to be the case for all column types; I checked with int64 as well as dictionary-encoded string columns:

import pyarrow as pa
import pyarrow.parquet as pq

# A small table with a dictionary-encoded string column.
table = pa.Table.from_pydict({
    'foo': pa.array(['ABC', 'DEF']).dictionary_encode()
})

pq.write_table(table, 'test_row_group_statistics.parquet', version='2.0', data_page_version='2.0')

# Inspect the column chunk statistics of the first row group.
pq_file = pq.ParquetFile('test_row_group_statistics.parquet')
print(pq_file.metadata.row_group(0).column(0).statistics)

Output:


<pyarrow._parquet.Statistics object at 0x0000020A1699D770>
  has_min_max: True
  min: ABC
  max: DEF
  null_count: 0
  distinct_count: 0
  num_values: 2
  physical_type: BYTE_ARRAY
  logical_type: String
  converted_type (legacy): UTF8

Reporter: ARF / @ARF1

Note: This issue was originally created as ARROW-11793. Please see the migration documentation for further details.

@asfimport
Collaborator Author

David Li / @lidavidm:
From a quick grep, it seems it's simply never set by the writer in the first place. Possibly a Parquet file not written by Arrow C++ would set this.
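
This is easy to confirm from Python on any file of interest (an editorial sketch, not from the thread; the file path is a placeholder, and Statistics.has_distinct_count requires a reasonably recent pyarrow):

import pyarrow.parquet as pq

# Editorial sketch: check whether the distinct_count field is actually
# set in the file's column chunk metadata. 'some_file.parquet' is a
# placeholder path.
stats = pq.ParquetFile('some_file.parquet').metadata.row_group(0).column(0).statistics
print(stats.has_distinct_count, stats.distinct_count)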

@asfimport
Collaborator Author

David Rauschenbach:
I think there's more to this. When looking at the TPC-H file customer_0_0.parquet, parquet-tools shows null counts, yet Arrow reports HasNullCount() == FALSE.

$ parquet-tools inspect tpch/customer_0_0.parquet --detail
...
ColumnChunk
    file_offset = 184618
    meta_data = ColumnMetaData
        type = 6
        encodings = list
            3
            0
            4
        path_in_schema = list
            c_address
        num_values = 7500
        total_uncompressed_size = 207320
        total_compressed_size = 207320
        data_page_offset = 184618
        statistics = Statistics
            max = b'zyWvi,SGc,tXTls'
            min = b'   5L06W67,Mw8G'
            null_count = 469
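
For cross-checking, the same column chunk can be inspected from pyarrow (an editorial sketch; the file path comes from the dump above, but the lookup loop is an assumption, not part of the original report):

import pyarrow.parquet as pq

# Editorial sketch: compare what pyarrow reports for the c_address
# column chunk against the parquet-tools dump above.
md = pq.ParquetFile('tpch/customer_0_0.parquet').metadata
for i in range(md.row_group(0).num_columns):
    col = md.row_group(0).column(i)
    if col.path_in_schema == 'c_address':
        print(col.statistics.has_null_count, col.statistics.null_count)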

@mapleFU
Member

mapleFU commented Jul 11, 2023

@pitrou @wgtmac It seems this issue could be closed now that #35989 is merged?

@pitrou
Member

pitrou commented Jul 11, 2023

@mapleFU It seems solved, yes.

@pitrou closed this as completed Jul 11, 2023
@marcin-krystianc

marcin-krystianc commented May 28, 2024

Hi, I've tried this with a recent pyarrow (v16.1.0), and I think it is still broken.
My test code:

import unittest
import tempfile
import os

import pyarrow as pa
import pyarrow.parquet as pq

def get_table():
    pa_arrays = [[1.0, 2.0], [1, 2]]
    column_names = ["c0", "c1"]
    # Create a PyArrow Table from the arrays
    return pa.Table.from_arrays(pa_arrays, names=column_names)

class TestStatistics(unittest.TestCase):

    def test_inmemory_index_data(self):
        with tempfile.TemporaryDirectory(ignore_cleanup_errors=True) as tmpdirname:
            path = os.path.join(tmpdirname, "my.parquet")
            table = get_table()

            pq.write_table(table, path, write_statistics=True)

            pr = pq.ParquetReader()
            pr.open(path)
            float_column = pr.metadata.row_group(0).column(0)
            int_column = pr.metadata.row_group(0).column(1)

            print(pa.__version__)
            print(int_column.statistics)
            print(float_column.statistics)
            self.assertEqual(int_column.physical_type, 'INT64')
            self.assertEqual(int_column.statistics.has_min_max, True)
            self.assertEqual(int_column.statistics.has_distinct_count, True)

            self.assertEqual(float_column.physical_type, 'DOUBLE')
            self.assertEqual(float_column.statistics.has_min_max, True)
            self.assertEqual(float_column.statistics.has_distinct_count, True)

if __name__ == '__main__':
    unittest.main()

Output:

16.1.0
<pyarrow._parquet.Statistics object at 0x7ff97eedd530>
  has_min_max: True
  min: 1
  max: 2
  null_count: 0
  distinct_count: None
  num_values: 2
  physical_type: INT64
  logical_type: None
  converted_type (legacy): NONE
<pyarrow._parquet.Statistics object at 0x7ff97fa3b920>
  has_min_max: True
  min: 1.0
  max: 2.0
  null_count: 0
  distinct_count: None
  num_values: 2
  physical_type: DOUBLE
  logical_type: None
  converted_type (legacy): NONE

Can you clarify whether this is expected to work? Is there any option that can be set to enable distinct count calculation?

@mapleFU
Member

mapleFU commented May 28, 2024

@marcin-krystianc That fix only addressed the problem of distinct_count always reading as 0; None now means the file simply doesn't carry a distinct_count. Those are different semantics.

Besides, it's a bit tricky to maintain the distinct count:

  1. For page-level statistics, maintaining a distinct count across pages might be hard.
  2. For column-chunk statistics, only a dictionary-encoded column that never falls back to plain encoding can carry an exact distinct_count.

The current implementation doesn't compute a distinct count at all; the sketch below illustrates what readers see instead.
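
A minimal sketch of the None-vs-0 distinction, assuming a recent pyarrow (the file name is a placeholder):

import pyarrow as pa
import pyarrow.parquet as pq

# Minimal sketch: Arrow C++ doesn't write distinct_count, so
# has_distinct_count is False and pyarrow reports None rather than a
# misleading 0.
table = pa.Table.from_pydict({'foo': pa.array(['ABC', 'DEF']).dictionary_encode()})
pq.write_table(table, 'distinct_count_demo.parquet')
stats = pq.ParquetFile('distinct_count_demo.parquet').metadata.row_group(0).column(0).statistics
print(stats.has_distinct_count)  # False: the field is not set in the file
print(stats.distinct_count)      # None, not 0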

@mapleFU
Member

mapleFU commented May 28, 2024

We can track #36505 for further development.
