[C++] Parquet files with fp16 columns should be faster to read than fp32 #43891

marcin-krystianc · 2024-08-30T13:31:23Z

Describe the enhancement requested

We want to start using fp16 data for our Ml workflows. We hoped for disk space savings, reduced RAM consumption, and doubled reading performance. Parquet files with fp16 columns are indeed smaller, but the reading performance is worse than that of fp32 files.
I think that the reason for that non-optimal performance is the usage of the FIXED_LEN_BYTE_ARRAY physical type for fp16 values. It makes the reader to memcpy a single fp16 value at a time.

Test code

import pyarrow.parquet as pq
import pyarrow as pa
import numpy as np
import humanize
import time
import os

row_groups = 1
n_columns = 7_000
chunk_size = 64_000
n_rows = row_groups * chunk_size
work_items = 2

parquet_path = "/tmp/my.parquet"

def get_table(n_rows, n_columns, data_type = pa.float32()):
    # Generate a random 2D array of floats using NumPy
    # Each column in the array represents a column in the final table
    data = np.random.rand(n_rows, n_columns).astype(np.float32)

    # Convert the NumPy array to a list of PyArrow Arrays, one for each column
    pa_arrays = [pa.array(data[:, i]).cast(data_type, safe = False) for i in range(n_columns)]
    schema = pa.schema([(f'column_{i}', data_type) for i in range(n_columns)])
    # Create a PyArrow Table from the Arrays
    return pa.Table.from_arrays(pa_arrays, schema=schema)

def worker_arrow_row_group():
    pr = pq.ParquetReader()
    pr.open(parquet_path, pre_buffer=True)
    pr.read_row_groups(range(row_groups), use_threads=False)

def genrate_data(n_rows, n_columns, path, compression, dtype):

    table = get_table(n_rows, n_columns, dtype)

    t = time.time()
    print(f"writing parquet file:{path}, columns={n_columns}, row_groups={row_groups}, rows={n_rows}, compression={compression}, dtype={dtype}")
    
    pq.write_table(table, path, row_group_size=chunk_size, use_dictionary=False, write_statistics=False, compression=compression, store_schema=False)
    parquet_size = os.stat(path).st_size
    print(f"Parquet size={humanize.naturalsize(parquet_size)}")

    dt = time.time() - t
    print(f"finished writing parquet file in {dt:.2f} seconds")

def measure_reading(worker):

    tt = []
    # measure multiple times and take the fastest run
    for _ in range(0, 11):
        t = time.time()
        worker()
        tt.append(time.time() - t)

    return min(tt)

for dtype in [pa.float32(), pa.float16()]:
    
    print(f".")
    genrate_data(n_rows, n_columns, path = parquet_path, compression = None, dtype = dtype)
    print(f"`ParquetReader.read_row_groups`, dtype:{dtype}, duration:{measure_reading(worker_arrow_row_group):.2f} seconds")

Results:

writing parquet file:/tmp/my.parquet, columns=7000, row_groups=1, rows=64000, compression=None, dtype=float
Parquet size=1.8 GB
finished writing parquet file in 2.00 seconds
`ParquetReader.read_row_groups`, dtype:float, duration:0.88 seconds

writing parquet file:/tmp/my.parquet, columns=7000, row_groups=1, rows=64000, compression=None, dtype=halffloat
Parquet size=897.2 MB
finished writing parquet file in 2.20 seconds
`ParquetReader.read_row_groups`, dtype:halffloat, duration:2.49 seconds

Component(s)

C++, Python

The text was updated successfully, but these errors were encountered:

pitrou · 2024-09-04T14:51:56Z

Thanks for reporting this and thanks a lot for the reproducer. It's true that reading FIXED_LEN_BYTE_ARRAY is currently much less efficient than it could be.

marcin-krystianc added the Type: enhancement label Aug 30, 2024

github-actions bot added Component: C++ Component: Python labels Aug 30, 2024

pitrou self-assigned this Sep 4, 2024

pitrou changed the title ~~PyArrow: Parquet files with fp16 columns should be faster to read than fp32~~ [C++] Parquet files with fp16 columns should be faster to read than fp32 Sep 4, 2024

pitrou mentioned this issue Sep 11, 2024

[C++][Parquet] Add Float16 reading benchmarks #44072

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C++] Parquet files with fp16 columns should be faster to read than fp32 #43891

[C++] Parquet files with fp16 columns should be faster to read than fp32 #43891

marcin-krystianc commented Aug 30, 2024

pitrou commented Sep 4, 2024

[C++] Parquet files with fp16 columns should be faster to read than fp32 #43891

[C++] Parquet files with fp16 columns should be faster to read than fp32 #43891

Comments

marcin-krystianc commented Aug 30, 2024

Describe the enhancement requested

Component(s)

pitrou commented Sep 4, 2024