Skip to content

IO Bench is a library designed to benchmark the performance of standard flat file formats and partitioning schemes.

License

Notifications You must be signed in to change notification settings

aastopher/io_bench

Repository files navigation

PyPI version Documentation Status codecov DeepSource

IOBench Quick Start Guide

Generating Sample Data

To generate sample data, initialize the IOBench object with the path to the source CSV file and call the generate_sample method:

from io_bench import IOBench

bench = IOBench(source_file='./data/source_100K.csv', runs=20, parsers=['avro', 'parquet_polars', 'parquet_arrow', 'parquet_fast', 'feather', 'feather_arrow'])
bench.generate_sample(records=100000) # default value

NOTE: source_file behavior is contextual; providing a desired name for a sample file then calling generate_sample will create the file. Otherwise a valid path to an existing file must be provided.

Converting Data to Partitioned Formats

Convert the generated CSV data to partitioned formats (Avro, Parquet, Feather) will automatically partition on default column selection chunks if not defined.

bench.partition(rows={'avro': 500000, 'parquet': 3000000, 'feather': 1600000})

Running Benchmarks

NOTE: Partition is stateful per bench object. If partition is not called manually it will automatically be called on the first run only assuming a valid source file exists.

Without Column Selection

Run benchmarks without column selection:

benchmarks_no_select = bench.run(suffix='_no_select')

With Column Selection

Run benchmarks with column selection:

columns = ['Region', 'Country', 'Total Cost']
benchmarks_column_select = bench.run(columns=columns, suffix='_column_select')

Generating Reports

Combine results and generate the final report:

all_benchmarks = benchmarks_no_select + benchmarks_column_select
io_bench.report(all_benchmarks, report_dir='./result')

Full Example

Here is a full example of using IOBench:

from io_bench import IOBench

def main() -> None:
    # Initialize the IOBench object with runs and parsers
    bench = IOBench(source_file='./data/source_100K.csv', runs=20, parsers=['avro', 'parquet_polars'])

    # Generate sample data - (optional)
    bench.generate_sample()

    # Convert the source file to partitioned formats - (optional)
    bench.partition(rows={'avro': 500000, 'parquet': 3000000, 'feather': 1600000})

    # Run benchmarks without column selection
    benchmarks_no_select = bench.run(suffix='_no_select')

    # Run benchmarks with column selection
    columns = ['Region', 'Country', 'Total Cost']
    benchmarks_column_select = bench.run(columns=columns, suffix='_column_select')

    # Combine results and generate the final report
    all_benchmarks = benchmarks_no_select + benchmarks_column_select
    bench.report(all_benchmarks, report_dir='./result')

if __name__ == "__main__":
    main()

About

IO Bench is a library designed to benchmark the performance of standard flat file formats and partitioning schemes.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages