An extension to Apache Arrow based on their CSV reader for reading fixed-width files (tabular data where all fields in a column are padded to the same number of bytes) to Arrow tables. Written in C++, with Python bindings. Supports various encodings.
For an example of how to use this project, see notebooks/fwfr_sample_usage.ipynb.
Note: core components rely on Apache Arrow and require version 0.13 or 0.14. The installation script pulls in version 0.14. If the newest version of Apache Arrow, 0.15, is present in the Conda environment, the installation will fail. I will be updating this project soon to account for these changes, likely in line with Apache Arrow's 1.0 release, when its internal structures will be more stable.
Install and build everything from source into your active Conda environment. Any missing dependencies are installed in Conda. Unit tests run automatically after installation.
conda create -n env
conda activate env
git clone https://github.com/kira-noel/fwfr.git
cd fwfr
./install.sh --source
If you want to use the C++ library without Python, the installation script also installs libfwfr.so and headers in $CONDA_PREFIX/{lib,include/fwfr}.
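With the library and headers installed, a C++ program can be compiled against them. A minimal sketch (the source file name and the exact set of link flags are assumptions; adjust for your setup):

```shell
# Hypothetical compile command: my_reader.cc is a placeholder source file.
# Link against libfwfr and Arrow, both installed under $CONDA_PREFIX.
g++ my_reader.cc -std=c++11 \
    -I$CONDA_PREFIX/include \
    -L$CONDA_PREFIX/lib \
    -lfwfr -larrow \
    -o my_reader
```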
Note: if you modify setup.py (the file used to build the Python bindings), you cannot use distutils; use setuptools instead, since wheel depends on it.
Main Python module.
Options for parsing FWF data.
field_widths: int list, required
The number of bytes in each field in a column of FWF data.
ignore_empty_lines: bool, optional (default True)
Whether empty lines are ignored in FWF input.
skip_columns: int list, optional (default empty)
Indexes of columns to skip on read-in.
import pyfwfr as pf
parse_options = pf.ParseOptions([6, 6, 6, 4], ignore_empty_lines=True, skip_columns=[0, 1, 6])
parse_options.field_widths # displays [6, 6, 6, 4]
parse_options.field_widths = [4, 4, 4, 4]
parse_options.field_widths # displays [4, 4, 4, 4]
Options for reading FWF data.
encoding: string, optional (default "")
The encoding of the input data, if any. Encoding names are flexible (case, dashes, etc.);
see https://demo.icu-project.org/icu-bin/convexp for a list of supported aliases. EBCDIC note:
append ',swaplfnl' to the encoding name ('cp1047' --> 'cp1047,swaplfnl'), since EBCDIC encodings swap the order of carriage return and newline.
use_threads: bool, optional (default True)
Whether to use multiple threads to accelerate reading.
block_size: int, optional (default 1MB)
How many bytes to process at a time from the input stream. This determines multi-threading granularity as well as the size of individual chunks in the table.
skip_rows: int, optional (default 0)
Number of rows to skip at the beginning of the input stream.
column_names: list, optional
Column names (if empty, will attempt to read from first row after 'skip_rows').
import pyfwfr as pf
read_options = pf.ReadOptions(encoding="cp500,swaplfnl", use_threads=True, block_size=1024)
Options for converting FWF data.
column_types: dict, optional
Map column names to column types (disables type inference on those columns).
is_cobol: bool, optional (default False)
Whether to check for COBOL-formatted numeric types. Uses values provided in pos_values and neg_values
for the conversion.
pos_values: dict, optional (default mapping provided)
COBOL values for interpreting positive numeric values.
neg_values: dict, optional (default mapping provided)
COBOL values for interpreting negative numeric values.
null_values: list, optional
A sequence of strings that denote nulls in the data (defaults are appropriate in most cases).
true_values: list, optional
A sequence of strings that denote true booleans in the data (defaults are appropriate in most cases).
false_values: list, optional
A sequence of strings that denote false booleans in the data (defaults are appropriate in most cases).
strings_can_be_null: bool, optional (default False)
Whether string/binary columns can have null values. If true, then strings in null_values are considered null for string columns. If false, then all strings are valid string values.
import pyfwfr as pf
convert_options = pf.ConvertOptions()
Read a Table from a stream of FWF data. Must set parse_options.field_widths!
input_file: string, path or file-like object
parse_options: fwf.ParseOptions, required
read_options: fwf.ReadOptions, optional
convert_options: fwf.ConvertOptions, optional
memory_pool: MemoryPool, optional
import pyfwfr as pf
parse_options = pf.ParseOptions([6, 6, 6, 4])
read_options = pf.ReadOptions(encoding="big5")
table = pf.read_fwf(filename, parse_options, read_options=read_options)
Return absolute path to libfwfr.so, the C++ base library.
Return absolute path to C++ headers.
Current included tests:
- test_big: threaded-read a large (big enough to use chunker) UTF8 dataset.
- test_big_encoded: threaded-read a large (big enough to use chunker) big5-encoded dataset.
- test_cobol: ensure column type and conversion for numeric COBOL-formatted dataset.
- test_convert_options: set and get all ConvertOptions.
- test_header: parse header for column names.
- test_no_header: get column names from column_names option instead of first row.
- test_nulls_bools: read null and boolean values with leading/trailing whitespace.
- test_parse_options: set and get all ParseOptions.
- test_read_options: set and get all ReadOptions.
- test_serial_read: read table serially.
- test_skip_columns: have the parser skip the specified columns.
- test_small: threaded-read a small UTF8 dataset.
- test_small_encoded: threaded-read a small big5-encoded dataset.
python -m unittest pyfwfr.tests.test_fwf -v