An extension to Apache Arrow based on their CSV reader for reading fixed-width files (tabular data where all fields in a column are padded to the same number of bytes) to Arrow tables. Written in C++, with Python bindings. Supports various encodings.
For an example of how to use this project, see notebooks/fwfr_sample_usage.ipynb.
Note: core components rely on Apache Arrow and require version 0.13 or 0.14. The installation script pulls in version 0.14. If the newest version of Apache Arrow, 0.15, is present in the Conda environment, the installation will fail. I will be updating this project soon to account for these changes, likely in line with Apache Arrow's 1.0 release, when its internal structures will be more stable.
Install and build everything from source into your active Conda environment. Any missing dependencies are installed in Conda. Unit tests run automatically after installation.
conda create -n env
conda activate env
git clone https://github.com/kira-noel/fwfr.git
cd fwfr
./install.sh --source
If you want to use the C++ library without Python, the installation script also installs libfwfr.so and headers in $CONDA_PREFIX/{lib,include/fwfr}.
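With the library and headers installed, a C++ program can be compiled against them. A minimal sketch (the source file name and the exact set of link flags are assumptions; adjust for your setup):

```shell
# Hypothetical compile command: my_reader.cc is a placeholder source file.
# Link against libfwfr and Arrow, both installed under $CONDA_PREFIX.
g++ my_reader.cc -std=c++11 \
    -I$CONDA_PREFIX/include \
    -L$CONDA_PREFIX/lib \
    -lfwfr -larrow \
    -o my_reader
```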
Note: if you modify setup.py (the file used to build the Python bindings), you cannot use distutils; use setuptools instead, since wheel depends on it.
Main Python module.
Options for parsing FWF data.
field_widths: int list, required
The number of bytes in each field in a column of FWF data.
ignore_empty_lines: bool, optional (default True)
Whether empty lines are ignored in FWF input.
skip_columns: int list, optional (default empty)
Indexes of columns to skip on read-in.
import pyfwfr as pf
parse_options = pf.ParseOptions([6, 6, 6, 4], ignore_empty_lines=True, skip_columns=[0, 1, 6])
parse_options.field_widths # displays [6, 6, 6, 4]
parse_options.field_widths = [4, 4, 4, 4]
parse_options.field_widths # displays [4, 4, 4, 4]
Options for reading FWF data.
encoding: string, optional (default "")
The encoding of the input data, if any. Encoding names are flexible (case, dashes, etc.);
see https://demo.icu-project.org/icu-bin/convexp for a list of supported aliases. EBCDIC note:
append ',swaplfnl' to the encoding name ('cp1047' --> 'cp1047,swaplfnl'), since EBCDIC encodings swap the order of carriage return and newline.
use_threads: bool, optional (default True)
Whether to use multiple threads to accelerate reading.
block_size: int, optional (default 1MB)
How many bytes to process at a time from the input stream. This determines multi-threading granularity as well as the size of individual chunks in the table.
skip_rows: int, optional (default 0)
Number of rows to skip at the beginning of the input stream.
column_names: list, optional
Column names (if empty, will attempt to read from first row after 'skip_rows').
import pyfwfr as pf
read_options = pf.ReadOptions(encoding="cp500,swaplfnl", use_threads=True, block_size=1024)
Options for converting FWF data.
column_types: dict, optional
Map column names to column types (disables type inference on those columns).
is_cobol: bool, optional (default False)
Whether to check for COBOL-formatted numeric types. Uses values provided in pos_values and neg_values
for the conversion.
pos_values: dict, optional (default mapping provided)
COBOL values for interpreting positive numeric values.
neg_values: dict, optional (default mapping provided)
COBOL values for interpreting negative numeric values.
null_values: list, optional
A sequence of strings that denote nulls in the data (defaults are appropriate in most cases).
true_values: list, optional
A sequence of strings that denote true booleans in the data (defaults are appropriate in most cases).
false_values: list, optional
A sequence of strings that denote false booleans in the data (defaults are appropriate in most cases).
strings_can_be_null: bool, optional (default False)
Whether string/binary columns can have null values. If true, then strings in null_values are considered null for string columns. If false, then all strings are valid string values.
import pyfwfr as pf
convert_options = pf.ConvertOptions()
Read a Table from a stream of FWF data. Must set parse_options.field_widths!
input_file: string, path or file-like object
parse_options: fwf.ParseOptions, required
read_options: fwf.ReadOptions, optional
convert_options: fwf.ConvertOptions, optional
memory_pool: MemoryPool, optional
import pyfwfr as pf
parse_options = pf.ParseOptions([6, 6, 6, 4])
read_options = pf.ReadOptions(encoding="big5")
table = pf.read_fwf(filename, parse_options, read_options=read_options)
Return absolute path to libfwfr.so, the C++ base library.
Return absolute path to C++ headers.
Current included tests:
- test_big: threaded-read a large (big enough to use chunker) UTF8 dataset.
- test_big_encoded: threaded-read a large (big enough to use chunker) big5-encoded dataset.
- test_cobol: ensure column type and conversion for numeric COBOL-formatted dataset.
- test_convert_options: set and get all ConvertOptions.
- test_header: parse header for column names.
- test_no_header: get column names from column_names option instead of first row.
- test_nulls_bools: read null and boolean values with leading/trailing whitespace.
- test_parse_options: set and get all ParseOptions.
- test_read_options: set and get all ReadOptions.
- test_serial_read: read table serially.
- test_skip_columns: have the parser skip the specified columns.
- test_small: threaded-read a small UTF8 dataset.
- test_small_encoded: threaded-read a small big5-encoded dataset.
python -m unittest pyfwfr.tests.test_fwf -v