jayclassless/tabfilereader

Welcome

Overview

tabfilereader is a small library to make reading flat, tabular data from files a bit less tedious.

To use tabfilereader, you define a Schema, then use it to open a Reader. You can then iterate over the Reader to retrieve records from the file.

>>> import tabfilereader as tfr
>>> class MySchema(tfr.Schema):
...     column1 = tfr.Column('column_1')
...     column2 = tfr.Column('column_2', data_type=tfr.IntegerType(), data_required=True)
>>> reader = tfr.CsvReader.open('test/data/simple_header.csv', MySchema)
>>> for record, errors in reader:
...     print(record)
Record(column1='foo', column2=123)
Record(column1='bar', column2=None)

Schemas

Schema classes tell tabfilereader what columns to expect in the file, and what data types their values should be cast to. You create a schema by defining a class that inherits from tabfilereader.Schema. In this class, you define attributes that are instances of tabfilereader.Column, which specify where the columns are in the file and what their data types are. An example:

>>> import re
>>> class ExampleSchema(tfr.Schema):
...     first = tfr.Column('First Name')
...     last = tfr.Column('Last Name', data_required=True)
...     birthdate = tfr.Column(re.compile(r'^Birth.*'), data_type=tfr.DateType())
...     weight = tfr.Column('Weight', data_type=tfr.FloatType(), required=False)

Columns require at least one argument that tells tabfilereader how to find the column in the file. For files where the first record contains column names, you can specify one of the following:

  • The exact name of the column as a string.
  • An re.Pattern that will match the column name.
  • A sequence of strings or re.Pattern objects listing the names the column might have.

For files that do not contain a header record, you specify the column's location with a zero-based integer index.
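
As an illustration of the three name-based forms listed above, here is a minimal sketch of how a locator could be resolved against a header row. It is not tabfilereader's implementation: the `find_column` helper and the `Locator` alias are hypothetical names, and the choices of a case-sensitive string comparison and `Pattern.search` for regular expressions are assumptions made for the example.

```python
import re
from typing import Optional, Sequence, Union

# Hypothetical alias and helper for illustration; not part of tabfilereader.
Locator = Union[str, re.Pattern, Sequence[Union[str, re.Pattern]]]


def find_column(locator: Locator, header: Sequence[str]) -> Optional[int]:
    """Return the zero-based index of the first header cell matching locator."""
    # Normalise a single string or pattern into a one-item list of candidates.
    if isinstance(locator, (str, re.Pattern)):
        candidates = [locator]
    else:
        candidates = list(locator)

    for index, name in enumerate(header):
        for candidate in candidates:
            if isinstance(candidate, str):
                # Exact, case-sensitive comparison (an assumption).
                if candidate == name:
                    return index
            elif candidate.search(name):
                # Regular expression search (an assumption).
                return index
    return None


header = ['First Name', 'Last Name', 'Birth Date', 'Weight']
print(find_column('Last Name', header))                               # 1
print(find_column(re.compile(r'^Birth.*'), header))                   # 2
print(find_column(['DOB', re.compile(r'weight', re.I)], header))      # 3
print(find_column('Height', header))                                  # None
```

A sequence is handy when the same logical column arrives under different names from different sources.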

Columns also take a series of optional parameters:

required
To indicate whether or not it is required that this column exists in the file. Defaults to True.
data_required
To indicate whether or not the column must have a value for every record in the file. Defaults to False.
data_type
With this parameter, you can provide a callable that will receive a string value from the file and return a parsed and properly-typed value. If the value is invalid, the callable should raise a ValueError. tabfilereader provides an array of pre-defined Types that you can use here for the most common data types (numbers, dates, strings, etc.). See the API documentation for all the available pre-defined Types. This parameter defaults to tabfilereader.StringType() if not specified.
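
Because data_type accepts any callable with that contract, a custom parser can be an ordinary function. The sketch below is one possible example; the `yes_no` name and the set of accepted spellings are made up for illustration, and how tabfilereader treats empty values before calling the callable is not covered here.

```python
def yes_no(value: str) -> bool:
    """Parse yes/no style text into a bool, raising ValueError otherwise."""
    normalized = value.strip().lower()
    if normalized in ('yes', 'y', 'true', '1'):
        return True
    if normalized in ('no', 'n', 'false', '0'):
        return False
    # Signal an invalid value the way the data_type contract expects.
    raise ValueError(f'Not a recognised yes/no value: {value!r}')


print(yes_no(' Yes '))   # True
print(yes_no('n'))       # False
```

Going by the description above, such a function could then be passed to a Column as data_type=yes_no.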

There are also a handful of optional parameters that can be declared on the Schema itself. The available options are:

ignore_unknown_columns
To indicate what should be done if a Reader finds columns in the file that are not declared in the Schema. Defaults to False, which means the Reader will raise an exception.
ignore_empty_records
To indicate what should be done if a Reader encounters a record with no columns whatsoever. Defaults to False, which means the reader will return a record that is full of errors. This option is particularly useful for CSV files when people are a bit sloppy with their newlines at the end of a file.

To set these Schema-level options, pass them as keyword arguments in the class declaration:

>>> class SchemaWithOptions(tfr.Schema, ignore_unknown_columns=True):
...     column1 = tfr.Column('column_1')
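
To see where the empty records that ignore_empty_records deals with come from, the following sketch uses only Python's standard csv module on a file that ends with stray newlines. It shows the file-level effect only; whether tabfilereader reads CSV files through this module is an assumption, not something stated here.

```python
import csv
import io

# A small CSV with two stray newlines at the end of the file.
text = 'column_1,column_2\nfoo,123\n\n\n'

rows = list(csv.reader(io.StringIO(text)))
print(rows)
# [['column_1', 'column_2'], ['foo', '123'], [], []]

# Rows with no columns at all are the "empty records" in question.
non_empty = [row for row in rows if row]
print(len(rows), len(non_empty))   # 4 2
```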

Readers

Readers use the Schemas to interpret the contents of the tabular files. tabfilereader provides the following Readers to handle various types of files:

CsvReader
Handles Comma Separated Value files (or similarly constructed files, such as TSV).
ExcelReader
Handles Excel spreadsheets; either XLS- or XLSX-formatted.
OdsReader
Handles OpenDocument Format spreadsheets.

Readers can be created by either calling the open() classmethod on the specific Reader class you want to use, or by defining your own Reader class that inherits from one provided by tabfilereader like so:

>>> class MyReader(tfr.CsvReader):
...     schema = MySchema
...     delimiter = '|'

>>> reader = MyReader('test/data/simple_header_pipe.csv')

Each reader allows for a variety of optional parameters (like delimiter in the example above). See the API documentation for a full listing of the options for each.

Readers are iterable. Each iteration returns a tuple of two values. The first value is a Record that contains the values from the file. The second value is a collection of all the errors encountered when trying to parse the values in the columns.

>>> record, errors = next(reader)
>>> record.column1
'foo'
>>> record['column2']
123
>>> bool(errors)
False
>>> record, errors = next(reader)
>>> record.column1
'bar'
>>> record['column2'] is None
True
>>> bool(errors)
True
>>> errors['column2']
'A value is required'
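
Since each iteration yields a (record, errors) pair and the errors collection is falsy when nothing went wrong, a common pattern is to separate clean records from problematic ones in a single pass. The sketch below is one way to do that; the `partition` helper is a hypothetical name, and plain dicts stand in for the Record and errors objects so the example runs without a data file.

```python
from typing import Any, Iterable, List, Tuple


def partition(pairs: Iterable[Tuple[Any, Any]]) -> Tuple[List[Any], List[Tuple[int, Any]]]:
    """Split (record, errors) pairs into clean records and numbered error sets."""
    good = []
    bad = []
    for number, (record, errors) in enumerate(pairs, start=1):
        if errors:
            # Keep the 1-based record number so problems can be reported.
            bad.append((number, errors))
        else:
            good.append(record)
    return good, bad


# Stand-in data shaped like the example above; a real Reader would be
# passed to partition() instead.
sample = [
    ({'column1': 'foo', 'column2': 123}, {}),
    ({'column1': 'bar', 'column2': None}, {'column2': 'A value is required'}),
]

good, bad = partition(sample)
print(good)   # [{'column1': 'foo', 'column2': 123}]
print(bad)    # [(2, {'column2': 'A value is required'})]
```

The record number counts data records as they are yielded, so it will not line up with physical line numbers in the file when a header is present.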

License

This project is released under the terms of the MIT License.
