
Input Checker Tool

Background

An ABM's performance is highly correlated with the quality and accuracy of its inputs. A problem in an input can affect the accuracy of the ABM outputs or can result in a run-time or logical error during the model run. ABMs have a very complex design and use various software packages (Visum, Python, Java, R, etc.) for different sub-modules. As a result, debugging an input-related error can be a time-consuming and cumbersome process. Therefore, data must be prepared with utmost care to resolve most issues before starting the model run. However, even with the best of manual QA/QC efforts, it is not always possible to keep the input data 100% error-free. In particular, with frequent input updates for future-year scenarios, sensitivity tests, and software updates, the task of performing manual QA/QC checks on inputs becomes even more difficult. In this context, a process to automate the QA/QC checks on model inputs is highly desirable. The goal of these automated checks is to catch input problems before the model run so that the run completes without errors and produces meaningful outputs. The Input Checker Tool was developed with this goal in mind. The following sections give details of the setup and application of this tool.

Input Checker Implementation

The Input Checker Tool (inputChecker) is implemented in Python and makes heavy use of the pandas and numpy packages. The main inputs to inputChecker are a list of ABM input tables, a list of QA/QC checks to be performed on these tables, and the actual ABM inputs in CSV format. All CSV inputs are read as pandas data frames. The input checks are specified as pandas expressions that operate on these data frames. The inputChecker evaluates each expression, works through the full list of checks, and generates a LOG file summarizing the results of all input checks.
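As a minimal sketch of this mechanism (the CSV path below is assumed for illustration, not taken from the actual setup), a check written as a pandas expression string can be evaluated against a loaded data frame with Python's built-in eval():

```python
import pandas as pd

# Load an input table (path assumed for illustration)
households = pd.read_csv("inputChecker/inputs/households.csv")

# A check specified as a pandas expression string in the checks file
expression = "households.np > 0"   # household size must be positive

# Evaluate the expression against the loaded table(s)
result = eval(expression, {"households": households})
print(bool(pd.Series(result).all()))  # True if the check passes for every record
```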

The elements of the inputChecker directory structure are described in the table below:

| Directory/File | Description |
|---|---|
| config directory | Contains the list of inputs (inputs_list.csv), the list of checks (inputs_checks.csv) and a settings file (settings.csv) |
| inputs directory | All the inputs specified in the inputs list are exported or copied to this directory |
| logs directory | The log and summary files from different runs are written to this directory |
| scripts directory | Contains the main inputChecker Python script (inputChecker.py) |
| RunInputChecker.bat | The batch file to run inputChecker |

The RunInputChecker.bat DOS batch file is called by the RunModel.bat DOS batch file to run the inputChecker at the beginning of each SOABM run. The user can also launch the inputChecker independently by simply double-clicking RunInputChecker.bat. However, the inputChecker working directory must be inside the SOABM working directory so that inputs can be read from the appropriate input sub-directories.

Process Overview

An inputChecker run includes the following steps:

1. Read Inputs

The inputChecker first assembles all inputs specified in the inputs list into the inputChecker/inputs directory (exporting or copying them as described under Configuring Input Checker). All assembled inputs are then loaded as pandas data frames.
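A simplified sketch of this step, using the column names described under Configuring Input Checker and ignoring the Visum export and column-renaming logic, might look like this:

```python
import os
import pandas as pd

# Read the list of inputs and drop commented-out rows (table name starts with "#")
inputs_list = pd.read_csv("config/inputs_list.csv")
inputs_list = inputs_list[~inputs_list["Table"].astype(str).str.startswith("#")]

# Load every assembled CSV input as a pandas data frame, keyed by table name
tables = {}
for _, row in inputs_list.iterrows():
    csv_path = os.path.join("inputs", row["Table"] + ".csv")
    tables[row["Table"]] = pd.read_csv(csv_path)
```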

2. Run Checks

Next, the list of input checks is read. The inputChecker loops through the list and evaluates each check, sending the result to the logging module. The user must specify the severity level of each check as Fatal, Logical or Warning.
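Continuing the sketch above, and using the token names described under Specifying checks, the evaluation loop could look roughly like this (the actual implementation also handles Test_Vals, report statistics and error trapping):

```python
checks = pd.read_csv("config/inputs_checks.csv")
checks = checks[~checks["Test"].astype(str).str.startswith("#")]

results = {}                 # results of Test operations
namespace = dict(tables)     # input tables plus Calculation results

for _, check in checks.iterrows():
    out = eval(check["Expression"], namespace)
    if check["Type"] == "Calculation":
        # store the result so later expressions can reference it by Test name
        namespace[check["Test"]] = out
    else:
        results[check["Test"]] = {
            "passed": bool(pd.Series(out).all()),
            "severity": check["Severity"],
        }
```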

3. Run Self Diagnostics

Besides the checks specified by the user, the inputChecker also performs self-diagnostics to check for missing values in the inputs. The severity level of these automated missing-value checks is set via the config/settings.csv file.
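A minimal sketch of such a diagnostic is shown below (the settings file only controls the severity level assigned to these findings, so it is omitted here):

```python
# Flag missing values in every loaded input table
for name, table in tables.items():
    missing = table.isnull().sum()
    missing = missing[missing > 0]
    if not missing.empty:
        print(f"{name}: columns with missing values\n{missing}")
```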

4. Generate LOG File and Return Error Status

The final step is to generate the inputChecker log file, which includes the results of all checks. Failed checks are listed first, ordered by the severity level specified for each test. A summary of inputChecker results is also generated; it is read by the RunModel.bat DOS batch file to produce a reminder message for the user at the end of the SOABM run. An appropriate exit code is returned depending on the outcome of the inputChecker run. The table below describes the possible outcomes and the associated exit codes:

| inputChecker End State | Exit Code |
|---|---|
| inputChecker ran successfully with no fatal check failures | 0 |
| inputChecker did not run successfully due to errors | 1 |
| inputChecker ran successfully with at least one fatal check failure | 2 |

With a return code of 0, the RunModel.bat DOS batch file resumes the SOABM run and generates a reminder message at the end to check the inputChecker log file. If the inputChecker errors out, the model run is aborted. If the inputChecker completes with at least one fatal check failure, the RunModel.bat DOS batch file aborts the SOABM run and the user is directed to check the inputChecker log file.
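Continuing the sketch from step 2, the exit-code logic in the table above could be expressed as follows (a simplified illustration, not the actual implementation):

```python
import sys

def determine_exit_code(ran_successfully, results):
    if not ran_successfully:
        return 1  # inputChecker itself errored out
    fatal_failures = [name for name, r in results.items()
                      if not r["passed"] and r["severity"] == "Fatal"]
    return 2 if fatal_failures else 0

sys.exit(determine_exit_code(True, results))
```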

Configuring Input Checker

Configuring inputChecker involves specifying inputs and checks. This section describes the configuration details of the two settings files: config/inputs_list.csv and config/inputs_checks.csv.

Specifying inputs

Inputs on which QA/QC checks are to be performed are specified in the config/inputs_list.csv file. Each row in inputs_list.csv represents an ABM input. The attributes that the user must specify for each input are described in the table below:

| Attribute | Description |
|---|---|
| Table | The name of the input table. The inputs are loaded into inputChecker memory as data frames under this name. For CSV inputs, this must match the CSV file name. |
| Directory | The location of the input file: the SOABM inputs directory or the SOABM uec directory |
| Visum_Object | The name of the Visum object whose attributes must be exported. Must be specified as 'NA' for CSV inputs |
| Input_ID_Column | The name of the unique ID column. inputChecker creates an ID column with the specified name if the column is missing from the input table |
| Fields | The list of attributes to be exported from the Visum network object. All fields are read for CSV inputs |
| Column_Map | A column name map can be specified if some columns must be renamed for easier reference |
| Input_Description | A description of the input file |

All inputs must be in CSV format, but some ABM inputs may not be available as CSV files. In particular, network-related inputs are usually embedded in a transportation modeling software database. For the Visum-based SOABM, the Visum version file, SOABM.ver, contains the multiple zone systems, all zonal attributes, the highway network, and the transit network. The export_csv module of inputChecker loads the SOABM.ver file and exports the attributes of the specified Visum network objects to the inputChecker/inputs directory in CSV format. The inputChecker assumes that the SOABM.ver file exists within the input sub-directory of the SOABM working directory. The user must specify each input either as a Visum object (e.g., Visum.Net.Links) or as a CSV file in the inputs or uec sub-directories. CSV inputs are copied from the specified sub-directory to the inputChecker/inputs directory. Columns are renamed as per the user specification, and an ID column is generated if one is not specified.
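A heavily simplified sketch of such an export, assuming the Visum COM interface is available through pywin32, is shown below; the attribute list and output file name are illustrative only:

```python
import csv
import win32com.client

# Launch Visum through COM and load the SOABM version file
Visum = win32com.client.Dispatch("Visum.Visum")
Visum.LoadVersion(r"inputs\SOABM.ver")

# Export selected link attributes to CSV (attribute names are illustrative)
fields = ["NO", "FROMNODENO", "TONODENO", "LENGTH"]
rows = Visum.Net.Links.GetMultipleAttributes(fields)

with open(r"inputChecker\inputs\linksData.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(fields)
    writer.writerows(rows)
```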

The user has the option to comment out inputs that should not be loaded. To comment out a line in inputs_list.csv, add a "#" in front of the table name; all inputs whose table name starts with "#" are ignored by inputChecker.
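For illustration only, two rows of inputs_list.csv might look like the following; the values, and the exact syntax for field lists and column maps, are hypothetical and should be adapted to the actual SOABM setup:

```
Table,Directory,Visum_Object,Input_ID_Column,Fields,Column_Map,Input_Description
households,inputs,NA,hhid,,,Synthetic population household file
linksData,inputs,Visum.Net.Links,link_id,"NO,FROMNODENO,TONODENO,LENGTH",,Highway network links
```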

Specifying checks

The QA/QC checks to be performed on the ABM inputs are specified in the config/inputs_checks.csv file. Each row in inputs_checks.csv represents a specific operation to be performed on a specific input listed in inputs_list.csv. The operations are evaluated in the order in which they are listed in inputs_checks.csv. Each operation is classified as either a Test or a Calculation. For Test operations, the pandas expression is evaluated and the result is sent to the logging module of inputChecker. For Calculation operations, the pandas expression is evaluated and the result is stored as a Python object that can be referenced by subsequent operations. The table below describes the tokens that the user must specify for each Test or Calculation operation:

| Attribute | Description |
|---|---|
| Test | The name of the QA/QC check. The check results are referenced by this name in the log file. For Calculation operations, this becomes the name of the resulting object |
| Input_Table | The name of the input table on which the check is to be performed. This name must match the name specified under the Table token in inputs_list.csv |
| ID_Column | The unique ID column name. This must match the name specified under the Input_ID_Column token in inputs_list.csv |
| Severity | The severity level of the test: Fatal, Logical or Warning |
| Type | The type of operation: Test or Calculation |
| Expression | The pandas expression to be evaluated |
| Test_Vals | A list of values over which the test is repeated. The list must be comma separated. The test for each value is logged separately |
| Report_Statistic | Any additional statistic from the test that must be reported to the log file |
| Test_Description | A description of the check being performed |

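For illustration only, a Test row using these tokens might look like the following; the column order and quoting shown here are hypothetical:

```
Test,Input_Table,ID_Column,Severity,Type,Expression,Test_Vals,Report_Statistic,Test_Description
positive_household_size,households,hhid,Fatal,Test,households.np>0,,,Household size must be greater than zero
```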
Severity levels

An important step in specifying checks is assigning a severity level to each check. inputChecker allows the user to assign one of three severity levels to each QA/QC check: Fatal, Logical or Warning. Careful thought must be given when assigning a severity level to each check. Some general principles to help decide the severity level of a check are described below:

Fatal

The failure of a fatal check would result in a fatal runtime error in the ABM run. In case of a fatal failure, inputChecker returns an exit code of 2 to the main ABM procedure, causing the ABM run to halt. Therefore, only checks that are guaranteed to result in a fatal model error should be assigned the Fatal severity level.

Logical

The failure of these checks indicates logical inconsistencies in the inputs. With logical errors in inputs, the ABM outputs may not be very meaningful.

Warnings

The failure of a warning check indicates an issue in the input data that is not significant enough to cause a run-time error or affect model outputs. However, these checks might reveal other problems related to data processing or data quality.

Expressions

At the heart of an input data check is the pandas expression that is evaluated on an input data table. Each Test expression must evaluate to a single logical value (TRUE or FALSE) or to a vector of logical values; in other words, the Test expression must be a logical test. For most applications, this involves creating logical relationships such as equalities, inequalities and ranges using standard logical operators (AND, OR, EQUAL, GREATER THAN, LESS THAN, IN, etc.). For a vector result, the length of the result must equal the length of the input on which the check was performed. The result of a Calculation expression can be any Python data type to be used by a subsequent expression.

The success or failure of a check is decided based on the test result. For a single-value result, the check fails if the result is FALSE. For a vector result, the check is declared as failed if any value is FALSE. Therefore, the expression must be designed to evaluate to TRUE when there are no problems in the input data.
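A short sketch of this pass/fail rule (a simplified illustration, not the actual implementation):

```python
import pandas as pd

def check_passed(result):
    """Return True if a check result (a single logical value or a vector of
    logical values) indicates that the check passed."""
    # A scalar passes only if it is True; a vector passes only if every
    # element is True.
    return bool(pd.Series(result).all())

check_passed(True)                      # True
check_passed(pd.Series([True, False]))  # False: at least one record failed
```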

Conventions for writing expressions

Rules and conventions for writing inputChecker expressions are summarized below:

  • Expressions must be valid Python/pandas expressions
  • Expressions must be designed to evaluate to FALSE to indicate errors in the data
  • Each expression must evaluate to logical value(s)
  • Each expression must be applied to a valid input table specified in inputs_list.csv and may make use of intermediate tables created by preceding Calculation expressions
  • Expressions must use the same table names as specified in inputs_list.csv, or the Test name of a Calculation object
  • Expressions must use the same field names as specified in inputs_list.csv. If a column map was specified, the new names must be used
  • Expressions can be looped over a list of Test_Vals to reduce the number of expressions
  • The Report_Statistic must also be a valid Python/pandas expression and must evaluate to a single numeric value
  • Expressions can be commented out by adding a "#" in front of the Test name. All checks whose test name starts with "#" are ignored by inputChecker

Example expressions

Below are some example expressions for different types of checks.

Data completeness checks

Check if the household income field ('adjinc') exists in the input synthetic population:

'adjinc' in households.columns

To perform this check for multiple fields, write the expression as follows and specify the list of field names under the Test_Vals token (separated by commas):

test_val in households.columns
Boundary checks

Check if household size ('np') is greater than zero for each household

households.np>0
Pre-defined value checks

Check if each person's occupation code ('occp') matches the pre-defined occupation codes

persons.occp.apply(lambda x: True if x in [1,2,3,4,5,6,999] else False)

It is possible that all person records pass the above test but one of the occupation codes has no person records at all. To check for such cases, the following expression can be used:

set(persons.occp)=={1,2,3,4,5,6,999}
Consistency checks

Check if total employment across occupation categories sums to the total employment for each MAZ. Since this may result in a complex expression, it can be done in two steps. First, employment across all occupation types is summed using a Calculation expression:

maz_data[[col for col in maz_data if (col.startswith('EMP')) and not (col.endswith('TOTAL'))]].sum(axis=1)

The result of the above expression is a MAZ-level vector, maz_total_employment. Next, the total employment field can be compared against maz_total_employment:

maz_data.EMP_TOTAL==maz_total_employment
Order checks

Check if household IDs start from 1 and are sequential

(min(households.hhid)==1) & (max(households.hhid)==len(set(households.hhid)))
Logical checks

To ensure that ABM outputs are meaningful, it is important to perform logical checks on the input data. One such check is to compare the number of workers against the available jobs in each industry. While they may not match exactly, the difference should not exceed 10%. For this check, the number of workers and jobs by industry type must first be calculated, which can be achieved by a series of Calculation operations. Next, the check can be performed for each industry type separately:

0.9 < maz_occp_jobs[0]/person_occ_workers[1] < 1.1

Beyond the pass/fail result, the user might be interested in knowing the actual ratio of jobs to workers. Therefore, a Report_Statistic can be specified for this test as maz_occp_jobs[0]/person_occ_workers[1]
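For illustration, the preceding Calculation operations could be specified with expressions along the following lines; the 'pemploy' employment-status field and its codes are assumed here and are not part of the documented inputs. A Calculation named person_occ_workers could count workers by occupation code:

persons[persons.pemploy.isin([1,2])].groupby('occp').size()

and a Calculation named maz_occp_jobs could sum the jobs in each employment category over all MAZs:

maz_data[[col for col in maz_data if col.startswith('EMP') and not col.endswith('TOTAL')]].sum()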

Network checks

While most of the above checks apply to link and node level attributes, some checks might be unique to specific network objects. In Visum, for example, transit line route names must be unique, which requires a check on the transit line route data as follows:

len(set(lineroute_data.NAME)) == len(lineroute_data.NAME)

The design of network level checks will depend on the transportation modeling software being used.

Analyzing Input Checker Log

The final output from the inputChecker is a log file written to the inputChecker/logs directory. The log file is named inputCheckerLog[RUN_DATE].LOG and can be opened with any text editor. The results of all checks are summarized in this log file. The following sections describe the organization and details of the log file.

Organization

The log file summarizes results from all checks. However, the order in which they are presented depends on the severity level and the outcome of each check. inputChecker organizes the check results under the following headings:

  • IMMEDIATE ACTION REQUIRED: All failed FATAL checks are logged under this heading
  • ACTION REQUIRED: All failed LOGICAL checks are logged under this heading
  • WARNINGS: All failed WARNING checks are logged under this heading
  • LOG OF ALL PASSED CHECKS: A complete LOG of all passed checks
  • MISSING VALUE DIAGNOSTICS ON ALL INPUTS: All failed missing value self-diagnostics tests are logged under this section

Check LOG

A standard check log is generated for each check. The table below shows the elements of a check LOG:

| Attribute | Description |
|---|---|
| Input File Name | The name of the input file on which the check was evaluated |
| Input File Location | Path to the location of the input file |
| Visum Object | The name of the Visum object, if applicable |
| Input Description | The description of the input as specified in inputs_list.csv |
| Test Name | The name of the test as specified in inputs_checks.csv |
| Test Description | The description of the test |
| Test Severity | The severity level of the test |
| TEST RESULT | The result of the test: PASSED or FAILED |
| TEST results for Test_Vals | The test result for each Test_Val on which the test was repeated |
| Test Statistics | The value of the expression specified under the Report_Statistic token of inputs_checks.csv. The first 25 values are printed in case of a vector result |
| ID Column | The name of the unique ID column of the input data table |
| List of failed IDs | The first 25 IDs for which the test failed. Generated in case of a vector result |
| Number of failures | The total number of failures in case of a vector result |

Summary file

In addition to the log file, inputChecker also produces a text file (inputCheckerSummary.txt) with a summary of inputChecker failures by severity level. This file is read by the main ABM batch script to present the summary at the end of the model run.
