inputChecker
An ABM's performance is highly correlated with the quality and accuracy of its inputs. A problem in an input can affect the accuracy of the ABM outputs or can result in a runtime or logical error during the model run. ABMs have a very complex design and use various software packages (Visum, Python, Java, R, etc.) for different sub-modules. As a result, debugging an input-related error can be a time-consuming and cumbersome process. Therefore, data must be prepared with the utmost care to resolve most issues before starting the model run. However, even with the best manual QA/QC efforts, it is not always possible to keep the input data 100% error-free. Specifically, with frequent input updates for future year scenarios, sensitivity testing, and software updates, the task of performing manual QA/QC checks on inputs becomes even more difficult. In this context, a process to automate the QA/QC checks on model inputs is highly desirable. The goal of these QA/QC checks is to prevent errors during the model run and to ensure meaningful outputs from it. The Input Checker Tool was developed with this goal in mind. The following sections give details of the setup and application of this tool.
The Input Checker Tool (inputChecker) is implemented in Python and makes heavy use of the pandas and numpy packages. The main inputs to inputChecker are a list of ABM input tables, a list of QA/QC checks to be performed on these tables, and the actual ABM inputs in CSV format. All CSV inputs are read as pandas data frames, and the input checks are specified as pandas expressions that operate on these data frames. The inputChecker evaluates each expression in the list of input checks and generates a LOG file summarizing the results of all checks.
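To make the mechanics concrete, below is a minimal sketch (not the actual inputChecker code) of this flow: input tables are loaded as pandas data frames and a check expression is evaluated against them by name. The table contents here are toy stand-ins.

```python
import pandas as pd

# Toy stand-in for a CSV input loaded by inputChecker
households = pd.DataFrame({"hhid": [1, 2, 3], "np": [2, 1, 4]})
tables = {"households": households}          # table name -> data frame

check = "households.np > 0"                  # a check expression from this page
result = eval(check, {}, tables)             # evaluates to logical value(s)

# A check passes only if every value in the result is True
print("PASSED" if bool(pd.Series(result).all()) else "FAILED")
```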
The inputChecker directory structure is shown below; each element is then described in the table that follows.
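```
inputChecker/
├── config/
│   ├── inputs_list.csv
│   ├── inputs_checks.csv
│   └── settings.csv
├── inputs/
├── logs/
├── scripts/
│   └── inputChecker.py
└── RunInputChecker.bat
```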
Directory/File | Description |
---|---|
config directory | Contains the list of inputs (inputs_list.csv), the list of checks (inputs_checks.csv), and a settings file (settings.csv) |
inputs directory | All inputs specified in the inputs list are exported or copied to this directory |
logs directory | Log and summary files from different runs are written to this directory |
scripts directory | Contains the main inputChecker Python script (inputChecker.py) |
RunInputChecker.bat | The batch file to run inputChecker |
The RunInputChecker.bat DOS batch file is called by the RunModel.bat DOS batch file to run the inputChecker at the beginning of each SOABM run. The user can also launch the inputChecker independently by double-clicking RunInputChecker.bat. However, the inputChecker working directory must be inside the SOABM working directory so that inputs can be read from the appropriate input sub-directories.
An inputChecker run includes the following steps:
The inputChecker first assembles all the inputs specified in the inputs list into the `inputChecker/inputs` directory (by exporting or copying them) and then loads them as pandas data frames.
Next, the list of input checks is read. The inputChecker loops through the list of input checks and evaluates each check. The result of each check is sent to the logging module. The user must specify the severity level of each check as Fatal, Logical, or Warning.
Besides the checks specified by the user, the inputChecker also performs self-diagnostics to check for missing values in the inputs. The severity level for these automated missing-value checks is set via the `config/settings.csv` file.
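Continuing the sketch above, the missing-value self-diagnostic can be pictured as follows; this is an illustrative outline, not the actual inputChecker implementation:

```python
# Scan every loaded table for missing values; the severity assigned to these
# automated checks comes from config/settings.csv
for name, table in tables.items():
    missing = table.isnull().sum()           # missing-value count per column
    for column, count in missing[missing > 0].items():
        print(f"{name}.{column}: {count} missing values")
```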
The final step is to generate the inputChecker log file. The log includes the results of all checks; failed checks are listed first, ordered by the severity level specified for each test. A summary of inputChecker results is also generated, to be read by the RunModel.bat DOS batch file so that a reminder message can be shown to the user at the end of the SOABM run. An appropriate exit code is returned depending on the outcome of the inputChecker run. The table below describes the various outcomes and the associated exit codes:
inputChecker End State | Exit Code |
---|---|
inputChecker ran successfully with no fatal check failures | 0 |
inputChecker did not run successfully due to errors | 1 |
inputChecker ran successfully with at least one fatal check failure | 2 |
With a return code of 0, the RunModel.bat DOS batch file resumes the SOABM run, and a reminder message to check the inputChecker log file is generated at the end. If the inputChecker errors out, the model run is aborted. If the inputChecker completes with at least one fatal check failure, the RunModel.bat DOS batch file aborts the SOABM run and the user is directed to check the inputChecker log file.
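The exit-code contract in the table above can be sketched in Python as follows; `run_all_checks` is a hypothetical driver function standing in for the real check loop:

```python
import sys

def run_all_checks():
    # Hypothetical driver: evaluate all checks and return the number of
    # failed Fatal checks (0 here, for illustration)
    return 0

try:
    fatal_failures = run_all_checks()
except Exception:
    sys.exit(1)  # inputChecker errored out: RunModel.bat aborts the model run
# 2 tells RunModel.bat to abort the run; 0 lets it resume
sys.exit(2 if fatal_failures > 0 else 0)
```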
Configuring inputChecker involves specifying inputs and checks. This section describes the configuration details of the two settings files: `config/inputs_list.csv` and `config/inputs_checks.csv`.
Inputs on which QA/QC checks are to be performed are specified in the `config/inputs_list.csv` file. Each row in `inputs_list.csv` represents an ABM input. The attributes that the user must specify for each input are described in the table below:
Attribute | Description |
---|---|
Table | The name of the input table. The inputs are loaded into inputChecker memory as data frames under this name. For CSV inputs, this must match the CSV file name |
Directory | The location of the CSV input file: the SOABM inputs directory or the SOABM uec directory |
Visum_Object | The name of the Visum object whose attributes must be exported. Must be specified as 'NA' for CSV inputs |
Input_ID_Column | The name of the unique ID column. inputChecker creates an ID column with the specified name if the column is missing from the input table |
Fields | The list of attributes to be exported from the Visum network object. All fields are read for CSV inputs |
Column_Map | New column names can be specified if some columns must be renamed for easy reference |
Input_Description | The description of the input file |
All the inputs must be in CSV format. Some ABM inputs may not be available in CSV format; in particular, network-related inputs are usually embedded in a transportation modeling software database. For the Visum-based SOABM, the Visum version file, `SOABM.ver`, contains the multiple zone systems, all zonal attributes, the highway network, and the transit network. The `export_csv` module of inputChecker loads the `SOABM.ver` file and exports the attributes of the specified Visum network objects to the `inputChecker/inputs` directory in CSV format. The inputChecker assumes that the `SOABM.ver` file exists within the `input` sub-directory of the SOABM working directory. The user must specify each input either as a Visum object (e.g., Visum.Net.Links) or as a CSV file in the `inputs` or `uec` sub-directories. The CSV inputs are copied from the specified sub-directory to the `inputChecker/inputs` directory. Columns are renamed as per the user specification, and an ID column is generated if not specified.
The user has the option to comment out inputs that should not be loaded. To comment out a line in `inputs_list.csv`, add a "#" in front of the table name; all inputs whose table name starts with a "#" are ignored by inputChecker.
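For illustration, the snippet below sketches what `inputs_list.csv` entries might look like. The specific values (file names, ID columns, Visum attribute names) are illustrative assumptions, not entries copied from the SOABM setup:

```
Table,Directory,Visum_Object,Input_ID_Column,Fields,Column_Map,Input_Description
households,inputs,NA,hhid,,,Synthetic population household file
links,inputs,Visum.Net.Links,NO,"NO,FROMNODENO,TONODENO",,Highway network links
#persons,inputs,NA,perid,,,This input is commented out and will be ignored
```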
The QA/QC checks to be performed on the ABM inputs are specified in the `config/inputs_checks.csv` file. Each row in `inputs_checks.csv` represents a specific operation to be performed on a specific input listed in `inputs_list.csv`. The operations are evaluated in the same order as they are listed in `inputs_checks.csv`. Each operation is classified as either a `Test` or a `Calculation`. For `Test` operations, the pandas expression is evaluated and the result is sent to the logging module of inputChecker for logging. For `Calculation` operations, the pandas expression is evaluated and the result is stored as a Python object to be referenced by subsequent operations. The table below describes the various tokens that the user must specify for each `Test` or `Calculation` operation:
Attribute | Description |
---|---|
Test | The name of the QA/QC check. The check results are referenced using this name in the log file. For Calculation operations, this becomes the name of the resulting object |
Input_Table | The name of the input table on which the check is to be performed. This name must match the name specified under the Table token in inputs_list.csv |
ID_Column | The unique ID column name. This must match the name specified under the Input_ID_Column token in inputs_list.csv |
Severity | The severity level of the test: Fatal, Logical, or Warning |
Type | The type of operation: Test or Calculation |
Expression | The pandas expression to be evaluated |
Test_Vals | A list of values over which the test is to be repeated. The list must be comma-separated. The test for each value is logged separately |
Report_Statistic | Any additional statistic from the test that must be reported to the log file |
Test_Description | The description of the check that is being performed |
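For illustration, a `Test` row in `inputs_checks.csv` using one of the example expressions from later in this page might look like the following; the test name and description are illustrative:

```
Test,Input_Table,ID_Column,Severity,Type,Expression,Test_Vals,Report_Statistic,Test_Description
positive_household_size,households,hhid,Fatal,Test,households.np>0,,,Household size must be greater than zero
```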
An important step in specifying checks is assigning a severity level to each check. inputChecker allows the user to specify one of three severity levels for each QA/QC check: Fatal, Logical, or Warning. Careful thought must be given when assigning a severity level to each check. Some general principles to help decide the severity level of a check are described below:

- Fatal: The failure of a fatal check would result in a fatal runtime error in the ABM run. In case of a fatal failure, inputChecker returns an exit code of 2 to the main ABM procedure, causing the ABM run to halt. Therefore, only checks that are guaranteed to result in a fatal model error should be assigned the Fatal severity level.
- Logical: The failure of a logical check indicates logical inconsistencies in the inputs. With logical errors in the inputs, the ABM outputs may not be very meaningful.
- Warning: The failure of a warning check indicates an issue in the input data that is not significant enough to cause a runtime error or affect model outputs. However, these checks might reveal other problems related to data processing or data quality.
At the heart of an input data check is the pandas expression that is evaluated on an input data table. Each `Test` expression must evaluate to either a single logical value (`True` or `False`) or a vector of logical values; in other words, the `Test` expression must be a logical test. For most applications, this involves creating logical relationships such as equalities, inequalities, and ranges using standard logical operators (AND, OR, EQUAL, GREATER THAN, LESS THAN, IN, etc.; in pandas terms `&`, `|`, `==`, `>`, `<`, `.isin()`). For a vector result, the length of the result must be equal to the length of the input on which the check was performed. The result of a `Calculation` expression can be any Python data type to be used by a subsequent expression.
The success or failure of a check is decided based on the test result. In the case of a single-value result, the check fails if the result is `False`. In the case of a vector result, the test is declared failed if any value is `False`. Therefore, the expression must be designed to evaluate to `True` if there are no problems in the input data.
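The pass/fail logic for a vector result, together with the failed-ID reporting described in the log file section below, can be sketched as follows (illustrative code, not the inputChecker internals):

```python
import pandas as pd

# Toy input table and a vector-valued check result
households = pd.DataFrame({"hhid": [1, 2, 3], "np": [2, 0, 4]})
result = households.np > 0                   # one logical value per row

if result.all():
    print("TEST RESULT: PASSED")
else:
    # The log reports the first 25 IDs for which the test failed
    failed_ids = households.loc[~result, "hhid"].head(25).tolist()
    print("TEST RESULT: FAILED")
    print("Failed IDs:", failed_ids, "Number of failures:", int((~result).sum()))
```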
Rules and conventions for writing inputChecker expressions are summarized below:

- Expressions must be valid Python/pandas expressions
- Expressions must be designed to evaluate to `False` to indicate any errors in the data
- Each expression must evaluate to logical value(s)
- Each expression must be applied to a valid input table specified in `inputs_list.csv` and may make use of intermediate tables created by preceding `Calculation` expressions
- Expressions must use the same table names as specified in `inputs_list.csv`, or the Test name of the `Calculation` object
- Expressions must use the same field names as specified in `inputs_list.csv`. If a column map was specified, then the new names must be used
- Expressions can be looped over a list of `Test_Vals` to reduce the number of expressions
- The `Report_Statistic` must also be a valid Python/pandas expression and must evaluate to a single numeric value
- Expressions can be commented out by adding a "#" in front of the `Test` name. All checks whose test name starts with a "#" are ignored by inputChecker
Below are some example expressions for different types of checks:
Check if the household income field exists in the input synthetic population:

```python
'adjinc' in households.columns
```
To perform this check for multiple fields, write the expression as follows and specify the list of field names under the `Test_Vals` token (separated by commas):

```python
test_val in households.columns
```
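Internally, looping over `Test_Vals` can be pictured as substituting each listed value for `test_val` before the expression is evaluated; the sketch below is illustrative, not the actual implementation:

```python
import pandas as pd

households = pd.DataFrame({"hhid": [1], "adjinc": [50000], "np": [2]})  # toy data
expression = "test_val in households.columns"

# Illustrative Test_Vals entries; each value's result is logged separately
for test_val in ["adjinc", "np"]:
    result = eval(expression, {}, {"households": households, "test_val": test_val})
    print(f"test_val={test_val}: {'PASSED' if result else 'FAILED'}")
```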
Check if household size ('np') is greater than zero for each household:

```python
households.np > 0
```
Check if each person's occupation code ('occp') matches the pre-defined occupation codes:

```python
persons.occp.isin([1,2,3,4,5,6,999])
```
It is possible that all person records pass the above test, but one of the occupation codes may not have a single person record. To check for such cases, the following expression can be used:

```python
set(persons.occp) == {1,2,3,4,5,6,999}
```
Check if total employment across occupation categories sums to the total employment for each MAZ. Since this may result in a complex expression, it can be done in two steps. First, employment across all occupation types is summed using a `Calculation` expression:

```python
maz_data[[col for col in maz_data if col.startswith('EMP') and not col.endswith('TOTAL')]].sum(axis=1)
```
The result of the above expression is a MAZ-level vector, `maz_total_employment`. Next, the total employment field can be compared against `maz_total_employment`:

```python
maz_data.EMP_TOTAL == maz_total_employment
```
Check if household IDs start from 1 and are sequential:

```python
(min(households.hhid)==1) & (max(households.hhid)==len(set(households.hhid)))
```
To ensure that ABM outputs are meaningful, it is important to perform logical checks on the input data. One such check is to compare the number of workers against the available jobs in each industry. While the two may not match exactly, the difference should not exceed 10%. For this check, the number of workers and jobs by industry type must first be calculated, which can be achieved by a series of `Calculation` operations. Next, the check can be performed for each industry type separately:

```python
0.9 < maz_occp_jobs[0]/person_occ_workers[1] < 1.1
```
Beyond the test result, the user might be interested in the actual ratio of jobs to workers. Therefore, a `Report_Statistic` can be specified for this test as:

```python
maz_occp_jobs[0]/person_occ_workers[1]
```
While most of the above checks apply to link and node level attributes, some checks might be unique to specific network objects. In Visum, transit line route names must be unique. This requires performing a check on the transit line route data as follows:

```python
len(set(lineroute_data.NAME)) == len(lineroute_data.NAME)
```
The design of network level checks will depend on the transportation modeling software being used.
The final output from the inputChecker is a log file, which is written to the `inputChecker/logs` directory. The log file is named `inputCheckerLog[RUN_DATE].LOG` and can be opened using any text editor. The results of all checks are summarized in this log file. The following sections describe the organization and details of the log file.
The log file summarizes results from all checks. However, the order in which they are presented depends upon the severity level and the output of the check. inputChecker organizes the check results under the following headings:
- IMMEDIATE ACTION REQUIRED: All failed FATAL checks are logged under this heading
- ACTION REQUIRED: All failed LOGICAL checks are logged under this heading
- WARNINGS: All failed WARNING checks are logged under this heading
- LOG OF ALL PASSED CHECKS: A complete LOG of all passed checks
- MISSING VALUE DIAGNOSTICS ON ALL INPUTS: All failed missing value self-diagnostics tests are logged under this section
A standard check log is generated for each check. The table below shows the elements of a check LOG:
Attribute | Description |
---|---|
Input File Name | The name of the input file on which the check was evaluated |
Input File Location | The path to the location of the input file |
Visum Object | The name of the Visum object, if applicable |
Input Description | The description of the input as specified in inputs_list.csv |
Test Name | The name of the test as specified in inputs_checks.csv |
Test Description | The description of the test |
Test Severity | The severity level of the test |
TEST RESULT | The result of the test: PASSED or FAILED |
Test results for Test_Vals | The test result for each Test_Val on which the test was repeated |
Test Statistics | The value of the expression specified under the Report_Statistic token of inputs_checks.csv. The first 25 values are printed in the case of a vector result |
ID Column | The name of the unique ID column of the input data table |
List of failed IDs | The first 25 IDs for which the test failed. This is generated in the case of a vector result |
Number of failures | The total number of failures in the case of a vector result |
In addition to the log file, inputChecker also produces a text file (`inputCheckerSummary.txt`) with a summary of inputChecker failures by severity level. This file is read by the main ABM batch script to present the summary at the end of the model run.