Changes to the schema should not be merged until all of the changes are covered by tests, and the tests pass.
Exceptions may be made for rapid iteration on pre-release changes, where test coverage/failure is documented, and expected to be resolved before a new version is released.
In summary, we test:
- Example data (in the docs) is valid and well-formed.
- The schema and codelists are well-formed and valid JSON.
- The constraints expressed in the JSON schema work as expected to validate data.
BODS v0.4 uses JSON Schema 2020-12 as its base metaschema, extended with some custom properties which are used to further constrain the BODS schema. These are:
- `codelist` (string): The filename of a .csv file in the BO Data Standard which defines the allowed values for this property.
- `openCodelist` (boolean): If true, the property can contain values beyond those defined in the codelist in the BO Data Standard. If false, the property is restricted to only the values defined in the codelist.
- `version` (string): The BODS schema version number.
- `propertyOrder` (integer): The order in which properties should be displayed for an optimised user experience. Properties whose values are not objects or arrays should be listed first.
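As a hypothetical sketch, a property definition inside the BODS schema using these custom keywords might look like the following. The property name, codelist filename, and enum values here are illustrative, not taken from the real schema files.

```python
# Hypothetical sketch of a BODS schema property definition using the
# custom metaschema keywords. Names and values are illustrative only.
property_definition = {
    "title": "Address type",
    "type": "string",
    "codelist": "addressType.csv",   # custom: CSV file defining allowed values
    "openCodelist": False,           # custom: values outside the codelist are not allowed
    "propertyOrder": 10,             # custom: display ordering hint
    "enum": ["registered", "service"],  # illustrative values only
}
```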
Properties from the extended metaschema should not be present in any BODS data, only in the schema. Therefore they don't need to be documented for data publishers, only for schema architects and developers.
The metaschema file is found at `data-standard/tests/schema/meta-schema.json`. As part of the data standard repository tests, the BODS schema files are validated against the metaschema.
Test files are found in the `data-standard/tests` directory.
Tests for the BODS schema are organised into:
- Schema tests: These validate the structure of the schema files (including codelist CSVs), and compliance with the metaschema.
- Data tests: These test the schema against valid and invalid sample data, to check that the schema constrains data as expected. Data for these tests is organised in several subdirectories under `data-standard/tests/data`.
- Docs tests: These test the data snippets and example data used in the data standard documentation, to make sure they are formatted correctly and are valid BODS data.
The tests are written using pytest. Fixtures for fetching files, loading the schema, creating a validator, and other helper functions can be found in `conftest.py`.
The tests and a flake8 code quality check are run automatically when a branch is pushed to the `data-standard` repository.
We use Black, isort and flake8 for code linting. Pull requests are automatically checked and must pass these checks before they can be merged.
Tests can be run in your local development environment (i.e. in a virtualenv, Docker container or similar) from inside the `data-standard` repository.
Make sure the test requirements are installed:
pip install -r requirements_test.txt
To run all the tests:
pytest tests/
To run one set of tests, e.g.:
pytest tests/test_schema.py
To run code linting:
flake8 tests/
(There is no output if all the code is conformant.)
and:
black tests/ --line-length=119
The tests in the data standard repository are present to validate that the JSON Schema works as expected. They are not there to validate data, and they do not test any requirements imposed on data by the data standard which are not enforced by the JSON Schema; those should be covered by a validation tool.
- If a schema file is added or removed, or an `$id` value is changed, the `schemas` variable needs to be updated in `test_schema.py`.
- If additional requirements are placed on how the schema is structured or formatted (e.g. letter case of fields, indentation), tests for these should be added to `test_schema.py`.
- New codelists are tested automatically; nothing needs to be added if these change.
If constraints are added to or removed from the JSON schema (e.g. a string field which previously had no maximum length now has a maximum length), valid and invalid test data should be added to the appropriate subdirectory in `tests/data/`.
Use one file per requirement, with the minimum contents needed to test only the requirement in question. This means that if any requirements change in future, there is a minimum amount to update in the test files.
Name the test files to make it clear which requirement is being tested.
After adding new files, run the tests (`pytest tests/test_data.py`) to check that they pass.
A minimum valid BODS entity statement looks like this:
[
  {
    "statementId": "2f7bf9370f1254068e5e946df067d07d",
    "declarationSubject": "xyz",
    "statementDate": "2017-11-18",
    "recordId": "123",
    "recordType": "entity",
    "recordDetails": {
      "entityType": {
        "type": "unknownEntity"
      },
      "isComponent": false
    }
  }
]
Start from a minimal statement like this, and add only the field you are testing. If you are testing a field in a nested object (e.g. `publicationDetails/publisher/name`) you may need to add more data to cover additional required fields (e.g. `publicationDetails/publicationDate`). Check the schema itself to find out which fields are required for the various objects.
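As an illustrative sketch, a valid test file for a nested field could be built by extending the minimal entity statement. The sibling fields shown under `publicationDetails` are assumptions about what the schema requires; check the schema itself for the authoritative list.

```python
# Sketch: building a valid test file for a nested field by extending the
# minimal statement. publicationDetails contents are assumed, not verified
# against the real schema.
import json

statement = {
    "statementId": "2f7bf9370f1254068e5e946df067d07d",
    "declarationSubject": "xyz",
    "statementDate": "2017-11-18",
    "recordId": "123",
    "recordType": "entity",
    "recordDetails": {"entityType": {"type": "unknownEntity"}, "isComponent": False},
    # Field under test, plus sibling fields assumed to be required:
    "publicationDetails": {
        "publicationDate": "2017-11-18",
        "publisher": {"name": "Example Publisher"},
    },
}

# Each test file is an array containing one statement.
test_file_contents = json.dumps([statement], indent=2)
```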
As with valid data, start from a minimal valid statement and add invalid values (or remove in the case of a required field) for only the field you are testing. There should be one validation error per file only. The test will fail if there is more than one.
We also have to test that the validation error is the one we expect, so we need to map the data files to the type and location of the error we're looking for. Do this by updating `expected_errors.csv` (in the same directory as the invalid data). The structure of this file is:
- file name (e.g. "entity_addressType_placeOfBirth.json")
- validation keyword (the type of error we expect, e.g. "enum")
- json path (the path to the location of the error in the data being tested, e.g. "$[0].recordDetails.addresses[0].type")
- property (the property in the data which is the subject of the test)
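Putting the four columns together, a single row of `expected_errors.csv` might look like this (the file name and path below reuse the examples above and are illustrative, not real test files):

```python
# Illustrative expected_errors.csv row matching the four columns described
# above: file name, validation keyword, json path, property.
import csv
import io

row = [
    "entity_addressType_placeOfBirth.json",
    "enum",
    "$[0].recordDetails.addresses[0].type",
    "type",
]

buf = io.StringIO()
csv.writer(buf).writerow(row)
csv_line = buf.getvalue().strip()
# entity_addressType_placeOfBirth.json,enum,$[0].recordDetails.addresses[0].type,type
```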
Validation keywords are from the JSON Schema standard, and are one of:
- `required`: the property is missing
- `type`: the value is the wrong data type
- `const`: the value is not the one required by the schema
- `enum`: the value is not one of a set required by the schema
- `multipleOf`: the value is not the multiple required
- `maximum`: the value is too high
- `exclusiveMaximum`: the value is too high
- `minimum`: the value is too low
- `exclusiveMinimum`: the value is too low
- `maxLength`: the value is too long
- `minLength`: the value is too short
- `pattern`: the value does not match the defined pattern
- `maxItems`: the array has too many items
- `minItems`: the array has too few items
- `uniqueItems`: the array contains duplicates
- `maxContains`: the array contains too many items of the type allowed by the `contains` subschema
- `minContains`: the array contains too few items of the type allowed by the `contains` subschema
- `maxProperties`: the object contains too many properties
- `minProperties`: the object contains too few properties
- `dependentRequired`: a property dependent on another property is missing
The validation keyword may sometimes need to be set to `oneOf`, `anyOf` or `allOf` if a value is constrained by multiple possible subschemas, rather than to the keyword of the actual validation taking place. Making this more precise is a todo.
The JSON path always begins with `$[0]` because each test file is an array of one statement. The path is separated by `.`. Elements in arrays are represented by `[0]`, `[1]`, etc. for the position of the error in the array. When the error relates to a missing required field, the JSON path ends at the parent. I.e. to test a missing `statementDate`, the JSON path is `$[0]` (because that is the location of the required field error), not `$[0].statementDate`.
The property is the name of the specific field you're testing. To test a missing `statementDate`, set this value to `statementDate`. To test an incorrect address type, set this to `type`.
- If example data is added or removed from the `/examples` directory, you don't need to make any changes to the tests; these files are picked up and validated automatically.