Commit

Merge pull request #156 from digital-land/gs/add-QA-processes
Gs/add qa processes
greg-slater authored Nov 25, 2024
2 parents 4e747b3 + 5301680 commit dffe584
Showing 7 changed files with 218 additions and 98 deletions.
@@ -0,0 +1,32 @@
# Endpoint URL Types and Plugins

The pipeline can collect data published in a wide range of different formats, which means there is a lot of variety in the types of URLs we might add as endpoints. Broadly, however, endpoints typically fall into one of the following two categories:
- Hosted file - these will usually be a URL ending in something like `.json` or `.csv`
- Standards-compliant web server - these will usually be identifiable by parts of the URL such as `MapServer` or `FeatureServer`, or by sections that look like query parameters, such as `?service=WFS&version=1.0.0`
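
Two hypothetical URLs illustrating the shapes described above (both made up for illustration):

```
# Hosted file: the URL points straight at a downloadable file
https://www.example-lpa.gov.uk/downloads/conservation-areas.geojson

# Web server: the URL points at a service that has to be queried
https://gis.example-lpa.gov.uk/arcgis/rest/services/Planning/ConservationAreas/FeatureServer/0/query?where=1=1&outFields=*&f=geojson
```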

## Data formats of resources that can be processed
- GeoJSON (preferred for geospatial data because the format mandates the WGS84 coordinate reference system)
- CSV text files containing WKT format geometry
- PDF/A
- Shapefiles
- Excel files containing WKT format geometry (xls, xlsx, xlsm, xlsb, odf, ods, odt)
- MapInfo
- Zip files containing MapInfo or Shapefiles
- GML
- GeoPackage
- OGC Web Feature Service
- ESRI ArcGIS REST service output in GeoJSON format

**Hosted files**
These can typically be added as they are with no problems. The pipeline can read most common formats and will transform them into the CSV format it needs if they’re not already supplied as CSV.

**Web servers**
Web server endpoints usually provide some flexibility over the format the data is returned in. The data provider may have shared a correctly configured URL which returns valid data, or they may have just provided a link to the server's service directory, which does not itself contain data we can process.

For example, this URL from Canterbury lists a number of planning-related layers available from their ArcGIS server:
`https://mapping.canterbury.gov.uk/arcgis/rest/services/External/Planning_Constraints_New/MapServer`

Depending on the endpoint, it may be necessary to either **edit the URL** to return valid data, or **use a plugin** to make sure the data is processed correctly. A plugin is typically needed for an API endpoint if the collector needs to paginate (e.g. the ArcGIS API typically limits to 1,000 records per fetch) or strip unnecessary content from the response (e.g. WFS feeds can sometimes contain access timestamps which can result in a new resource being created each day the collector runs).

>[!NOTE]
> Wherever possible, we prefer to collect from a URL which returns data in an open standard such as GeoJSON or WFS, rather than from the ArcGIS service directory page.
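
The sketch below makes the pagination case above more concrete, showing how a fetch from an ArcGIS REST layer might page through results. It is illustrative rather than the pipeline's actual plugin code: the layer id, the page size, and the assumption that the server supports GeoJSON output and result pagination are all hypothetical.

```python
import requests


def fetch_arcgis_layer(service_url, layer_id=0, page_size=1000):
    """Fetch every feature from one layer of an ArcGIS REST service,
    paging past the server's per-request record limit (commonly 1,000)."""
    query_url = f"{service_url}/{layer_id}/query"
    features = []
    offset = 0
    while True:
        params = {
            "where": "1=1",            # no filter: return every record
            "outFields": "*",          # all attribute fields
            "outSR": 4326,             # ask for WGS84 coordinates
            "f": "geojson",            # GeoJSON output, if the server supports it
            "resultOffset": offset,
            "resultRecordCount": page_size,
        }
        response = requests.get(query_url, params=params, timeout=60)
        response.raise_for_status()
        batch = response.json().get("features", [])
        features.extend(batch)
        if len(batch) < page_size:     # a short page means we have reached the end
            break
        offset += page_size
    return {"type": "FeatureCollection", "features": features}


# Hypothetical usage against the Canterbury service directory above
# (layer id 0 is an assumption):
# data = fetch_arcgis_layer(
#     "https://mapping.canterbury.gov.uk/arcgis/rest/services/External/"
#     "Planning_Constraints_New/MapServer"
# )
```

Editing the URL for an ArcGIS endpoint usually amounts to the same idea: pointing at a specific layer's `/query` endpoint and requesting GeoJSON, rather than using the service directory page itself.
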
40 changes: 20 additions & 20 deletions docs/data-operations-manual/Explanation/Operational-Procedures.md
@@ -6,34 +6,29 @@ These procedures can vary based on whether a dataset is national or compiled, wh

To help with this complexity, we've got a few levels of documentation to help:

1. This explanatory overview is at the highest level.
2. Below that, the tutorials section covers a range of different scenarios that can occur when [adding](Adding-Data) and [maintaining](Maintaining-Data) data and explain the procedure that should be followed in each one.
1. This explanatory overview is at the highest level, and gives a basic explanation of some of the key steps in the data lifecycle.
2. Below that, the tutorials section documents some of the standard processes we follow for each of these steps, and also explains some different scenarios that can occur when [adding](Adding-Data) and [maintaining](Maintaining-Data) data to make it clear how processes can vary slightly in different situations.
3. The procedure steps in the scenarios link to the most detailed level of documentation - the [how-to guides](How-to-guides) - which give step-by-step instructions for completing particular tasks.

## Validating data

When receiving data from the LPA, we need to first validate the data to check that it conforms to our data requirements.

Depending on the dataset, the LPAs usually use the [planning form](https://submit.planning.data.gov.uk/check/) to check if the data is good to go. They don't do that all the time though, so we still need to manually validate the data. However, the check tool does not yet work for Brownfield-land/site datasets so we always need to validate the data on our end.
Depending on the dataset, the LPAs usually use the [check service](https://submit.planning.data.gov.uk/check/) to check whether their data meets the specifications. But in most cases we still carry out validation checks ourselves before adding data.

Read the [how to validate an endpoint guide](Validate-an-endpoint) to see the steps we follow.

## Adding data

There are two main scenarios for adding data:

- Adding an endpoint for a new dataset and/or collection (e.g. we don't have the dataset on file at all)
- Adding a new endpoint to an existing dataset
- Adding a new endpoint to an existing dataset. This will usually be for a compiled, ODP dataset (e.g. adding a new endpoint from a Local Planning Authority to the `article-4-direction-area` dataset).

Based on this, the process is slightly different.
- Adding an endpoint for a new dataset and/or collection. This is usually for a national-scale dataset which is being added to the platform (e.g. adding `flood-storage-area` data from the Environment Agency to the platform for the first time).

A how-to on adding a new dataset and collection can be found [here](Add-a-new-dataset-and-collection).
The [adding data](../Tutorials/Adding-Data.md) page in the tutorials section explains the process we follow for each of these scenarios.

A how-to on adding a new endpoint to an existing dataset can be found [here](Add-an-endpoint). Endpoints can come in a variety of types. The format can differ from endpoint to endpoint as can the required plugins needed to process the endpoint correctly.

More information on types can be found [here](Endpoint-URL-Types-And-Plugins#data-formats-of-resources-that-can-be-processed)

More information on plugins can be found [here](Endpoint-URL-Types-And-Plugins#adding-query-parameters-to-arcgis-server-urls)
You may find it useful to read some of the Key Concepts documentation, in particular on [pipeline processes and the data model](../Explanation/Key-Concepts/pipeline-processes.md) and [endpoint types](../Explanation/Key-Concepts/Endpoint-types.md).

## Maintaining data

@@ -43,13 +38,14 @@ Maintaining data means making sure that the changes a data provider makes to the

All entries on the platform must be assigned an entity number in the `lookup.csv` for the collection. This usually happens automatically when adding a new endpoint through the `add-endpoints-and-lookups` script. However, when an endpoint is already on the platform but the LPA has indicated that the endpoint has been updated with a new resource and new entries, we can’t just re-add the endpoint. Instead, we assign the new entries their entity numbers differently.

A how-to on assigning entities can be found [here](Assign-entities)
The [maintaining data](../Tutorials/Maintaining-Data.md) page in Tutorials covers some of these different scenarios and the steps that should be followed for each.
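
Purely for illustration, a `lookup.csv` row broadly maps a provider's reference for a record to the entity number used on the platform. The column names and values in this sketch are assumptions rather than a definitive schema; check an existing collection's file for the real layout:

```
prefix,resource,organisation,reference,entity
article-4-direction-area,,local-authority:LBH,A4D-0001,6100101
```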


### Merging entities

There can be duplicates present in a dataset. This primarily takes place where multiple organisations are providing data against the same object (or entity). We do not automatically detect and remove these, the old-entity table is used to highlight these duplications and provide them under a single entity number.
There can be duplicates present in a dataset. This primarily takes place where multiple organisations are providing data about the same entity. We do not automatically detect and remove these. Instead, the `lookup.csv` for a dataset can be used to map data from different organisations to the same entity, or the `old-entity.csv` can be used to redirect information from one entity to another.

A how-to on merging entities can be found [here](Merge-entities)
Read the duplicate scenario in the [adding data](../Tutorials/Adding-Data.md) tutorial page, and the [how-to merge entities](../How-To-Guides/Maintaining/Merge-entities.md) page to learn more.
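
As a sketch of the redirect approach mentioned above, an `old-entity.csv` row points a duplicate entity number at the entity that should be kept. The column names and the status value here are assumptions; confirm against an existing collection before copying the pattern:

```
old-entity,status,entity
6100123,301,6100101
```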

## Retiring data

@@ -59,23 +55,27 @@ When an endpoint consistently fails, or LPAs give us a different endpoint (as op

When we retire an endpoint, we also need to retire the source(s) associated with it as sources are dependent on endpoints.

Read [how-to retire an endpoint](Retire-endpoints) to learn more.
Read [how-to retire an endpoint](../How-To-Guides/Retiring/Retire-endpoints.md) to learn more.

### Retiring resources

It won’t often be necessary to do this, but sometimes a resource should not continue to be processed and included on the platform. This can happen for multiple reasons, most commonly because the resource has been found to contain significant errors.

A how-to on retiring resources can be found [here](Retire-resources)
Read [how-to retire a resource](../How-To-Guides/Retiring/Retire-resources.md) to learn more.

### Retiring entities

**Note:** We may want to keep old entities on our platform as historical data. There are two reasons an entity might be removed:
**Note:** We usually want to keep old entities on our platform as historical data.

> **For example** a World Heritage Site was added as an entity to our platform. Although it is no longer a World Heritage Site, we want to retain the entity to indicate that it held this status during a specific period.
However, there are two situations when an entity might be removed:

1. It was added in error. In this case, we should remove it from our system.
2. It has been stopped for some reason. In this scenario, we should retain the entity.

For example, a World Heritage Site was added as an entity to our platform. Although it is no longer a World Heritage Site, we want to retain the entity to indicate that it held this status during a specific period.

Ideally, we would mark such entities with end-dates to indicate they have been stopped, but implementing this requires additional work.

In a given scenario, determine the reason why the entities are no longer present.
Check with Swati before deleting entities.
Original file line number Diff line number Diff line change
@@ -78,7 +78,7 @@
make add-data COLLECTION=conservation-area INPUT_CSV=import.csv
```
1. **(Optiona) Update entity-organisation.csv**
1. **(Optional) Update entity-organisation.csv**
If the data that has been added is part of the `conservation-area` collection, e.g. `conservation-area` and `conservation-area-document`, the entity range must be added as a new row. This is done using the entities generated in `lookup`. Use the first and the last of the entity numbers of the newly generated lookups, e.g. if `44012346` is the first and `44012370` the last, use these as `entity-minimum` and `entity-maximum`.
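
For illustration, using the example entity numbers above, the new `entity-organisation.csv` row might look something like the sketch below. The column names and the organisation value are assumptions, so check the existing file for the real layout:

```
dataset,organisation,entity-minimum,entity-maximum
conservation-area,local-authority:ABC,44012346,44012370
```
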
@@ -105,11 +105,15 @@
1. **Test locally**
Once the changes have been made and pushed, the next step is to test locally whether the changes have worked. Follow the steps in [building a collection locally](../Testing/Building-a-collection-locally.md).
1. **Push changes**
Use git to push changes up to the repository, each night when the collection runs the files are downloaded from here. It is a good idea to name the commit after the organisation you are importing.
1. **Push changes**
Commit your changes to a new branch that is named after the organisation whose endpoints are being added (use the 3 letter code for succinct names, e.g. `add-LBH-data`).
Push the changes on your branch to remote and create a new PR. This should be reviewed and approved by a colleague in the Data Management team before being merged into `main`.
Once the changes are merged they will be picked up by the nightly Airflow jobs, which will build an updated dataset.
1. **Run action workflow (optional)**
Optionally, you can manually execute the workflow that usually runs overnight yourself - if you don’t want to wait until the next day - to check if the data is actually on the platform. Simply follow the instructions in the [guide for triggering a collection manually](/data-operations-manual/How-To-Guides/Maintaining/Trigger-collection-manually).
Optionally, if you don’t want to wait until the next day, you can manually trigger the workflow that usually runs overnight to check whether the data is actually on the platform. Simply follow the instructions in the [guide for triggering a collection manually](/data-operations-manual/How-To-Guides/Maintaining/Trigger-collection-manually).
## Endpoint edge-cases