Ees 4945 create data set version mappings 3 #5000

duncan-at-hiveit · 2024-06-24T16:31:45Z

Overview

This PR handles the creation of the mapping "diffs" that let us map specific metadata elements of one data set version to the next. Currently these are Locations and Filters / Filter Options.

The original design of this diff structure was proposed in this document. It has changed slightly during the implementation here but not greatly.

In order to do this, this PR:

Adds a new "CreateMappings" function to create initial empty mappings from an original to a target data set version.
Adds a new "ApplyAutoMapping" function to apply any auto-mapping that we're able to perform before handing over to the user for any manual mapping activity.

JSON example of generated mappings

Locations

[
  {
    "Level": "LocalAuthority",
    "Mappings": [
      {
        "Type": "AutoMapped",
        "Source": {
          "Label": "LA location 1 label",
          "Key": "LA location 1 key"
        },
        "CandidateKey": "LA location 1 key"
      }
    ],
    "Candidates": [
      {
        "Label": "LA location 1 label",
        "Key": "LA location 1 key"
      },
      {
        "Label": "LA location 2 label",
        "Key": "LA location 2 key"
      }
    ]
  },
  {
    "Level": "RscRegion",
    "Mappings": [],
    "Candidates": [
      {
        "Label": "RSC location 1 label",
        "Key": "RSC location 1 key"
      }
    ]
  }
]

Filters

{
  "Mappings": [
    {
      "OptionMappings": [
        {
          "Type": "AutoMapped",
          "Source": {
            "Label": "Filter 1 option 1 label",
            "Key": "Filter 1 option 1 key"
          },
          "CandidateKey": "Filter 1 option 1 key"
        }
      ],
      "Type": "AutoMapped",
      "Source": {
        "Label": "Filter 1 label",
        "Key": "Filter 1 key"
      },
      "CandidateKey": "Filter 1 key"
    }
  ],
  "Candidates": [
    {
      "Label": "Filter 1 label",
      "Options": [
        {
          "Label": "Filter 1 option 1 label",
          "Key": "Filter 1 option 1 key"
        },
        {
          "Label": "Filter 1 option 2 label",
          "Key": "Filter 1 option 2 key"
        }
      ],
      "Key": "Filter 1 key"
    },
    {
      "Label": "Filter 2 label",
      "Options": [
        {
          "Label": "Filter 2 option 1 label",
          "Key": "Filter 2 option 1 key"
        }
      ],
      "Key": "Filter 2 key"
    }
  ]
}
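
As a rough illustration of how the "Locations" JSON above could be read back into C#, here is a minimal sketch using System.Text.Json. The type names (MappableElement, ElementMapping, LocationLevelPlan, MappingJson) are hypothetical and only mirror the property names in the example. They are not the classes added in this PR, and "Type" is kept as a plain string to avoid assuming anything about enum serialisation.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical type names; they only mirror the property names in the JSON above.
public record MappableElement(string Label, string Key);

// CandidateKey is nullable because an unmapped element has no candidate yet.
public record ElementMapping(string Type, MappableElement Source, string? CandidateKey);

public record LocationLevelPlan(
    string Level,
    List<ElementMapping> Mappings,
    List<MappableElement> Candidates);

public static class MappingJson
{
    // Parses the "Locations" example: a JSON array with one entry per geographic level.
    public static List<LocationLevelPlan> ParseLocations(string json) =>
        JsonSerializer.Deserialize<List<LocationLevelPlan>>(json)
        ?? new List<LocationLevelPlan>();
}
```

Calling MappingJson.ParseLocations with the Locations example gives one LocationLevelPlan per geographic level, each carrying its own Mappings and Candidates. The Filters example would need an extra "Options" list on each candidate and an "OptionMappings" list on each mapping, which this sketch leaves out.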

Auto-mapping implementation

The auto-mapping process in this PR is very simple. When the empty mappings are first created in the CreateMappingsFunction, each mappable element from the source and target data set versions is allocated a "Key" that is unique enough to identify it within its context in the JSON structure. For example, the label "Free school meals" is enough to uniquely identify a Filter Option within the context of its owning Filter, so it can be used as the Key. Location Options, by contrast, need any applicable Codes as well as the label to guarantee uniqueness, so we use the LocationMeta's RowKey as a Location Option's Key value.

The auto-mapping process then visits each source element that needs to be mapped and looks for a candidate element with the same Key. Again, this is done within the correct context of the source element. For example, when looking to map the Local Authority "Sheffield", we only look for a candidate Location Option with Key "Sheffield" among the Local Authorities of the target data set version. Similarly, when looking for a matching Key for a Filter Option, we only look at the Filter Options belonging to the candidate Filter that has already been mapped to the source Filter Option's owning Filter.
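
The key-matching step described above can be sketched roughly as follows. This is a simplified illustration rather than the service code in this PR: Element, Mapping and KeyAutoMapper are hypothetical names, and the caller is assumed to pass only the candidates from the correct context (for example, a single geographic level).

```csharp
#nullable enable
using System.Collections.Generic;
using System.Linq;

// Only the mapping types needed for this sketch.
public enum MappingType { None, AutoMapped, AutoNone }

// Hypothetical stand-in for a mappable element such as a Location Option.
public record Element(string Label, string Key);

public class Mapping
{
    public Mapping(Element source) { Source = source; }
    public Element Source { get; }
    public MappingType Type { get; set; } = MappingType.None;
    public string? CandidateKey { get; set; }
}

public static class KeyAutoMapper
{
    // Maps each source element to the candidate with the same Key, if one exists
    // within the supplied context. Candidate keys are assumed to be unique.
    public static void AutoMap(IEnumerable<Mapping> mappings, IEnumerable<Element> candidates)
    {
        var candidatesByKey = candidates.ToDictionary(c => c.Key);

        foreach (var mapping in mappings)
        {
            if (candidatesByKey.TryGetValue(mapping.Source.Key, out var match))
            {
                mapping.Type = MappingType.AutoMapped;
                mapping.CandidateKey = match.Key;
            }
            else
            {
                // No candidate with a matching Key in this context.
                mapping.Type = MappingType.AutoNone;
                mapping.CandidateKey = null;
            }
        }
    }
}
```

With a source mapping for "Sheffield" and a Local Authority candidate list that also contains "Sheffield", the mapping ends up as AutoMapped with CandidateKey "Sheffield". A source element with no same-keyed candidate ends up as AutoNone.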

MappingType.None vs AutoNone vs ManualNone?

When the mapping structure is first created, prior to running auto-mapping, all of the potential mappings begin with a Type of "None". This indicates that nothing has been attempted with them yet.

Once auto-mapping has run, the service will have set every mapping to either AutoMapped, where it found a likely candidate, or AutoNone, where it found none. We don't treat AutoNone mappings as "completed" mappings. Instead we wait until the user confirms them, at which point they become "ManualNone" and we consider the mapping "completed". In the UI, this action is the "No mapping" link in the prototypes.
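
A minimal sketch of the "completed" rule described above, under two assumptions that this description does not spell out: that a ManualMapped type exists alongside the ones named here, and that AutoMapped counts as complete without user confirmation. MappingCompletion is a hypothetical helper name.

```csharp
using System.Collections.Generic;
using System.Linq;

// None, AutoMapped, AutoNone and ManualNone come from the description above.
// ManualMapped is an assumption made for this sketch.
public enum MappingType { None, AutoMapped, AutoNone, ManualMapped, ManualNone }

public static class MappingCompletion
{
    // None and AutoNone are incomplete: nothing has been attempted yet, or the
    // service found no candidate and the user has not confirmed "No mapping".
    public static bool IsComplete(MappingType type) =>
        type is MappingType.AutoMapped or MappingType.ManualMapped or MappingType.ManualNone;

    // A plan is complete only when every one of its mappings is complete.
    public static bool IsPlanComplete(IEnumerable<MappingType> types) =>
        types.All(IsComplete);
}
```

Under these assumptions, a plan containing any None or AutoNone mapping stays incomplete until the user either picks a candidate or confirms "No mapping".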

Naming

Source and target data set versions

I refer in the code to "source" data set versions and "target" data set versions when inside the mapping code. Elsewhere these are generally referred to as "initial" and "next" versions, but source and target felt more natural in terms of the function of this code to me.

Source element

Source elements are the entities in the original (source) data set version that can be mapped to the next version. For example, a source element could be a particular Location from the source data set version, such as the Local Authority called "Sheffield".

Mappings

A mapping consists of a "Source" element (e.g. "Sheffield"), a "CandidateKey" holding the Key of the candidate element that the source is mapped to, and a "Type" that tells us whether this was an automatic or a manual mapping.

Candidates

A candidate is an element from the target data set version that could be the target for a mapping e.g. the Local Authority "Sheffield" in the target version could be a candidate for a mapping of any Local Authority Locations in the source data set version.

Diffs to plans naming convention

I've moved away from "diffs" in favour of "plans". The backend and the user work to build a mapping plan for each facet of the data set versions that can be mapped (Locations / Filters / Filter Options) and at the end of the process, the plan will be carried out!

JSON structure differences from original proposal

Location mappings and candidates now grouped under their respective geographic levels

In the original proposal, we had the top-level location diff structure modelled as below, where mappings and targets (now candidates) were on the same level and each one had many elements under them grouped by geographic level:

{
    "mappings": [
        {
            "level": "LocalAuthority",
            "options": [location1, location2, location3]
        },
        {
            "level": "RSCRegion",
            "options": [region1, region2, region3]
        }
    ],
    "targets": [
        {
            "level": "LocalAuthority",
            "options": [location1, location2, location3]
        },
        {
            "level": "RSCRegion",
            "options": [region1, region2, region3]
        }
    ]
}

I've now flipped this so that "mappings" and "candidates" (formerly "targets") are grouped under their respective geographic levels thusly:

[
    {
        "level": "LocalAuthority",
        "mappings": [location1, location2, location3],
        "candidates": [location1, location2, location3]
    },
    {
        "level": "RSCRegion",
        "mappings": [region1, region2, region3],
        "candidates": [region1, region2, region3]
    }
]

Whilst working with it, it just felt like it made more sense to have everything encapsulated on a level-by-level basis like this.

jack-hive · 2024-06-26T11:43:00Z

...on.ExploreEducationStatistics.Public.Data.Processor/Services/DataSetVersionMappingService.cs

+            .ForEach(filterMapping => AutoMapParentAndOptions(
+                parentMapping: filterMapping,
+                parentCandidates: filtersPlan.Candidates,
+                candidateOptionsSupplier: autoMappedCandidate => autoMappedCandidate.Options));


I was going to say maybe we could do away with the candidateOptionsSupplier argument, as you will always be getting the options on the candidate. But are we doing this to keep the flexibility for when we start mapping other facets, which may or may not have the same JSON structure?

Yeah it's useful for FilterOptions because in order to work out which candidates are valid, you firstly have to see which Filter candidate is mapped to the FilterOption's owning Filter, and then draw the available FilterOption candidates from that. It's simpler with LocationOptions, because both Mappings and Candidates are grouped under their GeoLevels, so we already know which candidates should be available immediately.

makes sense!
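
To illustrate the point of this thread, here is a simplified sketch of how a candidateOptionsSupplier argument lets the option candidates be drawn from whichever parent candidate was mapped. It is not the AutoMapParentAndOptions implementation from the PR: IKeyed, Option, Filter, Mapping, ParentMapping and ParentAutoMapper are hypothetical stand-ins, and mapping types are plain strings to keep the example short.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the PR's candidate types.
public interface IKeyed { string Key { get; } }
public record Option(string Label, string Key) : IKeyed;
public record Filter(string Label, string Key, List<Option> Options) : IKeyed;

public class Mapping
{
    public Mapping(string sourceKey) { SourceKey = sourceKey; }
    public string SourceKey { get; }
    public string Type { get; set; } = "None";
    public string? CandidateKey { get; set; }
}

public class ParentMapping : Mapping
{
    public ParentMapping(string sourceKey, params string[] optionKeys) : base(sourceKey)
    {
        OptionMappings = optionKeys.Select(k => new Mapping(k)).ToList();
    }
    public List<Mapping> OptionMappings { get; }
}

public static class ParentAutoMapper
{
    public static void AutoMapParentAndOptions<TParent>(
        ParentMapping parentMapping,
        IEnumerable<TParent> parentCandidates,
        Func<TParent, IEnumerable<IKeyed>> candidateOptionsSupplier)
        where TParent : class, IKeyed
    {
        // First map the parent (e.g. the Filter) by Key.
        var parent = parentCandidates.FirstOrDefault(c => c.Key == parentMapping.SourceKey);
        Apply(parentMapping, parent);

        // Options can only be matched against the options of the mapped parent candidate.
        var optionCandidates = parent is null
            ? new List<IKeyed>()
            : candidateOptionsSupplier(parent).ToList();

        foreach (var optionMapping in parentMapping.OptionMappings)
        {
            Apply(optionMapping,
                optionCandidates.FirstOrDefault(o => o.Key == optionMapping.SourceKey));
        }
    }

    private static void Apply(Mapping mapping, IKeyed? candidate)
    {
        mapping.Type = candidate is null ? "AutoNone" : "AutoMapped";
        mapping.CandidateKey = candidate?.Key;
    }
}
```

A call such as ParentAutoMapper.AutoMapParentAndOptions(mapping, filterCandidates, c => c.Options) only ever considers the options of the matched Filter, which is the behaviour the supplier exists to support. Another facet with a different JSON shape could pass a different supplier without changing the mapper.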

jack-hive · 2024-06-26T11:50:07Z

...on.ExploreEducationStatistics.Public.Data.Processor/Services/DataSetVersionMappingService.cs

+
+        if (matchingCandidate is not null)
+        {
+            mapping.CandidateKey = matchingCandidate.Key;


This line got me thinking...

When we do the automapping, and we find a potential match, this CandidateKey is always just going to be identical to the source key. So it almost seems pointless setting it. You have the same amount of information by not having a CandidateKey at all and just having Type set to AutoMapped.

However, I'm guessing you want this property CandidateKey because the user might change it manually - in which case it would no longer be identical?

But if that's the case, then I'm wondering if it should then be called TargetKey rather than CandidateKey. Because 'Candidate' implies 'this could be an option'. Whereas, 'Target' implies 'this is the current option I have set it to, or it was automatically set to'. Just makes more sense to me I think?

At the same time, I get why you'd want to name it CandidateKey to keep it consistent with the naming of the Candidates array... And Candidates does feel like the right naming for that bit of the JSON

What do you think?

Yeah it could go either way tbh. You're right that we always set it to the same key as the source during automapping, but we allow the user to update it as well. We might have brainier strategies that look at other metadata in the future and don't rely solely on identical keys to perform an automapping.

I could go either way namewise, but probably err on CandidateKey just so it's clearer what we're mapping to if that's OK with you?

Yep, all good with me! :) I could also go either way

jack-hive · 2024-06-26T12:10:26Z

...onStatistics.Public.Data.Processor.Tests/Functions/ProcessNextDataSetVersionFunctionTests.cs


+    public abstract class CreateMappingMiscTests(


can remove the abstract here

Also, wonder if we should stick the three test classes that extend CreateMappingsTests inside of CreateMappingsTests? Just so that the nesting makes it a bit easier to digest?

Same with the three test classes that extend ApplyAutoMappingsTests?

Updated this one, thanks. The remaining abstract ones in this class are all base classes for the test suites rather than test suites themselves.

Sorry, for the second bit I was suggesting nesting the test classes inside of their abstract parent base class, just to make the discovery a little better? It will indent everything by 1 tab of course. Don't feel super strongly about it though

duncan-at-hiveit added 2 commits June 24, 2024 15:49

EES-4945 - adding model and migration for DataSetVersionMapping. Added DataSetVersionMappingService implementation and tests.

7192381

EES-4945 - implemented automapping and tests

fe8ed0a

duncan-at-hiveit force-pushed the EES-4945-create-data-set-version-mappings-3 branch from bbe8b5a to fe8ed0a June 24, 2024 21:09

EES-4945 - swapped assignment of MappingType.None for MappingType.AutoNone if the service detects no likely candidates, and switched logic of completed mappings to indicate that AutoNone mappings are incomplete until the user confirms them.

f46fe52

duncan-at-hiveit marked this pull request as ready for review June 25, 2024 09:13

EES-4945 - reran migration after DataSetVersionMapping refactoring to bring PublicDataDbContextModelSnapshot up-to-date

efa1578

Base automatically changed from EES-4944-read-next-data-set-version-metadata to dev June 25, 2024 13:19

duncan-at-hiveit added 2 commits June 26, 2024 11:44

EES-5113 - remodelled JSON structure to use Dictionaries with unique keys rather than Lists that also contained elements with unique keys, to more easily work with JSON paths and JSON partial updates

1f795e3

EES-5113 - additional change to add convenience constructor to mappable elements, as currently these mostly only contain a single Label field. Updated PublicDataDbContextModelSnapshot to reflect JSON field mapping simplifications.

d8ea88a

jack-hive requested changes Jun 26, 2024

duncan-at-hiveit added 4 commits June 26, 2024 14:59

EES-4945 - responding to PR comments. Fixing issue with incorrect initial mapping type for LocationOptions. Various simplifications and tidy-ups.

988e1b2

Merge remote-tracking branch 'origin/EES-4945-create-data-set-version-mappings-3' into EES-5113-investigate-partial-json-updates

84bc7ee

EES-4945 - capturing additional metadata of LocationOptions for display purposes in the front end during mapping process. Swapping Filter Key from being its Label to its column name (PublicId), again for the purposes of the displaying in the front end

f8ce130

EES-5113 - responding to PR comments.

4c7ab32

jack-hive approved these changes Jun 26, 2024

duncan-at-hiveit added 3 commits June 26, 2024 20:40

Merge pull request #5007 from dfe-analytical-services/EES-5113-investigate-partial-json-updates Ees 5113 investigate partial json updates

688a15d

EES-4945 - resolving merge issues before moving to dev

2706b0e

EES-4945 - removing SQL logging from tests

aca4187

duncan-at-hiveit merged commit 9e0037d into dev Jun 26, 2024
2 checks passed

duncan-at-hiveit deleted the EES-4945-create-data-set-version-mappings-3 branch June 26, 2024 20:11
