Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ees 4944 read next data set version metadata #4952

Merged
merged 15 commits into from
Jun 25, 2024

Conversation

duncan-at-hiveit
Copy link
Collaborator

@duncan-at-hiveit duncan-at-hiveit commented Jun 12, 2024

Update 19/06/2024

Based on feedback from Nick, we've decided to go with an in-memory-only extraction of metadata from the next DataSetVersion rather than storing them in the database. As a result of this, the following changes have changed since originally raising this PR:

  • We have a new function called "CreateMappings" - this in turn reads all the appropriate meta for generating the diffs and the meta summary.
  • In order to gather the required metadata, each BlahMetaRepository in use now has a Read* and a Create* method which under the hood use a bunch of shared code for extracting the metadata.
  • CompleteProcessingFunction is no longer broken out as a separate shared function - instead, it's moved back into ProcessInitialDataSetVersionFunction, and a new CompleteNextDataSetVersionMappingsProcessing function is broken out.

Overview

This PR covers EES-4944 (the importing of the next data set version's metadata) and EES-4953 (the Admin API endpoint from which to initiate this action).

Behaviour

Importing

The import process begins in the same way as the initial DataSet workflow:

  1. The Subject CSVs are copied to the file share.
  2. The Metadata is imported.

Then it skips the import of the data itself and jumps straight to completion whereupon the DataSetVersionImport is marked as completed and the DataSetVersion set to its next state, which in this case is "Mapping".

The import process is initiated from the Admin endpoint POST /api/public-data/data-set-versions which takes a DataSetId and the ReleaseFileId for the next candidate version.

Validation

This PR adds validation to the choice of ReleaseFile when it receives the import request from Admin. We'd expect most / all of this to be handled by the dropdown that shows the available Subjects that can be chosen for the next DataSetVersion, but this extra validation is done down in the importer just in case:

There is some shared validation from the initial version import:

  1. The chosen ReleaseFile must not already be in use by a DataSet.
  2. The chosen ReleaseFile must come from a ReleaseVersion that is currently in Draft.
  3. The chosen ReleaseFile must be of type "Data".

And there is some new validation specific to the "next" version import:

  1. The chosen ReleaseFile must belong to a Release of the same Publication as the DataSet.
  2. The chosen ReleaseFile must not belong to the same Release as any of the DataSet's currrent Versions (i.e. you can't add a version "1.1" from a file on the 2021 Release if version "1.0" also lives on the 2021 Release - they must be from another Release entirely and never from any of its amendments).
  3. The chosen DataSet must already have a current live Version.

Refactoring

Breaking "CompleteProcessing" into its own file

As I've found that the CompleteProcessing stage seemed to be relevant in its current form for both the initial DataSetVersion and "next" Versions for now, I have separated it into its own class for reuse between the "initial" and "next" workflows. The only change in behaviour for now is the fact that initial versions end up in a "Draft" state whereas next versions end up in
a "Mapping" state. It may prove down the line that it's best separated out should the 2 workflows differ significantly for any reason but this is pretty trivial to do.

Ben and I talked about the potential for the Import Metadata Function to break into 2 parts which may also affect the CompleteProcessing Function, but we agreed not to try to tackle that now as it would cause issues on his testing branch.

Miscellaneous changes

Adding DbContext.RequireTransaction extension method

Given that more occurrences of the following code block were starting to appear around the codebase (including in this PR), I broke it out into a DbContext extension method for ease of maintenance and readability of the code.

        var strategy = context.Database.CreateExecutionStrategy();

        return await strategy.ExecuteAsync(
            async () =>
            {
                using var transactionScope = new TransactionScope(
                    TransactionScopeOption.Required,
                    new TransactionOptions {IsolationLevel = IsolationLevel.ReadCommitted},
                    TransactionScopeAsyncFlowOption.Enabled);
                
                // Do something with dbs here

                transactionScope.Complete();
            }); 

now becomes:

await context.RequireTransaction(async () => {
  // Do something with dbs here
});

I added a whole suite of tests around this that share a transaction amongst different DbContexts and tests the behaviour if shared transactions and rollback, the support for DbContexts that use EnableRetryOnFailure(), and nested transactions.

Adding EnableRetryOnFailure to all PostgreSQL DbContexts

As this option was available for setting up as we register DbContexts, I have added it to all of our PostgreSQL DbContexts.

@duncan-at-hiveit duncan-at-hiveit force-pushed the EES-4944-read-next-data-set-version-metadata branch 2 times, most recently from f8fe25e to dd3784f Compare June 13, 2024 13:55
@duncan-at-hiveit duncan-at-hiveit marked this pull request as ready for review June 13, 2024 14:05
@duncan-at-hiveit duncan-at-hiveit force-pushed the EES-4944-read-next-data-set-version-metadata branch 11 times, most recently from 7c8aeef to aa11d3e Compare June 19, 2024 12:23
@ntsim
Copy link
Collaborator

ntsim commented Jun 20, 2024

This branch really needs to be rebased onto dev as it's missing some significant new changes. Think there's probably a fair bit of merge conflict pain heading your way!

…e metadata of a new DataSetVersion, and an Admin endpoint for triggering it. Added initial validation to endpoint.
… tests for processing next DataSetVersion and the new separated CompleteProcessingFunction
…de to place code creating the DataSetVersions in the DataSetVersionService and leaving only the DataSet creation in the DataSetService.

Creating a convenience DbContext.RequireTransaction() method to handle transaction opening, completing and rollback that supports exception throwing code and Either-returning code in order to cause a rollback.

Added test suite covering the behaviour of transactions that are creaqted via this extension method, and added RetryOnFailure to all PostgreSQL DbContexts.
…the creation of mappings, instead reading metadata only
@duncan-at-hiveit duncan-at-hiveit force-pushed the EES-4944-read-next-data-set-version-metadata branch from aa11d3e to 906103f Compare June 20, 2024 09:24
…vertions! Better naming and messages, better return types on methods. Refactoring tests for easier reading.
}

public async Task<Either<ActionResult, Unit>> DeleteDataSetVersion(
Guid dataSetVersionId,
private async Task<Either<ActionResult, TResponse>> SendRequest<TResponse>(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was speaking to Ben about this on one of my previous PR's. We were torn which way to go but in the end we decided not to break the handling of the HTTP response out into a common method - because it felt like each public method should be responsible for handling it's HTTP response individually, as the way it handles it may differ from endpoint to endpoint. For example, the DeleteDataSetVersion endpoint previously handled NotFound status codes; whereas now that case would fall into a 500.

I'm still torn which way I'd go on it personally. It does save a lot of code, that is mostly just reused. But at the same time it feels like each endpoint should be handled individually as there isn't technically anything to say they won't be different. They're just the same now by coincidence (aside from DeleteDataSetVersion like I mentioned)


[Function(ActivityNames.CompleteProcessing)]
public async Task CompleteProcessing(

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor one, there's some whitespace crept in here into ProcessInitialDataSetVersionFunction.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh - I'm not seeing what you're seeing I'm afraid. I have a method CompleteInitialDataSetVersionProcessing which has no additional whitespace under it

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a big deal but it's strange that you are not seeing this. The introduction of that line containing only whitespace is shown on Github when I look through the 'Files changed' list. It's line 70 directly above the line containing the annotation [Function(ActivityNames.CompleteInitialDataSetVersionProcessing)].

If I check out this branch locally and reformat this file it shows as a change which can be seen here:

image

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha ok got it ta

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed!

private async Task<List<LocationOptionMetaRow>> GetLocationOptionMetas(
IDuckDbConnection duckDbConnection,
DataSetVersion dataSetVersion,
CancellationToken cancellationToken,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lint here - the cancellation token should be the last parameter

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, amended thanks

@@ -11,6 +11,8 @@ namespace GovUk.Education.ExploreEducationStatistics.Public.Data.Model;

public class DataSetVersion : ICreatedUpdatedTimestamps<DateTimeOffset, DateTimeOffset?>
{
public const string FirstVersionString = "1.0";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd either make this private, or if we wanted to make it public (assuming we have a use for it then I'd instead create a public method which takes in versionMajor and versionMinor integers, and returns $"{versionMajor}.{versionMinor}". You can then use that method publicly to retrieve the string representation of whatever version you want, including "1.0", and you can also re-use it on line 72 to calculate the Version. That way we can ensure that the two are being formatted consistently

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha. You know what, it was because I was originally using it in tests, but have since rolled that back out. I'm just going to return it to the way it was - no const.

Comment on lines 86 to 88
.AddScoped<IValidator<DataSetCreateRequest>, DataSetCreateRequest.Validator>()
.AddScoped<IValidator<NextDataSetVersionCreateRequest>,
NextDataSetVersionCreateRequest.Validator>()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to do this? I thought the .AddFluentValidation() line covered it?

Copy link
Collaborator Author

@duncan-at-hiveit duncan-at-hiveit Jun 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I remove these lines I get:

System.InvalidOperationException : Unable to resolve service for type 'FluentValidation.IValidator`1[GovUk.Education.ExploreEducationStatistics.Public.Data.Processor.Requests.DataSetCreateRequest]' while attempting to activate 'GovUk.Education.ExploreEducationStatistics.Public.Data.Processor.Functions.CreateDataSetFunction'

But then I added:

services.AddValidatorsFromAssemblyContaining<DataSetCreateRequest.Validator>();

and all was well :)

Good spot, thanks

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm... I'd expect .AddFluentValidation to do that for you. If you take a look under 'FluentValidationServiceCollectionExtensions' it seems like it should work with that 1 line alone :/

I'm not sure why it's failing

Copy link
Collaborator Author

@duncan-at-hiveit duncan-at-hiveit Jun 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure tbh. We also do this in the Content API Startup.cs:

            services.AddFluentValidation();
            services.AddValidatorsFromAssemblyContaining<DataSetFileListRequest.Validator>();

Comment on lines 167 to 180
public async Task HttpClientNotFound_ReturnsNotFound()
{
var dataSetVersionId = Guid.NewGuid();

_mockHttp.Expect(HttpMethod.Delete, $"{Uri.AbsoluteUri}/{dataSetVersionId}")
.Respond(HttpStatusCode.NotFound);

var response = await _processorClient.DeleteDataSetVersion(dataSetVersionId);

_mockHttp.VerifyNoOutstandingExpectation();

var left = response.AssertLeft();
left.AssertNotFoundResult();
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we happy to convert NotFounds into 500s?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reinstated this endpoint's special-case handling of 404s by allowing each endpoint to supply an optional "additionalResponseValidator" so that each endpoint can get first stab at deciding what to do with the response. 404s are now returning NotFoundResults for this endpoint alone, although I'd personally be tempted to consider doing this with all endpoints :)

Comment on lines 153 to 167
var subjectId = Guid.NewGuid();

var (releaseFile, releaseMetaFile) = DataFixture.DefaultReleaseFile()
.WithReleaseVersion(DataFixture.DefaultReleaseVersion()
.WithApprovalStatus(ReleaseApprovalStatus.Draft))
.WithFiles([
DataFixture
.DefaultFile(FileType.Data)
.WithSubjectId(subjectId),
DataFixture
.DefaultFile(FileType.Metadata)
.WithSubjectId(subjectId)
])
.GenerateList()
.ToTuple2();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could maybe use your AddDataAndMetadataFiles method here to shorten?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah! Thanks, changed that

Comment on lines 128 to 171
var initialLocationMeta = DataFixture
.DefaultLocationMeta()
.WithDataSetVersionId(initialDataSetVersion.Id)
.ForIndex(0, s => s
.SetLevel(GeographicLevel.LocalAuthority)
.SetOptions(DataFixture
.DefaultLocationLocalAuthorityOptionMeta()
.GenerateList(2)
.Select(meta => meta as LocationOptionMeta)
.ToList()))
.ForIndex(1, s => s
.SetLevel(GeographicLevel.RscRegion)
.SetOptions(DataFixture
.DefaultLocationRscRegionOptionMeta()
.GenerateList(2)
.Select(meta => meta as LocationOptionMeta)
.ToList()))
.GenerateList();

var initialFilterMeta = DataFixture
.DefaultFilterMeta()
.WithDataSetVersionId(initialDataSetVersion.Id)
.WithOptions(() => DataFixture
.DefaultFilterOptionMeta()
.GenerateList(2))
.GenerateList(2);

var initialIndicatorMeta = DataFixture
.DefaultIndicatorMeta()
.WithDataSetVersionId(initialDataSetVersion.Id)
.GenerateList(2);

var initialTimePeriodMeta = DataFixture
.DefaultTimePeriodMeta()
.WithDataSetVersionId(initialDataSetVersion.Id)
.GenerateList(4);

await AddTestData<PublicDataDbContext>(context =>
{
context.LocationMetas.AddRange(initialLocationMeta);
context.FilterMetas.AddRange(initialFilterMeta);
context.IndicatorMetas.AddRange(initialIndicatorMeta);
context.TimePeriodMetas.AddRange(initialTimePeriodMeta);
});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think you need any of this now since you've changed it to do the work in-memory

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're only reading the next data set version's meta in-memory, but the original data set version's meta is still read from the db as it's already in there and ready to go!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry

Do we do anything with it though? The test seems to pass if you remove all this

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah of course, because this PR doesn't contain the "create mappings" work! Cool, removed here, but it's reinstated in the 4945 PR! Ta

Comment on lines 255 to 256
var dataSetVersionPathResolver = GetRequiredService<IDataSetVersionPathResolver>();
Directory.CreateDirectory(dataSetVersionPathResolver.DirectoryPath(nextDataSetVersion));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't think you need these 2 lines?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% correct! Removed thanks

…equences into extracted DbContextTestExtensions.ClearTestData method
…ersion in response to PR comments, by adding additional validation parameter which allows each endpoint to be handled differently if required.
cancellationToken);
}

private async Task<Either<ActionResult, TResponse>> SendRequest<TResponse>(
Func<Task<HttpResponseMessage>> requestFunction,
Func<HttpResponseMessage, ActionResult?>? additionalResponseValidator = null,
Copy link
Collaborator

@jack-hive jack-hive Jun 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd potentially rename this to additionalFailedResponseValidator or something similar to indicate that it's only used in non-success cases. But other than that, I think this looks good :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've renamed to customResponseHandler to keep it open-ended and moved it so that it can potentially handle any and all responses, failures or otherwise. Doesn't have to be solely for validation anymore but could be used to do something custom with success cases as well.

@duncan-at-hiveit duncan-at-hiveit merged commit 42991af into dev Jun 25, 2024
3 of 7 checks passed
@duncan-at-hiveit duncan-at-hiveit deleted the EES-4944-read-next-data-set-version-metadata branch June 25, 2024 13:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants