Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EES-5097 Add Processor functions to create data set meta, import to DuckDb and export to Parquet data files #4880

Merged
merged 10 commits into from
May 31, 2024

Conversation

benoutram
Copy link
Collaborator

@benoutram benoutram commented May 22, 2024

This PR adds new activity functions to the Processor's durable function orchestration for processing an initial data set version:

  • ImportMetadata - Uses the metadata and data CSV files to create the filter, indicator, location, geographic level and time period meta data for a data set version. It then uses this data set version meta to create the DuckDb filter, indicator, location and time period tables.
  • ImportData - Uses the data CSV file along with the data set version meta and the DuckDb meta tables to create the DuckDb data table.
  • ExportData - Exports the DuckDb database in Parquet format to the data set version file directory and strips absolute paths to the data directory in load.sql.

Unlike SeedDataCommand.Seeder, the processor functions use a DuckDb file data source to provide data persistence between activity function executions so we can separate out stages into different functions. The CompleteProcessing function is altered to remove the DuckDb database file after the export to Parquet is complete.

Other changes

  • Rename ParquetFilesOptions to DataFilesOptions to reflect that the files directory holds more than just Parquet files.
  • Change the base files path used by the Public.Data.Api and Public.Data.Processor and projects and SeedDataCommand.Seeder from data/public-api-parquet to data/public-api-data to reflect that the files directory holds more than just Parquet files.
  • Add a new DuckDbConnection extension method to allow creating an SqlBuilder with a (non-interpolated) string command.
  • Convert the DuckDb queries in SeedDataCommand.Seeder to use the SqlBuilder API.
  • Convert backslash to a forward slash in the absolute version directory path before replacing lines in load.sql in SeedDataCommand to fix issue seen in Windows where the directory path doesn't match the forward slashes used in load.sql by the DuckDb database export.
  • Replace the default DuckDbConnection in-memory data source string with the DuckDBConnectionStringBuilder.InMemoryConnectionString constant.
  • Change CopyCsvFilesFunctionTests to upload real source csv data and metadata files to blob storage from the Resources/DataFiles/AbsenceSchool directory.

@benoutram benoutram force-pushed the EES-5097 branch 5 times, most recently from 5f231b6 to 4837b32 Compare May 31, 2024 08:53
@benoutram benoutram marked this pull request as ready for review May 31, 2024 08:53
@benoutram benoutram requested a review from ntsim May 31, 2024 08:56
Comment on lines 164 to 177
string[] expectedDataSetVersionFiles =
[
_dataSetVersionPathResolver.CsvDataPath(dataSetVersion),
_dataSetVersionPathResolver.CsvMetadataPath(dataSetVersion),
_dataSetVersionPathResolver.DuckDbPath(dataSetVersion)
];

var actualDataSetVersionFiles = Directory
.GetFiles(_dataSetVersionPathResolver.DirectoryPath(dataSetVersion))
.Select(Path.GetFullPath)
.ToArray();

Assert.Equal(expectedDataSetVersionFiles.Length, actualDataSetVersionFiles.Length);
Assert.All(expectedDataSetVersionFiles, filePath => Assert.Contains(filePath, actualDataSetVersionFiles));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think these assertions are particularly relevant to these tests?

Guessing we're probably going to do add more checks in EES-5152 on the actual imported metadata, but these assertions just seem like noise in the test so would be tempted to cut them out.

Similar applies for some of the other stage tests below.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes planning to add some checks on the actual imported metadata/data in EES-5152.

I feel like these assertions can be justified as the intention is really to check that no additional files have crept in and that no files have been deleted yet.

Creating the filename constants has allowed this to be tidied up a fair bit. I've added a method AssertDataSetVersionDirectoryContainsOnlyFiles which can be used like this, e.g. in the WriteDataFileTests:

AssertDataSetVersionDirectoryContainsOnlyFiles(dataSetVersion, _allDataSetVersionFiles);

or in earlier stage tests:

AssertDataSetVersionDirectoryContainsOnlyFiles(dataSetVersion,
[
    DataSetFilenames.CsvDataFile,
    DataSetFilenames.CsvMetadataFile,
    DataSetFilenames.DuckDbDatabaseFile
]);

or when checking the duck db file is the only file deleted on completion:

AssertDataSetVersionDirectoryContainsOnlyFiles(dataSetVersion,
    _allDataSetVersionFiles
    .Where(file => file != DataSetFilenames.DuckDbDatabaseFile)
    .ToArray());

What do you think to this change?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, can't say I necessarily agree that we need to check them, but it's not hurting and the new method makes a bit nicer anyway. Let's go with it!

Comment on lines 57 to 59
#pragma warning disable IDE0060
[ActivityTrigger] object? input,
#pragma warning restore IDE0060
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in Slack, the pragmas are pretty horrible. Perhaps we should switch back to using the instanceId as the activity trigger?

Same applies for the other functions in this PR

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've switched back to using the instanceId as the parameter for the activity trigger function binding.

For my own benefit in case I come back to this in future I'm just documenting here that we should have been able to use the discard parameter like [ActivityTrigger] object? _, instead of [ActivityTrigger] object? input, but there's validation on the binding name resulting in a startup error The binding name _ is invalid. Please assign a valid name to the binding..

There doesn't seem to be a way of assigning a different name.

We could have also used this at a class level instead of the pragmas to supress the warning

[SuppressMessage("Style", "IDE0060:Remove unused parameter", Justification = "Unused parameter is required by functions binding")]

Or we could have targeted the suppressions on individual functions by adding something like this to AssemblyInfo.cs:

[assembly: SuppressMessage("Style", "IDE0060:Remove unused parameter", Justification = "Unused parameter is required by functions binding", Scope = "member", Target = "~M:GovUk.Education.ExploreEducationStatistics.Public.Data.Processor.Functions.ProcessInitialDataSetVersionFunction.CompleteProcessing(System.Object,System.Guid,System.Threading.CancellationToken)~System.Threading.Tasks.Task")]
[assembly: SuppressMessage("Style", "IDE0060:Remove unused parameter", Justification = "Unused parameter is required by functions binding", Scope = "member", Target = "~M:GovUk.Education.ExploreEducationStatistics.Public.Data.Processor.Functions.ProcessInitialDataSetVersionFunction.HandleProcessingFailure(System.Object,System.Guid,System.Threading.CancellationToken)~System.Threading.Tasks.Task")]
[assembly: SuppressMessage("Style", "IDE0060:Remove unused parameter", Justification = "Unused parameter is required by functions binding", Scope = "member", Target = "~M:GovUk.Education.ExploreEducationStatistics.Public.Data.Processor.Functions.ProcessInitialDataSetVersionFunction.ImportMetadata(System.Object,System.Guid,System.Threading.CancellationToken)~System.Threading.Tasks.Task")]
[assembly: SuppressMessage("Style", "IDE0060:Remove unused parameter", Justification = "Unused parameter is required by functions binding", Scope = "member", Target = "~M:GovUk.Education.ExploreEducationStatistics.Public.Data.Processor.Functions.ProcessInitialDataSetVersionFunction.ImportData(System.Object,System.Guid,System.Threading.CancellationToken)~System.Threading.Tasks.Task")]
[assembly: SuppressMessage("Style", "IDE0060:Remove unused parameter", Justification = "Unused parameter is required by functions binding", Scope = "member", Target = "~M:GovUk.Education.ExploreEducationStatistics.Public.Data.Processor.Functions.ProcessInitialDataSetVersionFunction.WriteDataFiles(System.Object,System.Guid,System.Threading.CancellationToken)~System.Threading.Tasks.Task")]

I found these issues relevant to this problem:
https://github.com/MicrosoftDocs/azure-docs/issues/48735
Azure/azure-functions-dotnet-worker#323

@benoutram benoutram merged commit 684ac2d into dev May 31, 2024
7 checks passed
@benoutram benoutram deleted the EES-5097 branch May 31, 2024 18:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants