-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EES-5097 Add Processor functions to create data set meta, import to DuckDb and export to Parquet data files #4880
Conversation
...eEducationStatistics.Public.Data.Processor/Functions/ProcessInitialDataSetVersionFunction.cs
Outdated
Show resolved
Hide resolved
...eEducationStatistics.Public.Data.Processor/Functions/ProcessInitialDataSetVersionFunction.cs
Show resolved
Hide resolved
src/GovUk.Education.ExploreEducationStatistics.Public.Data.Processor/Models/MetaFileRow.cs
Outdated
Show resolved
Hide resolved
....Education.ExploreEducationStatistics.Public.Data.Api.Tests/appsettings.IntegrationTest.json
Show resolved
Hide resolved
...eEducationStatistics.Public.Data.Processor/Functions/ProcessInitialDataSetVersionFunction.cs
Outdated
Show resolved
Hide resolved
....Education.ExploreEducationStatistics.Public.Data.Processor/Repository/LocationRepository.cs
Outdated
Show resolved
Hide resolved
....Education.ExploreEducationStatistics.Public.Data.Processor/Repository/LocationRepository.cs
Outdated
Show resolved
Hide resolved
...Uk.Education.ExploreEducationStatistics.Public.Data.Processor/Services/DataSetMetaService.cs
Outdated
Show resolved
Hide resolved
src/GovUk.Education.ExploreEducationStatistics.Public.Data.Scripts/Commands/SeedDataCommand.cs
Outdated
Show resolved
Hide resolved
src/GovUk.Education.ExploreEducationStatistics.Public.Data.Scripts/Commands/SeedDataCommand.cs
Outdated
Show resolved
Hide resolved
5f231b6
to
4837b32
Compare
src/GovUk.Education.ExploreEducationStatistics.Public.Data.Model/DataSetVersionImportStage.cs
Outdated
Show resolved
Hide resolved
...tatistics.Public.Data.Processor.Tests/Functions/ProcessInitialDataSetVersionFunctionTests.cs
Outdated
Show resolved
Hide resolved
string[] expectedDataSetVersionFiles = | ||
[ | ||
_dataSetVersionPathResolver.CsvDataPath(dataSetVersion), | ||
_dataSetVersionPathResolver.CsvMetadataPath(dataSetVersion), | ||
_dataSetVersionPathResolver.DuckDbPath(dataSetVersion) | ||
]; | ||
|
||
var actualDataSetVersionFiles = Directory | ||
.GetFiles(_dataSetVersionPathResolver.DirectoryPath(dataSetVersion)) | ||
.Select(Path.GetFullPath) | ||
.ToArray(); | ||
|
||
Assert.Equal(expectedDataSetVersionFiles.Length, actualDataSetVersionFiles.Length); | ||
Assert.All(expectedDataSetVersionFiles, filePath => Assert.Contains(filePath, actualDataSetVersionFiles)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think these assertions are particularly relevant to these tests?
Guessing we're probably going to do add more checks in EES-5152 on the actual imported metadata, but these assertions just seem like noise in the test so would be tempted to cut them out.
Similar applies for some of the other stage tests below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes planning to add some checks on the actual imported metadata/data in EES-5152.
I feel like these assertions can be justified as the intention is really to check that no additional files have crept in and that no files have been deleted yet.
Creating the filename constants has allowed this to be tidied up a fair bit. I've added a method AssertDataSetVersionDirectoryContainsOnlyFiles
which can be used like this, e.g. in the WriteDataFileTests
:
AssertDataSetVersionDirectoryContainsOnlyFiles(dataSetVersion, _allDataSetVersionFiles);
or in earlier stage tests:
AssertDataSetVersionDirectoryContainsOnlyFiles(dataSetVersion,
[
DataSetFilenames.CsvDataFile,
DataSetFilenames.CsvMetadataFile,
DataSetFilenames.DuckDbDatabaseFile
]);
or when checking the duck db file is the only file deleted on completion:
AssertDataSetVersionDirectoryContainsOnlyFiles(dataSetVersion,
_allDataSetVersionFiles
.Where(file => file != DataSetFilenames.DuckDbDatabaseFile)
.ToArray());
What do you think to this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, can't say I necessarily agree that we need to check them, but it's not hurting and the new method makes a bit nicer anyway. Let's go with it!
...tatistics.Public.Data.Processor.Tests/Functions/ProcessInitialDataSetVersionFunctionTests.cs
Outdated
Show resolved
Hide resolved
...reEducationStatistics.Public.Data.Processor/Extensions/TaskOrchestrationContextExtensions.cs
Outdated
Show resolved
Hide resolved
...reEducationStatistics.Public.Data.Processor/Extensions/TaskOrchestrationContextExtensions.cs
Outdated
Show resolved
Hide resolved
...reEducationStatistics.Public.Data.Processor/Extensions/TaskOrchestrationContextExtensions.cs
Outdated
Show resolved
Hide resolved
...eEducationStatistics.Public.Data.Processor/Functions/ProcessInitialDataSetVersionFunction.cs
Outdated
Show resolved
Hide resolved
#pragma warning disable IDE0060 | ||
[ActivityTrigger] object? input, | ||
#pragma warning restore IDE0060 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in Slack, the pragmas are pretty horrible. Perhaps we should switch back to using the instanceId
as the activity trigger?
Same applies for the other functions in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've switched back to using the instanceId
as the parameter for the activity trigger function binding.
For my own benefit in case I come back to this in future I'm just documenting here that we should have been able to use the discard parameter like [ActivityTrigger] object? _,
instead of [ActivityTrigger] object? input,
but there's validation on the binding name resulting in a startup error The binding name _ is invalid. Please assign a valid name to the binding.
.
There doesn't seem to be a way of assigning a different name.
We could have also used this at a class level instead of the pragmas to supress the warning
[SuppressMessage("Style", "IDE0060:Remove unused parameter", Justification = "Unused parameter is required by functions binding")]
Or we could have targeted the suppressions on individual functions by adding something like this to AssemblyInfo.cs
:
[assembly: SuppressMessage("Style", "IDE0060:Remove unused parameter", Justification = "Unused parameter is required by functions binding", Scope = "member", Target = "~M:GovUk.Education.ExploreEducationStatistics.Public.Data.Processor.Functions.ProcessInitialDataSetVersionFunction.CompleteProcessing(System.Object,System.Guid,System.Threading.CancellationToken)~System.Threading.Tasks.Task")]
[assembly: SuppressMessage("Style", "IDE0060:Remove unused parameter", Justification = "Unused parameter is required by functions binding", Scope = "member", Target = "~M:GovUk.Education.ExploreEducationStatistics.Public.Data.Processor.Functions.ProcessInitialDataSetVersionFunction.HandleProcessingFailure(System.Object,System.Guid,System.Threading.CancellationToken)~System.Threading.Tasks.Task")]
[assembly: SuppressMessage("Style", "IDE0060:Remove unused parameter", Justification = "Unused parameter is required by functions binding", Scope = "member", Target = "~M:GovUk.Education.ExploreEducationStatistics.Public.Data.Processor.Functions.ProcessInitialDataSetVersionFunction.ImportMetadata(System.Object,System.Guid,System.Threading.CancellationToken)~System.Threading.Tasks.Task")]
[assembly: SuppressMessage("Style", "IDE0060:Remove unused parameter", Justification = "Unused parameter is required by functions binding", Scope = "member", Target = "~M:GovUk.Education.ExploreEducationStatistics.Public.Data.Processor.Functions.ProcessInitialDataSetVersionFunction.ImportData(System.Object,System.Guid,System.Threading.CancellationToken)~System.Threading.Tasks.Task")]
[assembly: SuppressMessage("Style", "IDE0060:Remove unused parameter", Justification = "Unused parameter is required by functions binding", Scope = "member", Target = "~M:GovUk.Education.ExploreEducationStatistics.Public.Data.Processor.Functions.ProcessInitialDataSetVersionFunction.WriteDataFiles(System.Object,System.Guid,System.Threading.CancellationToken)~System.Threading.Tasks.Task")]
I found these issues relevant to this problem:
https://github.com/MicrosoftDocs/azure-docs/issues/48735
Azure/azure-functions-dotnet-worker#323
…at the files directory holds more than just parquet files. Change base path from data/public-api-parquet to data/public-api-data
…ows before replacing lines in load.sql
…qlBuilder with a (non-interpolated) string command
…ngBuilder.InMemoryConnectionString constant
…o DuckDb and export parquet data files
This PR adds new activity functions to the Processor's durable function orchestration for processing an initial data set version:
ImportMetadata
- Uses the metadata and data CSV files to create the filter, indicator, location, geographic level and time period meta data for a data set version. It then uses this data set version meta to create the DuckDb filter, indicator, location and time period tables.ImportData
- Uses the data CSV file along with the data set version meta and the DuckDb meta tables to create the DuckDb data table.ExportData
- Exports the DuckDb database in Parquet format to the data set version file directory and strips absolute paths to the data directory inload.sql
.Unlike
SeedDataCommand.Seeder
, the processor functions use a DuckDb file data source to provide data persistence between activity function executions so we can separate out stages into different functions. TheCompleteProcessing
function is altered to remove the DuckDb database file after the export to Parquet is complete.Other changes
ParquetFilesOptions
toDataFilesOptions
to reflect that the files directory holds more than just Parquet files.SeedDataCommand.Seeder
fromdata/public-api-parquet
todata/public-api-data
to reflect that the files directory holds more than just Parquet files.DuckDbConnection
extension method to allow creating anSqlBuilder
with a (non-interpolated) string command.SeedDataCommand.Seeder
to use theSqlBuilder
API.load.sql
inSeedDataCommand
to fix issue seen in Windows where the directory path doesn't match the forward slashes used inload.sql
by the DuckDb database export.DuckDbConnection
in-memory data source string with theDuckDBConnectionStringBuilder.InMemoryConnectionString
constant.CopyCsvFilesFunctionTests
to upload real source csv data and metadata files to blob storage from theResources/DataFiles/AbsenceSchool
directory.