Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SCHEMATIC-1] BugFix: manifest submission when data previously annotated without manifest upload #1538

Merged
merged 9 commits into from
Nov 13, 2024

Conversation

GiaJordan
Copy link
Contributor

@GiaJordan GiaJordan commented Nov 11, 2024

Issue

Resolves SCHEMATIC-1 which occurs when users:

  • Have a dataset/top level folder on synapse with 50 or more files
  • generate a manifest with schematic/DCA and fill it out
    • possibly validate the manifest with schematic
  • Use the manifest to annotate the data with a client other than schematic ie. R
  • Try to generate a new manifest with schematic using the existing annotations from the data on synapse

Changes

  • when a temporary file view is created with DatasetFileView and the resulting view is "tidied", any existing eTag columns will be removed before the new values are stored in that column.
  • Test resources on synapse have been created to replicate the issue and an integration test has been added. It was not necessary to include 50+ files as we're testing DatasetFileView directly instead of through manifest generation

Notes

  • Technically using --use_annotations to generate a manifest for a dataset where there isn't already a manifest present is not supported, but this provides a way for users to get back on the recommended data flow path instead of having to start all over.
  • DatasetFileView is only used in this one case, gathering annotations for files in a dataset with more than 50 files. With the ability we have now to scope fileview queries in the SynapseStorage object and get the annotations for all files in a dataset we should consider whether we should deprecate the DatasetFileView class.
  • We should also consider whether there's anything else preventing us from officially supporting using the --use_annotations feature when there is no manifest in the dataset since this change enables it in at least this case.

@GiaJordan GiaJordan requested a review from a team as a code owner November 11, 2024 23:01
@GiaJordan GiaJordan marked this pull request as draft November 11, 2024 23:01
@GiaJordan GiaJordan marked this pull request as ready for review November 12, 2024 17:44
@thomasyu888 thomasyu888 changed the title BugFix: manifest submission when data previously annotated without manifest upload [SCHEMATIC-1] BugFix: manifest submission when data previously annotated without manifest upload Nov 12, 2024
Copy link
Member

@thomasyu888 thomasyu888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @GiaJordan , I'll defer to @BryanFauble and/or @andrewelamb for final review

Copy link
Collaborator

@BryanFauble BryanFauble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@thomasyu888 thomasyu888 merged commit f6d015b into develop Nov 13, 2024
7 checks passed
@thomasyu888 thomasyu888 deleted the schematic-1-existing-eTag branch November 13, 2024 02:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants