Clear out of any previous iterations of data structures #70

Open
2 tasks done
tom-webber opened this issue Nov 20, 2023 · 1 comment

tom-webber commented Nov 20, 2023

User Story

As a DaaP developer
I want the data lake buckets to be cleared out of any previous iterations of data structures
So that any data in the buckets is relevant to the current state of the catalogue

Definition of Done

Example

  • Delete items in the data and landing buckets that are versioned with minor versions rather than major versions, i.e. under test_product/v1.0/ rather than test_product/v1/
  • Delete Glue databases in the awsdatacatalog data catalogue that belong to previous versions of Data Products but don't have the suffix _vX
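
As a rough sketch of the second item, the databases to review could be found by filtering names on the missing suffix. This assumes that "_vX" means "_v" followed by a major version number at the end of the name, and that the caller has the glue:GetDatabases permission. The function names are hypothetical, and a database without the suffix is not necessarily an old Data Product, so the output is a list of candidates rather than a deletion list.

```python
import re

# Assumption: current databases end in "_v" followed by a major version number.
VERSION_SUFFIX = re.compile(r"_v\d+$")


def databases_without_version_suffix(names):
    """Return the database names that lack a trailing _vX suffix."""
    return [name for name in names if not VERSION_SUFFIX.search(name)]


def list_glue_database_names():
    """List every database name in the default catalogue (needs glue:GetDatabases)."""
    import boto3  # imported lazily so the filter above can be used without AWS access

    glue = boto3.client("glue")
    names = []
    for page in glue.get_paginator("get_databases").paginate():
        names.extend(database["Name"] for database in page["DatabaseList"])
    return names
```

Calling databases_without_version_suffix(list_glue_database_names()) gives the candidates; each one would still need checking before being passed to glue.delete_database(Name=...).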

tom-webber commented Nov 20, 2023

There are old versions of S3 keys in the data buckets. If we want to completely purge data when we delete, we need to modify any deletion methods to also remove versions. Alternatively, we could use the NoncurrentVersionExpiration action element of a lifecycle policy rule to delete previous versions of data automatically after a set number of days.
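
A minimal sketch of the lifecycle option, assuming a single rule applied by prefix. The function names and the rule ID are made up for illustration. Note that put_bucket_lifecycle_configuration replaces the bucket's whole lifecycle configuration, so any existing rules would need to be merged in first.

```python
def noncurrent_expiration_config(days, prefix=""):
    """Build a lifecycle configuration that expires noncurrent object versions."""
    return {
        "Rules": [
            {
                "ID": f"expire-noncurrent-after-{days}-days",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # an empty prefix covers the whole bucket
                "NoncurrentVersionExpiration": {"NoncurrentDays": days},
            }
        ]
    }


def apply_noncurrent_expiration(bucket_name, days, prefix=""):
    """Apply the rule; this overwrites the bucket's existing lifecycle configuration."""
    import boto3  # imported lazily so the builder above can be checked offline

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration=noncurrent_expiration_config(days, prefix),
    )
```

This only handles versions that become noncurrent. It does not replace the explicit version deletion needed for an immediate purge.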

Note: with the sandbox account it is not possible to retrieve information about the Glue databases other than through the AWS console, due to a lack of permissions:

botocore.errorfactory.AccessDeniedException: 
An error occurred (AccessDeniedException) when calling the GetDatabases operation: 
User: arn:aws:sts::013433889002:assumed-role/AWSReservedSSO_modernisation-platform-sandbox_73c5f933298caaac/tom-webber@digital.justice.gov.uk is not authorized to perform: 
glue:GetDatabases on resource: arn:aws:glue:eu-west-1:awsdatacatalog:catalog

Loop for version deletion:

import re

import boto3

bucket_name = ""  # set before running
folder = ""  # set before running
key_prefix = ""  # set before running
folder_key = folder + "/" + key_prefix

# Matches keys under a minor-versioned path, e.g. test_product/v1.0/
minor_version_pattern = re.compile(r"/v\d+\.")

s3 = boto3.resource("s3")
bucket = s3.Bucket(bucket_name)
# https://boto3.amazonaws.com/v1/documentation/api/1.28.1/reference/services/s3/bucket/object_versions.html
object_versions = bucket.object_versions.filter(Prefix=folder_key)
# https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/objectversion/index.html
for object_version in object_versions:
    if minor_version_pattern.search(object_version.key):
        print(object_version.key)
        # Deleting an ObjectVersion permanently removes that specific version.
        object_version.delete()
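
A possible variation on the loop above, sketched with a dry-run default and batched deletes. It assumes minor-versioned paths look like /v1.0/ as in the issue description. The helper names are hypothetical. The batch size of 1000 matches the per-request limit of the S3 DeleteObjects call.

```python
import re

# Assumption: minor-versioned paths contain a /v<major>.<minor>/ segment.
MINOR_VERSION = re.compile(r"/v\d+\.\d+/")


def is_minor_versioned(key):
    """True for keys such as test_product/v1.0/..., False for test_product/v1/..."""
    return bool(MINOR_VERSION.search(key))


def batches(items, size=1000):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def delete_minor_versioned(bucket_name, prefix, dry_run=True):
    """Collect matching object versions and delete them unless dry_run is set."""
    import boto3  # imported lazily so the helpers above can be checked offline

    bucket = boto3.resource("s3").Bucket(bucket_name)
    targets = [
        {"Key": version.object_key, "VersionId": version.id}
        for version in bucket.object_versions.filter(Prefix=prefix)
        if is_minor_versioned(version.object_key)
    ]
    if not dry_run:
        for batch in batches(targets):
            bucket.delete_objects(Delete={"Objects": batch})
    return targets
```

Running it with dry_run=True first returns the key and version ID pairs that would be removed, which can be reviewed before anything is deleted.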

@tom-webber tom-webber self-assigned this Nov 20, 2023
@tom-webber tom-webber changed the title from "[gitmoji] <title>" to "Clear out of any previous iterations of data structures" Nov 20, 2023
@moj-data-platform-robot moj-data-platform-robot transferred this issue from ministryofjustice/analytical-platform Apr 25, 2024
Status: Done ✅