Create restore cadet dataset indices workflow #408

Open
LavMatt opened this issue Dec 17, 2024 · 0 comments

Context

We recently hit an issue in DataHub: hard deleting cadet containers removed the IsPartOf relationship from their datasets, and recreating the containers and reingesting the datasets did not restore the lost relationships.

This has led us to learn more about these relationships and how they are created. Unlike the majority of the metadata aspects, they are not persisted in the postgres metadata_aspects_v2 table. They are part of the Elasticsearch graph index and, it appears, are not recreated when a dataset already exists.

The fix we have found is to reindex Elasticsearch (or AWS OpenSearch in our case) using the restore indices Kubernetes job, see https://datahubproject.io/docs/how/restore-indices/#kubernetes

The problem

The standard job template provided by DataHub reindexes every row of the metadata_aspects_v2 table in batches of 1000. This takes a very long time, and when we ran it with the default config we hit OOM (out-of-memory) errors and the reindex failed.

Proposal

Some options for the restore indices job can be configured, see https://github.com/datahub-project/datahub/blob/master/docker/datahub-upgrade/README.md.

This has been tested by amending the cronjob datahub-datahub-restore-indices-job-template, adding the following arg values:

  - -a
  - batchSize=800
  - -a
  - urnLike=urn:li:dataset:(urn:li:dataPlatform:dbt,cadet%
  - -a
  - urnBasedPagination=true
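
If these values end up being generated by a script rather than edited by hand, the arg list above could be built along the following lines. This is only a sketch: the function name build_restore_args and its parameters are hypothetical, and it produces nothing beyond the three "-a key=value" pairs listed above.

```python
from typing import List, Optional

# The urnLike pattern used in the test run above; it targets cadet datasets only.
CADET_URN_LIKE = "urn:li:dataset:(urn:li:dataPlatform:dbt,cadet%"


def build_restore_args(
    batch_size: int = 800,
    urn_like: Optional[str] = None,
    urn_based_pagination: bool = True,
) -> List[str]:
    """Return the extra container args as a flat list of '-a', 'key=value' pairs.

    Hypothetical helper: the option names (batchSize, urnLike,
    urnBasedPagination) are the ones used in this issue, nothing else is added.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    # Dicts keep insertion order, so the args come out in the order shown above.
    options = {"batchSize": str(batch_size)}
    if urn_like is not None:
        options["urnLike"] = urn_like
    options["urnBasedPagination"] = "true" if urn_based_pagination else "false"

    args: List[str] = []
    for key, value in options.items():
        args += ["-a", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print the args as YAML list items, ready to paste into the job template.
    for arg in build_restore_args(urn_like=CADET_URN_LIKE):
        print(f"  - {arg}")
```

Keeping the batch size and urnLike pattern as parameters would let the same helper be reused for other subsets of datasets later.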

This ran a lot quicker but did not appear to complete successfully. Further investigation showed that this is a bug DataHub are aware of; a fix has been applied in datahub-project/datahub#11305 but is not yet released.

We should create a job template that we can run from a GitHub Actions (gha) workflow, so that we can target the cadet datasets for a reindex.
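
As a rough sketch of what a workflow step might run, the snippet below builds a kubectl command that creates a one-off job from the cronjob template. It assumes the template already carries the cadet-specific args; the job name "restore-cadet-indices-adhoc" and the namespace "datahub" are placeholders, not values from our cluster.

```python
import shlex
from typing import List

# Cronjob template named in this issue.
TEMPLATE_CRONJOB = "datahub-datahub-restore-indices-job-template"


def create_job_command(job_name: str, namespace: str) -> List[str]:
    """Build a `kubectl create job --from=cronjob/...` command as an argv list.

    The job inherits whatever args the cronjob template defines, so the
    template must already include the cadet urnLike filter.
    """
    return [
        "kubectl", "create", "job", job_name,
        f"--from=cronjob/{TEMPLATE_CRONJOB}",
        "--namespace", namespace,
    ]


if __name__ == "__main__":
    # Placeholder job name and namespace; adjust for the target cluster.
    cmd = create_job_command("restore-cadet-indices-adhoc", "datahub")
    # Only print the command here; a workflow step could instead run it
    # with subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Returning the command as a list rather than a string avoids shell quoting issues if it is later passed to subprocess.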
