Context
We recently encountered an issue in DataHub where hard-deleting cadet containers removed the IsPartOf relationship from datasets. Subsequently recreating the containers and reingesting the datasets did not recreate the lost relationships.
This has led us to understand more about these relationships and how they are created. They are not persisted in the same location as the majority of the metadata aspects, the Postgres metadata_aspects_v2 table. Instead, they are part of the Elasticsearch graph index and, it appears, are not recreated when a dataset already exists.
The fix we have found for this is to reindex Elasticsearch (or AWS OpenSearch in our case) using the restore indices Kubernetes job, see https://datahubproject.io/docs/how/restore-indices/#kubernetes
The problem
The standard job template provided by DataHub reindexes every row of the metadata_aspects_v2 table in batches of 1000. This takes a very long time, and when we tried to run it with the default config we hit OOM errors and the reindex failed.
Proposal
It is possible to configure some options for the restore indices job, see https://github.com/datahub-project/datahub/blob/master/docker/datahub-upgrade/README.md.
This has been tested by amending the cronjob datahub-datahub-restore-indices-job-template, adding the following arg values:
- -a
- batchSize=800
- -a
- urnLike=urn:li:dataset:(urn:li:dataPlatform:dbt,cadet%
- -a
- urnBasedPagination=true
This ran a lot quicker but did not appear to complete successfully. Further investigation showed that this is a bug the DataHub maintainers are aware of. A fix has been applied in datahub-project/datahub#11305 but is not yet released.
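As a rough illustration of the arg values tested above, the sketch below flattens a set of options into repeated `-a key=value` pairs and checks which URNs a `urnLike` pattern would select. It assumes `urnLike` follows SQL LIKE semantics (`%` matches any run of characters); the helper names and the example URNs are hypothetical and not part of DataHub.

```python
from __future__ import annotations

import re


def restore_indices_args(options: dict[str, str]) -> list[str]:
    """Flatten options into repeated `-a key=value` pairs."""
    args: list[str] = []
    for key, value in options.items():
        args += ["-a", f"{key}={value}"]
    return args


def like_to_regex(pattern: str) -> re.Pattern[str]:
    """Translate a SQL LIKE pattern into a regex.

    Assumption: `%` matches any run of characters and `_` matches any
    single character, as in standard SQL LIKE.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


# Option values taken from the tested configuration.
CADET_OPTIONS = {
    "batchSize": "800",
    "urnLike": "urn:li:dataset:(urn:li:dataPlatform:dbt,cadet%",
    "urnBasedPagination": "true",
}

args = restore_indices_args(CADET_OPTIONS)
matcher = like_to_regex(CADET_OPTIONS["urnLike"])

# Hypothetical URNs, used only to sanity-check the pattern.
cadet_urn = "urn:li:dataset:(urn:li:dataPlatform:dbt,cadet.example.table,PROD)"
other_urn = "urn:li:dataset:(urn:li:dataPlatform:glue,other.table,PROD)"

print(args)
print(bool(matcher.fullmatch(cadet_urn)), bool(matcher.fullmatch(other_urn)))
```

Checking the pattern against a handful of known URNs before running the job is a cheap way to confirm that only the cadet dbt datasets would be reindexed.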
We should create a job template that we can run from a GitHub Actions (GHA) workflow, so that we can target the cadet datasets for a reindex.
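One possible shape for the workflow step is to create a one-off job from the existing cronjob with `kubectl create job --from=cronjob/...`. The sketch below only builds that command. The job name and namespace are hypothetical placeholders, and it assumes the cronjob template already carries the cadet-specific args.

```python
from __future__ import annotations

import shlex

# Cronjob name as used in this issue.
CRONJOB = "datahub-datahub-restore-indices-job-template"


def create_job_command(
    job_name: str, namespace: str, cronjob: str = CRONJOB
) -> list[str]:
    """Build a kubectl command that creates a one-off job from a cronjob."""
    return [
        "kubectl",
        "create",
        "job",
        job_name,
        f"--from=cronjob/{cronjob}",
        "--namespace",
        namespace,
    ]


# Hypothetical job name and namespace, for illustration only.
cmd = create_job_command("restore-indices-cadet", "datahub")
print(shlex.join(cmd))
```

A workflow step could pass this list to `subprocess.run(cmd, check=True)`, or run the printed command directly in a shell step. A job created this way runs with whatever args the cronjob template defines, so the targeting options would need to live in the template itself.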