Synchronization of database for each resource/platform #16

Closed
josvandervelde opened this issue Mar 22, 2023 · 0 comments

For each platform/resource combination (that is, for each ResourceConnector), we want to run a separate script that keeps the database of our Metadata Catalogue in sync with the metadata of that platform. The script should run at a fixed interval of X seconds (probably minutes).

Main considerations:

  1. It should be easy to implement a new connector
  2. It should be easy to monitor the connectors
  3. It should be easy to retry synchronizing resources that threw an error

This could be placed in a separate repository. To keep it simple, let's keep it in the current repo for now.

  • Create src/connectors/synchronization.py
    • It should expect command line arguments:
      • from: datetime | None - only relevant for the first run. The first run will start with this datetime.
      • connector: str - the path to the ResourceConnector
      • connect-db (either connect-db or connect-url must be present) - if present, the database of the Metadata Catalogue will be updated directly
      • connect-url (either connect-db or connect-url must be present): str - if present, the Metadata Catalogue will be updated using the REST API.
      • working-dir: str - a path. This will contain a subdirectory for this ResourceConnector with
        • the logging
        • a .csv with the failed resources (datetime, identifier, reason)
        • a file containing the next datetime from which the data should be retrieved
    • Each run should cover the period from the datetime stored in the working-dir/last-datetime file
      (or, if that file does not exist, from the "from" command line argument) up to the current datetime.
      If this is the first run, the period should be split into batches.
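
A minimal sketch of the argument parsing and batching described above. Assumptions that are not part of this issue: the arguments are passed as `--long-options`, `connect-db` is a plain flag, exactly one of `connect-db` / `connect-url` is given, datetimes are ISO 8601 strings, and the helper names (`parse_args`, `read_start`, `batches`) are made up for illustration.

```python
import argparse
from datetime import datetime, timedelta
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line arguments listed in this issue."""
    parser = argparse.ArgumentParser(description="Synchronize one ResourceConnector")
    # "from" is a Python keyword, so the value is stored as "from_".
    parser.add_argument("--from", dest="from_", type=datetime.fromisoformat, default=None)
    parser.add_argument("--connector", required=True)
    # Assumption: exactly one of the two targets must be chosen.
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("--connect-db", action="store_true")
    target.add_argument("--connect-url")
    parser.add_argument("--working-dir", required=True, type=Path)
    return parser.parse_args(argv)


def read_start(state_file, fallback):
    """Return the datetime in the last-datetime file, else the "from" argument."""
    if state_file.exists():
        return datetime.fromisoformat(state_file.read_text().strip())
    if fallback is None:
        raise SystemExit("no last-datetime file found and no --from given")
    return fallback


def batches(start, end, size):
    """Yield half-open (from_incl, to_excl) windows that cover start..end."""
    current = start
    while current < end:
        upper = min(current + size, end)
        yield current, upper
        current = upper
```

A run would then loop over `batches(read_start(...), datetime.now(), some_batch_size)` and write the upper bound of each completed batch back to the last-datetime file, so that an interrupted run resumes where it stopped. The batch size is left open here, since the issue does not specify it.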
  • Update the resource connectors.
    • They should have a fetch method with parameters from_incl and to_excl
    • The fetch should return the same as it does now, but may also return a FailedResource with an identifier and a reason
    • They should have a retry method with parameter identifier
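
The connector interface proposed above could look roughly like this. It is a sketch, not the existing class in the repo: the abstract base class is a stand-in, `Resource` is a placeholder type, and `InMemoryConnector` is a hypothetical toy connector that only exists to show the half-open `from_incl` / `to_excl` window and the `FailedResource` return value.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime
from typing import Iterator, Union


@dataclass
class FailedResource:
    identifier: str
    reason: str


Resource = dict  # placeholder for whatever the connectors return today


class ResourceConnector(ABC):
    """Stand-in for the repo's connector base class (assumed shape)."""

    @abstractmethod
    def fetch(self, from_incl: datetime, to_excl: datetime) -> Iterator[Union[Resource, FailedResource]]:
        """Yield resources changed in [from_incl, to_excl), or failures."""

    @abstractmethod
    def retry(self, identifier: str) -> Union[Resource, FailedResource]:
        """Fetch a single resource again after an earlier failure."""


class InMemoryConnector(ResourceConnector):
    """Hypothetical connector backed by a dict: identifier -> (modified, payload)."""

    def __init__(self, records):
        self.records = records

    def _convert(self, identifier):
        _, payload = self.records[identifier]
        if payload is None:
            # A failure is returned rather than raised, so one bad record
            # does not abort the whole batch.
            return FailedResource(identifier, "empty payload")
        return {"identifier": identifier, **payload}

    def fetch(self, from_incl, to_excl):
        for identifier, (modified, _) in self.records.items():
            if from_incl <= modified < to_excl:
                yield self._convert(identifier)

    def retry(self, identifier):
        return self._convert(identifier)
```

Returning failures as values keeps the synchronization loop simple: it can write every `FailedResource` to the .csv and carry on with the next resource.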
  • Create a simple synchronize.sh that:
    • will be run from a crontab
    • takes the same arguments as the .py
    • takes a file lock, so that the process cannot run more than once at the same time
    • calls the .py
    • logs into the working directory. This should be a separate log, containing only "Starting run", "Run ended" and "Run not possible because another process is already running"
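
To illustrate the locking and logging behaviour of the wrapper, here is the same logic sketched in Python instead of shell (the real synchronize.sh would be a shell script). It uses `fcntl.flock`, so it is Unix-only. The file names `synchronize.lock` and `wrapper.log` and the function names are assumptions for this sketch.

```python
import fcntl
import subprocess
from datetime import datetime
from pathlib import Path


def log(working_dir, message):
    """Append one timestamped line to the separate wrapper log."""
    with open(working_dir / "wrapper.log", "a") as f:
        f.write(f"{datetime.now().isoformat()} {message}\n")


def run_locked(working_dir, command):
    """Run command unless another run already holds the lock file."""
    working_dir.mkdir(parents=True, exist_ok=True)
    with open(working_dir / "synchronize.lock", "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fails at once if already taken.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            log(working_dir, "Run not possible because another process is already running")
            return 1
        log(working_dir, "Starting run")
        try:
            return subprocess.call(command)
        finally:
            log(working_dir, "Run ended")
    # The lock is released when lock_file is closed by the with-block.
```

Because the lock is tied to the open file, it is also released if the process crashes, so a stale lock file on disk does not block the next cron run.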
  • Create src/connectors/retry.py (can be left out of scope for the first PR)
    • This should retry all failures that are present in the failed resources .csv.
    • It should expect command line arguments:
      • connector
      • connect-db
      • connect-url
      • working-dir
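
A rough sketch of the retry step, assuming the failed resources .csv has no header row and holds the three columns mentioned above (datetime, identifier, reason). The `save` callback stands in for "update the Metadata Catalogue via the database or the REST API", and `retry_failures` is a hypothetical name. `FailedResource` is repeated here only to keep the example self-contained.

```python
import csv
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path


@dataclass
class FailedResource:
    identifier: str
    reason: str


def retry_failures(connector, csv_path, save):
    """Retry every identifier in the failures .csv; return how many still fail."""
    with csv_path.open(newline="") as f:
        rows = list(csv.reader(f))

    still_failing = []
    for _failed_at, identifier, _reason in rows:
        result = connector.retry(identifier)
        if isinstance(result, FailedResource):
            # Keep the failure, with a fresh timestamp and the latest reason.
            still_failing.append([datetime.now().isoformat(), result.identifier, result.reason])
        else:
            save(result)

    # Rewrite the file so that it only lists resources that are still failing.
    with csv_path.open("w", newline="") as f:
        csv.writer(f).writerows(still_failing)
    return len(still_failing)
```

Rewriting the file after each pass keeps it a complete list of outstanding failures, which also helps with the monitoring consideration.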