Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: CatalogManager refactor, decoupling sql processors, caches, and internal sql backend tables #220

Merged
merged 89 commits into from
May 15, 2024

Conversation

aaronsteers
Copy link
Contributor

@aaronsteers aaronsteers commented May 8, 2024

This PR does a lot of things...

CatalogManager refactor

The role of catalog manager is refactored into the new classes, under the role of two "backends", two "providers", and a "writer" (for state).

  • StateBackend - Responsible for serializing and deserializing state in the internal SQL tables.
  • CatalogBackend - Responsible for serializing and deserializing catalog info in the internal SQL tables.
  • StateProvider - Provides state inputs to anything that needs state - namely to Source objects, which use the state to know where to start their sync.
  • CatalogProvider - Provides catalog schema info to any method that needs access to catalog metadata.
  • StateWriter - There are two implementations:
    • StdOutStateWriter Performs the role as a destination would - which is to simply print the state message to STDOUT.
    • SqlStateWriter - Writes state messages to the internal SQL table, which is how a Cache should behave.

With these class in mind, the "Backend" classes generally are able to create "provider" classes when they are the source of the state or catalog. Also, much of the parsing logic for how to handle catalogs was able to be moved out of the processor and cache classes, and into the CatalogProvider class - for instance, getting the primary keys of a stream or getting a stream's properties.

  • Refactor CatalogManager to isolate StateManager functions in a new class.
  • Create abstract base classes of both of the above so we can decouple them from their SQL-based serialization implementations.
  • Create simple versions of these two classes that simply accept input from upstream.
  • Allow creation of SQLProcessor objects with State and Catalog artifacts provided explicitly, rather than being read/written to internal SQL tables.
  • Move classes that we want to move to the CDK into a new _future_cdk module. These would be internal classes anyway, but the goal is to move these abstract implementations upstream so destinations can depend upon them.

Decoupling SQLProcessor classes from SQLBackend and Cache classes.

We want to be able to use SQLProcessor classes in destinations like the Cortext destination, and so this PR decouples SQLProcessor from Cache and Backend classes (previously the CatalogManager). Now, instead of passing a backend to the SQLProcessor, we simply pass a CatalogProvider and StateWriter. When a state provider is not explicitly created, the SQLProcessor class will just create its own StdOutStateWriter class, and will behave like a destination. (See the Cortext sample script for an example that doesn't require a cache.)

One last change was that, since we don't want SQLProcessor to depend on Cache class, we needed a different way to provide the inputs that previously were mapped to an embedded cache property in the SQLProcessor classes. Now, we decouple the functionality of configuration into a new set of SQLConfig classes - these are basically the just the user inputs that would be provided via the Cache, but without the behavioral traits of a full Cache class. SQLConfig objects know their properties and they know how to create SQLAlchemy connections (as well as 3rd-party vendor connections, when applicable), but they are lightweight and can be handed down to the SQLProcessor classes. They can also be imported or implemented by destinations, without the need to create a full cache class.

For backwards compatibility and ease-of-use, the Cache classes inherit from the respective SQLConfig classes, so they accept the same config inputs in their constructor as they did previous to this PR (no breaking change) and they can pass themselves to the SQLProcessor classes as a valid subclass of SQLConfig. Typing prevents the processor from performing cache-related functions on the SQLConfig instances, even though the object may also be a Cache. This ensures that we don't end up with any circular dependencies between the role of the Cache and the SQLProcessor, and it also ensures that destination connectors that use the SQLProcessor class can send the lighter-weight SQLConfig object to the SQLProcessor initializer, and they won't need to create a full Cache object.

Eligible for CDK Re-Use

  • RecordProcessor
  • SQLProcessor
  • CatalogProvider
  • StateProvider
  • StateWriter
  • StdOutStateWriter

@bindipankhudi
Copy link
Contributor

Finished taking a pass at the code. The refactoring is easy to follow and makes sense to me. It is def much more intuitive and cleaner to have a separate statemanager and catalogmanager. This would make the cortex processor implementation much cleaner!!

@aaronsteers
Copy link
Contributor Author

Tests are passing except one new issue with Windows paths. That will be fixed shortly, but it doesn't change anything in the core code.

Copy link
Contributor

@bindipankhudi bindipankhudi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me! Thank you for the PR walkthrough! 👍

@aaronsteers aaronsteers enabled auto-merge (squash) May 15, 2024 16:57
@aaronsteers
Copy link
Contributor Author

aaronsteers commented May 15, 2024

/poetry-lock

poetry lock job started... Check job output.

poetry lock applied successfully.

@aaronsteers
Copy link
Contributor Author

aaronsteers commented May 15, 2024

/poetry-lock

poetry lock job started... Check job output.

poetry lock applied successfully.

@aaronsteers aaronsteers merged commit e82d37c into main May 15, 2024
15 checks passed
@aaronsteers aaronsteers deleted the aj/feat/read-only-caches branch May 15, 2024 19:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants