-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: CatalogManager refactor, decoupling sql processors, caches, and internal sql backend tables #220
Conversation
…ract base class elements
Finished taking a pass at the code. The refactoring is easy to follow and makes sense to me. It is def much more intuitive and cleaner to have a separate statemanager and catalogmanager. This would make the cortex processor implementation much cleaner!! |
Tests are passing except one new issue with Windows paths. That will be fixed shortly, but it doesn't change anything in the core code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me! Thank you for the PR walkthrough! 👍
/poetry-lock
|
/poetry-lock
|
This PR does a lot of things...
CatalogManager refactor
The role of catalog manager is refactored into the new classes, under the role of two "backends", two "providers", and a "writer" (for state).
StateBackend
- Responsible for serializing and deserializing state in the internal SQL tables.CatalogBackend
- Responsible for serializing and deserializing catalog info in the internal SQL tables.StateProvider
- Provides state inputs to anything that needs state - namely toSource
objects, which use the state to know where to start their sync.CatalogProvider
- Provides catalog schema info to any method that needs access to catalog metadata.StateWriter
- There are two implementations:StdOutStateWriter
Performs the role as a destination would - which is to simply print the state message toSTDOUT
.SqlStateWriter
- Writes state messages to the internal SQL table, which is how a Cache should behave.With these class in mind, the "Backend" classes generally are able to create "provider" classes when they are the source of the state or catalog. Also, much of the parsing logic for how to handle catalogs was able to be moved out of the processor and cache classes, and into the
CatalogProvider
class - for instance, getting the primary keys of a stream or getting a stream's properties._future_cdk
module. These would be internal classes anyway, but the goal is to move these abstract implementations upstream so destinations can depend upon them.Decoupling
SQLProcessor
classes fromSQLBackend
andCache
classes.We want to be able to use
SQLProcessor
classes in destinations like the Cortext destination, and so this PR decouplesSQLProcessor
fromCache
andBackend
classes (previously theCatalogManager
). Now, instead of passing a backend to theSQLProcessor
, we simply pass aCatalogProvider
andStateWriter
. When a state provider is not explicitly created, theSQLProcessor
class will just create its ownStdOutStateWriter
class, and will behave like a destination. (See the Cortext sample script for an example that doesn't require a cache.)One last change was that, since we don't want
SQLProcessor
to depend onCache
class, we needed a different way to provide the inputs that previously were mapped to an embeddedcache
property in theSQLProcessor
classes. Now, we decouple the functionality of configuration into a new set ofSQLConfig
classes - these are basically the just the user inputs that would be provided via theCache
, but without the behavioral traits of a fullCache
class.SQLConfig
objects know their properties and they know how to createSQLAlchemy
connections (as well as 3rd-party vendor connections, when applicable), but they are lightweight and can be handed down to the SQLProcessor classes. They can also be imported or implemented by destinations, without the need to create a full cache class.For backwards compatibility and ease-of-use, the
Cache
classes inherit from the respectiveSQLConfig
classes, so they accept the same config inputs in their constructor as they did previous to this PR (no breaking change) and they can pass themselves to theSQLProcessor
classes as a valid subclass ofSQLConfig
. Typing prevents the processor from performing cache-related functions on theSQLConfig
instances, even though the object may also be aCache
. This ensures that we don't end up with any circular dependencies between the role of theCache
and theSQLProcessor
, and it also ensures that destination connectors that use theSQLProcessor
class can send the lighter-weightSQLConfig
object to theSQLProcessor
initializer, and they won't need to create a fullCache
object.Eligible for CDK Re-Use
RecordProcessor
SQLProcessor
CatalogProvider
StateProvider
StateWriter
StdOutStateWriter