Feat: CatalogManager refactor, decoupling sql processors, caches, and internal sql backend tables #220

aaronsteers · 2024-05-08T01:49:47Z

This PR does a lot of things...

CatalogManager refactor

The catalog manager's responsibilities are split across new classes: two "backends", two "providers", and a "writer" (for state).

StateBackend - Responsible for serializing and deserializing state in the internal SQL tables.
CatalogBackend - Responsible for serializing and deserializing catalog info in the internal SQL tables.
StateProvider - Provides state inputs to anything that needs state - namely to Source objects, which use the state to know where to start their sync.
CatalogProvider - Provides catalog schema info to any method that needs access to catalog metadata.
StateWriter - There are two implementations:
- StdOutStateWriter - Performs the role a destination would, which is to simply print the state message to STDOUT.
- SqlStateWriter - Writes state messages to the internal SQL table, which is how a Cache should behave.
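
As a rough illustration of the two writer roles listed above, the sketch below shows a shared interface with one implementation that prints state and one that persists it to a SQL table. The class names ending in `Sketch`, the `write_state` method and its arguments, the `_example_state` table, and the use of `sqlite3` are all assumptions made for this example, not the signatures introduced by this PR.

```python
from __future__ import annotations

import abc
import io
import json
import sqlite3
import sys
from typing import Any, TextIO


class StateWriterSketch(abc.ABC):
    """Hypothetical interface: accepts a stream's state and puts it somewhere."""

    @abc.abstractmethod
    def write_state(self, stream_name: str, state: dict[str, Any]) -> None:
        """Persist or emit the latest state for one stream."""


class StdOutStateWriterSketch(StateWriterSketch):
    """Emits state as one JSON line per message, as a destination would."""

    def __init__(self, stream: TextIO | None = None) -> None:
        # Defaults to STDOUT; a different stream can be injected for testing.
        self._stream = stream if stream is not None else sys.stdout

    def write_state(self, stream_name: str, state: dict[str, Any]) -> None:
        self._stream.write(json.dumps({"stream": stream_name, "state": state}) + "\n")


class SqlStateWriterSketch(StateWriterSketch):
    """Stores the latest state per stream in a SQL table, as a cache would."""

    def __init__(self, conn: sqlite3.Connection, table_name: str = "_example_state") -> None:
        self._conn = conn
        self._table = table_name  # hypothetical table name, trusted input only
        self._conn.execute(
            f"CREATE TABLE IF NOT EXISTS {self._table} "
            "(stream_name TEXT PRIMARY KEY, state_json TEXT)"
        )

    def write_state(self, stream_name: str, state: dict[str, Any]) -> None:
        # A newer state for the same stream replaces the older row.
        self._conn.execute(
            f"INSERT OR REPLACE INTO {self._table} (stream_name, state_json) VALUES (?, ?)",
            (stream_name, json.dumps(state)),
        )
        self._conn.commit()


# Usage: capture the "STDOUT" output in a buffer, and store state in an in-memory table.
buffer = io.StringIO()
StdOutStateWriterSketch(buffer).write_state("users", {"cursor": 5})

conn = sqlite3.connect(":memory:")
sql_writer = SqlStateWriterSketch(conn)
sql_writer.write_state("users", {"cursor": 5})
sql_writer.write_state("users", {"cursor": 9})
```

Because both writers share one interface, the caller does not need to know whether it is running inside a destination or a cache.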

With these classes in mind, the "backend" classes can generally create "provider" classes when they are the source of the state or catalog. Also, much of the catalog-parsing logic could be moved out of the processor and cache classes and into the CatalogProvider class - for instance, getting the primary keys of a stream or getting a stream's properties.
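
To make the catalog-parsing role concrete, here is a minimal sketch of a provider that answers primary-key and property lookups from a configured catalog. It assumes the catalog is a plain dict shaped like an Airbyte configured catalog; the class name `CatalogProviderSketch` and the method names `get_primary_keys` and `get_stream_properties` are hypothetical, and the signature given to `get_configured_stream_info` is a guess based only on the commit title in this PR.

```python
from __future__ import annotations

from typing import Any


class CatalogProviderSketch:
    """Hypothetical read-only view over a configured catalog (plain dicts)."""

    def __init__(self, configured_catalog: dict[str, Any]) -> None:
        # Index configured streams by name for direct lookup.
        self._streams = {
            entry["stream"]["name"]: entry for entry in configured_catalog["streams"]
        }

    @property
    def stream_names(self) -> list[str]:
        return list(self._streams)

    def get_configured_stream_info(self, stream_name: str) -> dict[str, Any]:
        if stream_name not in self._streams:
            raise KeyError(f"Stream not found in catalog: {stream_name}")
        return self._streams[stream_name]

    def get_primary_keys(self, stream_name: str) -> list[str]:
        # Primary keys are stored as a list of field paths; this sketch
        # joins nested paths with "." as a simplification.
        paths = self.get_configured_stream_info(stream_name).get("primary_key") or []
        return [".".join(path) for path in paths]

    def get_stream_properties(self, stream_name: str) -> dict[str, Any]:
        stream = self.get_configured_stream_info(stream_name)["stream"]
        return stream["json_schema"]["properties"]


# Usage with a one-stream example catalog (invented data).
catalog = {
    "streams": [
        {
            "stream": {
                "name": "users",
                "json_schema": {
                    "properties": {"id": {"type": "integer"}, "email": {"type": "string"}}
                },
            },
            "primary_key": [["id"]],
        }
    ]
}
provider = CatalogProviderSketch(catalog)
```

Keeping these lookups in one class means processors and caches can ask for primary keys or properties without knowing how the catalog is stored.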

Refactor CatalogManager to isolate StateManager functions in a new class.
Create abstract base classes of both of the above so we can decouple them from their SQL-based serialization implementations.
Create simple versions of these two classes that simply accept input from upstream.
Allow creation of SQLProcessor objects with State and Catalog artifacts provided explicitly, rather than being read/written to internal SQL tables.
Move classes that we want to move to the CDK into a new _future_cdk module. These would be internal classes anyway, but the goal is to move these abstract implementations upstream so destinations can depend upon them.

Decoupling `SQLProcessor` classes from `SQLBackend` and `Cache` classes.

We want to be able to use SQLProcessor classes in destinations like the Cortex destination, and so this PR decouples SQLProcessor from Cache and Backend classes (previously the CatalogManager). Now, instead of passing a backend to the SQLProcessor, we simply pass a CatalogProvider and StateWriter. When a state writer is not explicitly provided, the SQLProcessor class will just create its own StdOutStateWriter, and will behave like a destination. (See the Cortex sample script for an example that doesn't require a cache.)
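
The fallback behavior described above can be sketched roughly as follows. Everything here is illustrative: the constructor arguments, the `process_messages` method, the simplified message dicts, and the choice to flush records before emitting state are assumptions for the example, not the actual SQLProcessor implementation.

```python
from __future__ import annotations

import json
from typing import Any, Iterable, Protocol


class StateWriterLike(Protocol):
    """Hypothetical minimal interface the processor needs from a state writer."""

    def write_state(self, state: dict[str, Any]) -> None: ...


class StdOutStateWriterSketch:
    """Prints each state message as a JSON line, as a destination would."""

    def write_state(self, state: dict[str, Any]) -> None:
        print(json.dumps({"type": "STATE", "state": state}))


class ListStateWriterSketch:
    """Collects state in memory; stands in for a SQL-backed writer here."""

    def __init__(self) -> None:
        self.written: list[dict[str, Any]] = []

    def write_state(self, state: dict[str, Any]) -> None:
        self.written.append(state)


class SqlProcessorSketch:
    def __init__(
        self,
        *,
        catalog_provider: Any,
        state_writer: StateWriterLike | None = None,
    ) -> None:
        self.catalog_provider = catalog_provider
        # No writer supplied: fall back to STDOUT and behave like a destination.
        self.state_writer = (
            state_writer if state_writer is not None else StdOutStateWriterSketch()
        )
        self.pending_records: list[dict[str, Any]] = []
        self.flushed_count = 0

    def process_messages(self, messages: Iterable[dict[str, Any]]) -> None:
        for message in messages:
            if message["type"] == "RECORD":
                self.pending_records.append(message["record"])
            elif message["type"] == "STATE":
                # Sketch-only choice: flush first so state never gets ahead of data.
                self._flush_records()
                self.state_writer.write_state(message["state"])

    def _flush_records(self) -> None:
        # A real processor would write the pending records to SQL here.
        self.flushed_count += len(self.pending_records)
        self.pending_records.clear()


# Usage: an explicit writer is used when given; otherwise STDOUT is the default.
collector = ListStateWriterSketch()
processor = SqlProcessorSketch(catalog_provider=None, state_writer=collector)
processor.process_messages(
    [
        {"type": "RECORD", "record": {"id": 1}},
        {"type": "STATE", "state": {"cursor": 1}},
    ]
)
default_processor = SqlProcessorSketch(catalog_provider=None)
```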

One last change: since we don't want SQLProcessor to depend on the Cache class, we needed a different way to provide the inputs that were previously mapped to an embedded cache property in the SQLProcessor classes. Now, we decouple the configuration functionality into a new set of SQLConfig classes - these are basically just the user inputs that would be provided via the Cache, but without the behavioral traits of a full Cache class. SQLConfig objects know their properties and they know how to create SQLAlchemy connections (as well as 3rd-party vendor connections, when applicable), but they are lightweight and can be handed down to the SQLProcessor classes. They can also be imported or implemented by destinations, without the need to create a full cache class.

For backwards compatibility and ease-of-use, the Cache classes inherit from the respective SQLConfig classes, so they accept the same config inputs in their constructor as they did previous to this PR (no breaking change) and they can pass themselves to the SQLProcessor classes as a valid subclass of SQLConfig. Typing prevents the processor from performing cache-related functions on the SQLConfig instances, even though the object may also be a Cache. This ensures that we don't end up with any circular dependencies between the role of the Cache and the SQLProcessor, and it also ensures that destination connectors that use the SQLProcessor class can send the lighter-weight SQLConfig object to the SQLProcessor initializer, and they won't need to create a full Cache object.
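
A minimal sketch of that inheritance arrangement is shown below. The class and field names are hypothetical, and the standard-library `sqlite3` module stands in for the SQLAlchemy connections the real SQLConfig classes create.

```python
from __future__ import annotations

import sqlite3
from dataclasses import dataclass


@dataclass
class SqlConfigSketch:
    """Hypothetical lightweight config: user inputs plus connection creation."""

    db_path: str = ":memory:"
    schema_name: str = "main"

    def get_connection(self) -> sqlite3.Connection:
        # Stand-in for building a SQLAlchemy engine or connection.
        return sqlite3.connect(self.db_path)


@dataclass
class CacheSketch(SqlConfigSketch):
    """Hypothetical cache: the same config inputs plus cache-only behavior."""

    cache_dir: str = ".cache"

    def describe(self) -> str:
        # Cache-specific behavior that the processor never sees through its typing.
        return f"cache at {self.cache_dir} using schema {self.schema_name}"


class ProcessorSketch:
    """Typed against the config only, so it cannot rely on cache behavior."""

    def __init__(self, sql_config: SqlConfigSketch) -> None:
        self.sql_config = sql_config

    def ping(self) -> int:
        conn = self.sql_config.get_connection()
        try:
            return conn.execute("SELECT 1").fetchone()[0]
        finally:
            conn.close()


# Usage: a destination passes the bare config; a cache passes itself.
from_destination = ProcessorSketch(SqlConfigSketch())
from_cache = ProcessorSketch(CacheSketch(schema_name="analytics"))
```

Because the processor is annotated with the config type, a static type checker would flag any call to a cache-only method such as `describe()` on `self.sql_config`, even when the object is a cache at runtime.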

Eligible for CDK Re-Use

RecordProcessor
SQLProcessor
CatalogProvider
StateProvider
StateWriter
StdOutStateWriter


airbyte/_future_cdk/__init__.py

airbyte/_future_cdk/catalog_manager.py

bindipankhudi · 2024-05-08T18:05:02Z

Finished taking a pass at the code. The refactoring is easy to follow and makes sense to me. It is def much more intuitive and cleaner to have a separate statemanager and catalogmanager. This would make the cortex processor implementation much cleaner!!

aaronsteers · 2024-05-14T20:57:03Z

Tests are passing except one new issue with Windows paths. That will be fixed shortly, but it doesn't change anything in the core code.

examples/run_snowflake_cortex_test_data.py

airbyte/_future_cdk/record_processor.py

airbyte/_future_cdk/sql_processor.py

airbyte/caches/_catalog_backend.py

bindipankhudi

Looks great to me! Thank you for the PR walkthrough! 👍


aaronsteers · 2024-05-15T18:59:32Z

/poetry-lock

poetry lock job started... Check job output.

✅ poetry lock applied successfully.

aaronsteers · 2024-05-15T19:20:54Z

/poetry-lock

poetry lock job started... Check job output.

✅ poetry lock applied successfully.


aaronsteers added 12 commits May 7, 2024 15:57

refactor: move sql processor to _future_cdk

03fbad2

fix refs

d97789e

decouple state manager from catalog manager, and refactor common abstract base class elements

f9f9a25

finish refactor abstract state manager

32ae58b

continued refactoring

3d4a43d

continued refactoring

81b0181

fix properties

6e306e8

fix abstract as abstract

e1bee09

rename get_configured_catalog_info->get_configured_stream_info

6e6115b

more fixes + cleanup

f3ec86a

more refactoring

ebbf048

fix import issue

99929c3

bindipankhudi reviewed May 8, 2024


airbyte/_future_cdk/__init__.py

bindipankhudi reviewed May 8, 2024


airbyte/_future_cdk/catalog_manager.py (outdated)

aaronsteers added 15 commits May 8, 2024 11:43

small tweaks

7fbb9b4

minor cleanup

1b53ea6

refactor "manager" to "backend"

e623fcc

move 'register_source' out of processor class

105260d

clean up state writer

3825666

fix some failures

79cf514

fix up state writer

a37dc4a

working sync 🎉

7100535

add poe and coverage as dev dependencies

acd05c5

docs: add contributor guide for testing coverage

5ae297b

ci: upload coverage reports as github ci artifacts

8e89561

add tasks: coverage-report, coverage-html

0960b19

ci: always upload test coverage reports

d33923a

fix stale test ref

eb1a033

more poe tasks for check and fix

44eb95a