change(taps): Don't resend SCHEMA messages if they match the most recent schema sent
#1061
Comments
@edgarrmondragon - Can you confirm that the current behavior is accurately described above? |
@aaronsteers Maybe we could make it clear that it's the child's |
@edgarrmondragon - That could probably work... I think generally the methods to send I was leaning towards a class-level cache or a global cache for deduping, since that keeps the implementation mostly unchanged otherwise. |
@aaronsteers a class-level cache makes sense, maybe by hashing the schema in JSON string form with sorted keys? Otherwise hashing a dictionary may have bad performance and not be reliable. |
For a higher-frequency message, I would be more wary of performance. But because In terms of reliability, there may be better means, but in the past I've had good experience with |
Monkey-patching solution:
    class BaseStream(Stream):
        def _write_state_message(self) -> None:
            # No-op override: this stream writes nothing when asked for a STATE message.
            pass
My problem is with |
Just chiming in here after hitting this problem with |
This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the |
Feature scope
Taps (catalog, state, stream maps, etc.)
Problem Description
For taps with parent-child relationships, I believe we currently send a new SCHEMA message at the start of each child stream from a given context.
For streams that have a relatively low number of child items per parent ('low' meaning anything less than 1K or 10K), the additional SCHEMA messages may be unnecessarily hinting to targets that they should flush/drain their caches.
Proposal Description
The proposal here is to maintain some type of cache of the last-sent SCHEMA message per stream_name or stream_alias (if stream maps are applied), and then to skip sending the SCHEMA message if the new schema exactly matches the last one that was sent.
Target-side mitigation
While sending extra SCHEMA messages from the tap is problematic in many cases because it triggers unwanted behaviors in targets, there is a target-side mitigation as well: perform the same check on the target side and only flush batches if the new SCHEMA message differs from the prior SCHEMA message received for that stream name.
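Both the tap-side proposal and the target-side mitigation come down to the same check: has the schema for this stream name changed since the last SCHEMA message? Below is a minimal sketch of that check, following the idea raised in the comments of hashing the schema as a JSON string with sorted keys. The names SchemaDeduper, fingerprint, and is_new are hypothetical and are not part of any SDK API. How the check would be wired into a tap or target is left open.

```python
import hashlib
import json
from typing import Any, Dict


class SchemaDeduper:
    """Hypothetical helper: remembers a fingerprint of the last schema per stream."""

    def __init__(self) -> None:
        # Maps stream name (or alias) -> fingerprint of the last schema seen.
        self._last_seen: Dict[str, str] = {}

    @staticmethod
    def fingerprint(schema: Dict[str, Any]) -> str:
        # sort_keys=True makes the serialized form independent of dict key order,
        # so two equal schemas always produce the same digest.
        canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def is_new(self, stream_name: str, schema: Dict[str, Any]) -> bool:
        """Return True (and record the schema) only if it differs from the last one."""
        digest = self.fingerprint(schema)
        if self._last_seen.get(stream_name) == digest:
            return False
        self._last_seen[stream_name] = digest
        return True
```

Under these assumptions, a tap would emit a SCHEMA message only when is_new(...) returns True, and a target would flush its batch for a stream only in that same case. The cache stores one digest per stream name, so its memory cost does not grow with the number of parent contexts.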